[Swift-user] Block task failed: Connection to worker lost

Jonathan Ozik xio247 at gmail.com
Thu Dec 4 10:48:41 CST 2014


Thanks Yadu,

I have a few questions.
- How do I invoke swift and pass it the new swift.conf?
- What is the “restart” procedure?
- Is there a module I can load to use the latest swift trunk?

Jonathan

> On Dec 3, 2014, at 7:03 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
> 
> Hi Jonathan,
> 
> I believe some of the timeout issues seen in your logs are fixed, or at least less likely, in trunk,
> and I would recommend that you try a run with that. I've also converted your swift.properties to
> the new swift.conf format. You can get a tested .conf file along with a small test case from here:
> 
> http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz
> 
> Here are some changes I've made to the conf:
> - Set lazyErrors: true and executionRetries: 0 so that long-running jobs are not retried.
> - Set staging to direct, since you are running on the shared FS.
> - Added worker logging and an app definition for debugging.
> 
> You can get the latest trunk build from here: http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
> 
> Thanks,
> Yadu
> 
> On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
>> Hi Yadu,
>> 
>> The tar.gz archive is here: https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
>> I’m also attaching the swift.properties file that I used below.
>> 
>> Thank you,
>> 
>> Jonathan
>> 
>>> On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>> 
>>> Hi Jonathan,
>>> 
>>> The issue you are seeing sounds pretty close to what David reported a 
>>> while back.
>>> Could you send us a tarball of your run directory from a failed run?
>>> 
>>> Could you also check whether you've set lowOverAllocation and 
>>> highOverAllocation in your sites definition?
>>> 
>>> Thanks,
>>> Yadu
>>> 
>>> On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
>>>> Hi all,
>>>> 
>>>> I’m trying to run a large set of simulations on Midway using Swift 0.95-RC5.
>>>> 768 of the 2187 tasks completed successfully and then I got the exception:
>>>> 
>>>> exception @ swift-int.k, line: 530
>>>> Caused by: Block task failed: Connection to worker lost
>>>> org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=141203-145449.325, now=141203-145649.844, channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
>>>> at java.util.TimerThread.mainLoop(Timer.java:555)
>>>> at java.util.TimerThread.run(Timer.java:505)
>>>> 
>>>> Progress: Wed, 03 Dec 2014 14:59:51+0000  Submitted:651  Failed:6  Finished successfully:768  Failed but can retry:762
>>>> Progress: Wed, 03 Dec 2014 14:59:52+0000  Submitted:651  Failed:44  Finished successfully:768  Failed but can retry:724
>>>> 
>>>> And the process seems to have stopped.
>>>> 
>>>> Which log file would be most helpful for diagnosing this?
>>>> 
>>>> Jonathan
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
