[Swift-user] Block task failed: Connection to worker lost

Yadu Nand Babuji yadunand at uchicago.edu
Wed Dec 3 19:03:30 CST 2014


Hi Jonathan,

I believe some of the timeout issues seen in your logs are fixed, or at
least less likely, in trunk, so I would recommend trying a run with that.
I've also converted your swift.properties to the new swift.conf format.
You can get a tested .conf file, along with a small test case, from here:

http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz

Here are the changes I've made to the conf:
lazyErrors: true and executionRetries: 0, so that long-running jobs are not retried.
Staging set to direct, since you are running on the shared filesystem.
Added worker logging and an app definition for debugging.
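
For reference, a minimal sketch of how the settings listed above might look in the new swift.conf format. The site name "midway" and the placement of staging inside the site block are assumptions for illustration, and the execution, worker logging, and app settings are left out. The tested file in the package above is the one to use.

```
# Sketch only: a fragment, not a complete swift.conf.
# The site name "midway" is hypothetical.
site.midway {
    # Assumed placement: use files in place on the shared filesystem.
    staging: direct
}

# Settings named in this message.
lazyErrors: true
executionRetries: 0
```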

You can get the latest trunk build from here:
http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

Thanks,
Yadu

On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
> Hi Yadu,
>
> The tar.gz archive is here: 
> https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
> I’m also attaching the swift.properties file that I used below.
>
> Thank you,
>
> Jonathan
>
>> On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>
>> Hi Jonathan,
>>
>> The issue you are seeing sounds pretty close to what David reported a
>> while back.
>> Could you send us a tarball of your run directory from a failed run?
>>
>> Could you also check whether you've set lowOverAllocation and
>> highOverAllocation in your sites definition?
>>
>> Thanks,
>> Yadu
>>
>> On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
>>> Hi all,
>>>
>>> I’m trying to run a large set of simulations on Midway using Swift 0.95-RC5.
>>> 768 of the 2187 tasks completed successfully and then I got the exception:
>>>
>>> exception @ swift-int.k, line: 530
>>> Caused by: Block task failed: Connection to worker lost
>>> org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=141203-145449.325, now=141203-145649.844, channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
>>> at java.util.TimerThread.mainLoop(Timer.java:555)
>>> at java.util.TimerThread.run(Timer.java:505)
>>>
>>> Progress: Wed, 03 Dec 2014 14:59:51+0000  Submitted:651  Failed:6  Finished successfully:768  Failed but can retry:762
>>> Progress: Wed, 03 Dec 2014 14:59:52+0000  Submitted:651  Failed:44  Finished successfully:768  Failed but can retry:724
>>>
>>> And the process seems to have stopped.
>>>
>>> Which log file would be helpful for diagnosing this?
>>>
>>> Jonathan
>>>
>>>
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu <mailto:Swift-user at ci.uchicago.edu>
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>
>
