[Swift-user] Block task failed: Connection to worker lost
Yadu Nand Babuji
yadunand at uchicago.edu
Thu Dec 4 11:14:51 CST 2014
Hi Jonathan,
If your config file is named swift.conf and is in the current directory,
it will be selected automatically and you needn't specify it on the
command line. Otherwise, specify the config file using the
-config option:
swift -config <path_to_config> <your_script.swift>
To resume from a restart log, for example the restart.log in your run001
folder, pass it with the -resume option:
swift -resume run001/restart.log ...
The restart log is from a 0.95 run, and I'm not sure whether it will
work correctly with trunk.
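If you end up scripting these invocations, the two rules above can be sketched roughly as follows. This is only an illustration, not part of Swift: the helper name build_swift_command and the script name sim.swift are made up, and it assumes -config and -resume can be given on the same command line, which I have not verified.

```python
import os


def build_swift_command(script, config=None, resume_log=None, cwd="."):
    """Assemble an argument list for the swift launcher (hypothetical helper)."""
    cmd = ["swift"]
    # A swift.conf in the current directory is selected automatically,
    # so -config is only added for a config file stored elsewhere.
    if config is not None:
        default = os.path.join(cwd, "swift.conf")
        if os.path.abspath(config) != os.path.abspath(default):
            cmd += ["-config", config]
    # Assumption: -resume may be combined with -config in one invocation.
    if resume_log is not None:
        cmd += ["-resume", resume_log]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # sim.swift is a placeholder script name.
    print(" ".join(build_swift_command("sim.swift",
                                       resume_log="run001/restart.log")))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems with paths that contain spaces.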
There is no trunk module available on Midway, since we rebuild from
source to keep up to date with changes in the codebase.
You can always get the latest trunk build here (at most a week
older than the last commit):
http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
Thanks,
Yadu
On 12/04/2014 10:48 AM, Jonathan Ozik wrote:
> Thanks Yadu,
>
> I have a few questions.
> - How do I invoke swift and pass it the new swift.conf?
> - What is the “restart” procedure?
> - Is there a module I can load to use the latest swift trunk?
>
> Jonathan
>
>> On Dec 3, 2014, at 7:03 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>
>> Hi Jonathan,
>>
>> I believe some of the timeout issues seen in your logs are fixed,
>> or at least less likely, in trunk, and I would recommend that you
>> try a run with it. I've also converted your swift.properties to
>> the new swift.conf format. You can get a tested .conf file along with
>> a small test case from here:
>>
>> http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz
>>
>> Here are some changes I've made to the conf:
>> - lazyErrors: true and executionRetries: 0, so that long-running jobs
>>   are not retried.
>> - staging set to direct, since you are running on the shared FS.
>> - Added worker logging and an app definition for debugging.
>>
>> You can get the latest trunk build from here :
>> http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
>>
>> Thanks,
>> Yadu
>>
>> On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
>>> Hi Yadu,
>>>
>>> The tar.gz archive is here:
>>> https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
>>> I’m also attaching the swift.properties file that I used below.
>>>
>>> Thank you,
>>>
>>> Jonathan
>>>
>>>> On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji
>>>> <yadunand at uchicago.edu> wrote:
>>>>
>>>> Hi Jonathan,
>>>>
>>>> The issue you are seeing sounds pretty close to what David reported a
>>>> while back.
>>>> Could you send us a tarball of your run directory from a failed run?
>>>>
>>>> Could you also check whether you've set lowOverAllocation and
>>>> highOverAllocation in your sites definition?
>>>>
>>>> Thanks,
>>>> Yadu
>>>>
>>>> On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
>>>>> Hi all,
>>>>>
>>>>> I’m trying to run a large set of simulations on Midway using Swift
>>>>> 0.95-RC5.
>>>>> 768 of the 2187 tasks completed successfully and then I got the
>>>>> exception:
>>>>>
>>>>> exception @ swift-int.k, line: 530
>>>>> Caused by: Block task failed: Connection to worker lost
>>>>> org.globus.cog.coaster.TimeoutException: Channel timed out.
>>>>> lastTime=141203-145449.325, now=141203-145649.844,
>>>>> channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
>>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
>>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
>>>>> at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>> at java.util.TimerThread.run(Timer.java:505)
>>>>>
>>>>> Progress: Wed, 03 Dec 2014 14:59:51+0000 Submitted:651 Failed:6
>>>>> Finished successfully:768 Failed but can retry:762
>>>>> Progress: Wed, 03 Dec 2014 14:59:52+0000 Submitted:651 Failed:44
>>>>> Finished successfully:768 Failed but can retry:724
>>>>>
>>>>> And the process seems to have stopped.
>>>>>
>>>>> What log file would be helpful for diagnosing this?
>>>>>
>>>>> Jonathan
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Swift-user mailing list
>>>>> Swift-user at ci.uchicago.edu
>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>
>>>
>>
>