[Swift-user] Block task failed: Connection to worker lost

Ozik, Jonathan jozik at anl.gov
Thu Dec 4 17:57:01 CST 2014


I don’t see a definition in the user guide for the “staging: direct” option that’s included in the swift.conf file Yadu provided. I’m having a path name issue and I suspect it is related to the staging, but I’m not sure.

Suppose I pass a “-upf=filename.txt” command-line argument to a Swift script that includes the lines:

string upf_str = @arg("upf","unrolledParamFile.txt");
file params_file <single_file_mapper;file=upf_str>;

If I call filename(params_file), would I get “filename.txt” with the default staging and the full path of filename.txt with the “direct” staging? Or is this a change between 0.95 RC5 and trunk?

Jonathan

> On Dec 4, 2014, at 2:58 PM, Ozik, Jonathan <jozik at anl.gov> wrote:
> 
> Thank you all,
> 
> The job is queued up now. I’ll update on the results.
> 
> Jonathan
> 
>> On Dec 4, 2014, at 1:33 PM, Michael Wilde <wilde at anl.gov> wrote:
>> 
>> We should (and will) add a getcwd( ) library function to eliminate this 
>> particular need for java( ), though.
>> 
>> - Mike
>> 
>> 
>> On 12/4/14 1:23 PM, Yadu Nand Babuji wrote:
>>> Hi Jonathan,
>>> 
>>> I rebuilt the trunk package with Mihael's fixes, and you can get it from
>>> here:
>>> http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
>>> 
>>> -Yadu
>>> 
>>> On 12/04/2014 01:01 PM, Mihael Hategan wrote:
>>>> Hi Jonathan,
>>>> 
>>>> I fixed this in GIT. Yadu, can you compile the latest GIT please?
>>>> 
>>>> Mihael
>>>> 
>>>> On Thu, 2014-12-04 at 18:33 +0000, Ozik, Jonathan wrote:
>>>>> Hi Yadu,
>>>>> 
>>>>> I’ve tried running with trunk and am getting a strange Java error this time:
>>>>> No method: getProperty in java.lang.System with parameter types[class java.lang.String]
>>>>> swiftscript:java @ repast, line: 267
>>>>> 
>>>>> at org.griphyn.vdl.karajan.lib.swiftscript.Java.getMethod(Java.java:192)
>>>>> at org.griphyn.vdl.karajan.lib.swiftscript.Java.function(Java.java:162)
>>>>> at org.griphyn.vdl.karajan.lib.SwiftFunction.runBody(SwiftFunction.java:77)
>>>>> at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:175)
>>>>> at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
>>>>> at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
>>>>> at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
>>>>> at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
>>>>> at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
>>>>> at org.globus.cog.karajan.compiled.nodes.Sequential.run(Sequential.java:41)
>>>>> at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
>>>>> at org.globus.cog.karajan.compiled.nodes.UParallel$1.run(UParallel.java:91)
>>>>> at k.thr.LWThread.run(LWThread.java:247)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>> 
>>>>> Execution failed:
>>>>> Error attempting to use: java.lang.System
>>>>> swiftscript:java @ repast, line: 267
>>>>> 
>>>>> I think this is being triggered by the call:
>>>>> string s = strcat(java("java.lang.System","getProperty","user.dir"),"/");
>>>>> 
>>>>> Which worked just fine with 0.95 RC5.
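
For reference, here is a minimal plain-Java sketch of what that call asks for. It assumes the java() built-in resolves the method reflectively from the argument types, which the "No method ... with parameter types" message suggests but does not confirm. The class name GetPropertyCheck and the helper workingDirWithSlash are hypothetical, chosen only for illustration. In plain Java the lookup succeeds, so the method itself is not missing from java.lang.System.

```java
import java.lang.reflect.Method;

public class GetPropertyCheck {
    // Hypothetical helper: look up System.getProperty(String) by reflection,
    // invoke it for "user.dir", and append "/" as the strcat call does.
    static String workingDirWithSlash() throws Exception {
        Method m = System.class.getMethod("getProperty", String.class);
        // getProperty is static, so the receiver argument is null.
        String dir = (String) m.invoke(null, "user.dir");
        return dir + "/";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(workingDirWithSlash());
    }
}
```
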
>>>>> 
>>>>> Any thoughts?
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> On Dec 4, 2014, at 11:14 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>>>> 
>>>>> Hi Jonathan,
>>>>> 
>>>>> If your config file is named swift.conf and is in the current directory, it will be selected automatically and you needn't specify
>>>>> it on the command line. Otherwise, specify the config file using the -config option:
>>>>> swift -config <path_to_config> <your_script.swift>
>>>>> 
>>>>> To resume from a restart log, for example the restart.log in your run001 folder, pass it using the -resume option:
>>>>> swift -resume run001/restart.log  ...
>>>>> The restart log is from a 0.95 run, and I'm not sure it will work correctly with trunk.
>>>>> 
>>>>> There is no trunk module available on Midway, since we rebuild from source to keep up to date with changes in the codebase.
>>>>> 
>>>>> Generally you can always get the latest trunk build here (at most a week older than the last commit):
>>>>> http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
>>>>> 
>>>>> Thanks,
>>>>> Yadu
>>>>> 
>>>>> On 12/04/2014 10:48 AM, Jonathan Ozik wrote:
>>>>> Thanks Yadu,
>>>>> 
>>>>> I have a few questions.
>>>>> - How do I invoke swift and pass it the new swift.conf?
>>>>> - What is the “restart” procedure?
>>>>> - Is there a module I can load to use the latest swift trunk?
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> On Dec 3, 2014, at 7:03 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>>>> 
>>>>> Hi Jonathan,
>>>>> 
>>>>> I believe some of the timeout issues seen in your logs are fixed, or at least less likely, in trunk,
>>>>> and I would recommend that you try a run with that. I've also converted your swift.properties to
>>>>> the new swift.conf format. You can get a tested .conf file along with a small test case from here:
>>>>> 
>>>>> http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz
>>>>> 
>>>>> Here are some changes I've made to the conf:
>>>>> - lazyErrors: true and executionRetries: 0, so that long-running jobs are not retried.
>>>>> - staging set to direct, since you are running on the shared FS.
>>>>> - Added worker logging and an app definition for debug.
>>>>> 
>>>>> You can get the latest trunk build from here: http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz
>>>>> 
>>>>> Thanks,
>>>>> Yadu
>>>>> 
>>>>> On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
>>>>> Hi Yadu,
>>>>> 
>>>>> The tar.gz archive is here: https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
>>>>> I’m also attaching the swift.properties file that I used below.
>>>>> 
>>>>> Thank you,
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
>>>>> 
>>>>> Hi Jonathan,
>>>>> 
>>>>> The issue you are seeing sounds pretty close to what David reported a
>>>>> while back.
>>>>> Could you send us a tarball of your run directory from a failed run?
>>>>> 
>>>>> Could you also check whether you've set lowOverAllocation and
>>>>> highOverAllocation in your sites definition?
>>>>> 
>>>>> Thanks,
>>>>> Yadu
>>>>> 
>>>>> On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
>>>>> Hi all,
>>>>> 
>>>>> I’m trying to run a large set of simulations on Midway using Swift 0.95-RC5.
>>>>> 768 of the 2187 tasks completed successfully and then I got the exception:
>>>>> 
>>>>> exception @ swift-int.k, line: 530
>>>>> Caused by: Block task failed: Connection to worker lost
>>>>> org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=141203-145449.325, now=141203-145649.844, channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
>>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
>>>>> at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
>>>>> at java.util.TimerThread.mainLoop(Timer.java:555)
>>>>> at java.util.TimerThread.run(Timer.java:505)
>>>>> 
>>>>> Progress: Wed, 03 Dec 2014 14:59:51+0000  Submitted:651  Failed:6  Finished successfully:768  Failed but can retry:762
>>>>> Progress: Wed, 03 Dec 2014 14:59:52+0000  Submitted:651  Failed:44  Finished successfully:768  Failed but can retry:724
>>>>> 
>>>>> And the process seems to have stopped.
>>>>> 
>>>>> What log file would be helpful for diagnosing this?
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Swift-user mailing list
>>>>> Swift-user at ci.uchicago.edu
>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>> 
>> -- 
>> Michael Wilde
>> Mathematics and Computer Science          Computation Institute
>> Argonne National Laboratory               The University of Chicago