[Swift-user] Block task failed: Connection to worker lost

Ozik, Jonathan jozik at anl.gov
Thu Dec 4 12:33:23 CST 2014


Hi Yadu,

I’ve tried running with trunk and am getting a strange Java error this time:
No method: getProperty in java.lang.System with parameter types[class java.lang.String]
swiftscript:java @ repast, line: 267

at org.griphyn.vdl.karajan.lib.swiftscript.Java.getMethod(Java.java:192)
at org.griphyn.vdl.karajan.lib.swiftscript.Java.function(Java.java:162)
at org.griphyn.vdl.karajan.lib.SwiftFunction.runBody(SwiftFunction.java:77)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:175)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.Sequential.run(Sequential.java:41)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.UParallel$1.run(UParallel.java:91)
at k.thr.LWThread.run(LWThread.java:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Execution failed:
Error attempting to use: java.lang.System
swiftscript:java @ repast, line: 267

I think this is being triggered by the call:
string s = strcat(java("java.lang.System","getProperty","user.dir"),"/");

This worked just fine with 0.95-RC5.

Any thoughts?

Jonathan
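
For reference, the java(...) call in the message above amounts to a reflective lookup of the static method java.lang.System.getProperty(String). The following is a minimal sketch in plain Java, not Swift's actual implementation, that performs the same lookup with the standard reflection API. The names GetPropertyCheck and lookupUserDir are made up for illustration. If it runs on the same JVM, the method is present as expected, which would point toward how trunk matches argument types rather than toward the JVM itself.

```java
import java.lang.reflect.Method;

public class GetPropertyCheck {
    // Hypothetical helper: look up System.getProperty(String) reflectively
    // and invoke it with "user.dir", mirroring the arguments of the Swift call.
    static String lookupUserDir() {
        try {
            Class<?> cls = Class.forName("java.lang.System");
            Method m = cls.getMethod("getProperty", String.class);
            // getProperty is static, so the receiver argument is null.
            return (String) m.invoke(null, "user.dir");
        } catch (ReflectiveOperationException e) {
            // Wrap checked reflection errors so callers need no throws clause.
            throw new IllegalStateException("reflective lookup failed", e);
        }
    }

    public static void main(String[] args) {
        // Builds the same string as the Swift expression: working directory plus "/".
        System.out.println(lookupUserDir() + "/");
    }
}
```

The lookup passes String.class as the only parameter type, which corresponds to the "parameter types[class java.lang.String]" part of the error message.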

On Dec 4, 2014, at 11:14 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

If your config file is named swift.conf and is in the current directory, it will be selected automatically and you needn't specify
it on the command line. Otherwise, specify the config file using the -config option:
swift -config <path_to_config> <your_script.swift>

To resume from a restart log, for example the restart.log in your run001 folder, pass it using the -resume option:
swift -resume run001/restart.log  ...
The restart log is from a 0.95 run, and I'm not sure whether it will work correctly with trunk.

There is no trunk module available on Midway, since we rebuild from source to keep up to date with changes in the codebase.

You can always get the latest trunk build here (at most a week older than the last commit):
http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

Thanks,
Yadu

On 12/04/2014 10:48 AM, Jonathan Ozik wrote:
Thanks Yadu,

I have a few questions.
- How do I invoke swift and pass it the new swift.conf?
- What is the “restart” procedure?
- Is there a module I can load to use the latest swift trunk?

Jonathan

On Dec 3, 2014, at 7:03 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

I believe some of the timeout issues seen in your logs are fixed or less likely in trunk,
and I would recommend that you try a run with it. I've also converted your swift.properties to
the new swift.conf format. You can get a tested .conf file along with a small test case from here:

http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz

Here are some changes I've made to the conf:
- Set lazyErrors: true and executionRetries: 0 so that long-running jobs are not retried.
- Set staging to direct, since you are running on the shared FS.
- Added worker logging and an app definition for debug.

You can get the latest trunk build from here: http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

Thanks,
Yadu

On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
Hi Yadu,

The tar.gz archive is here: https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
I’m also attaching the swift.properties file that I used below.

Thank you,

Jonathan

On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

The issue you are seeing sounds pretty close to what David reported a
while back.
Could you send us a tarball of your run directory from a failed run?

Could you also check whether you've set lowOverAllocation and
highOverAllocation in your sites definition?

Thanks,
Yadu

On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
Hi all,

I’m trying to run a large set of simulations on Midway using Swift 0.95-RC5.
768 of the 2187 tasks completed successfully and then I got the exception:

exception @ swift-int.k, line: 530
Caused by: Block task failed: Connection to worker lost
org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=141203-145449.325, now=141203-145649.844, channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

Progress: Wed, 03 Dec 2014 14:59:51+0000  Submitted:651  Failed:6  Finished successfully:768  Failed but can retry:762
Progress: Wed, 03 Dec 2014 14:59:52+0000  Submitted:651  Failed:44  Finished successfully:768  Failed but can retry:724

And the process seems to have stopped.

What log file would be helpful for diagnosing this?

Jonathan
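
The lastTime and now fields in the TimeoutException above are about two minutes apart, which suggests the channel was treated as idle for roughly that long before the block was declared lost. A small sketch for computing the gap follows. It assumes the fields use a yyMMdd-HHmmss.SSS layout, which is inferred from the values and not documented in the log. ChannelGap and gapMillis are illustrative names.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ChannelGap {
    // Assumed layout of the lastTime/now fields: yyMMdd-HHmmss.SSS
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyMMdd-HHmmss.SSS");

    // Returns the elapsed time between the two log timestamps in milliseconds.
    static long gapMillis(String lastTime, String now) {
        LocalDateTime a = LocalDateTime.parse(lastTime, FMT);
        LocalDateTime b = LocalDateTime.parse(now, FMT);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        // Values copied from the exception message.
        long ms = gapMillis("141203-145449.325", "141203-145649.844");
        System.out.println(ms + " ms"); // prints "120519 ms"
    }
}
```

Comparing this gap across several failed blocks in the run logs would show whether they all time out after the same interval.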


_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
