[Swift-user] Block task failed: Connection to worker lost

Ozik, Jonathan jozik at anl.gov
Thu Dec 4 22:11:36 CST 2014


I’ve looked more closely at the differences between the staging options and chose the “local” option for now, even though it is probably not the most efficient choice, since it creates an unnecessarily large number of copies of the input files needed for each app invocation.
Speaking of which, in the User Guide (http://swift-lang.org/guides/trunk/userguide/userguide.html), there is a section that states “The wrapper script creates the application workspace directory; places the input files for that job into the application workspace directory using either cp or ln -s (depending on a configuration option)…,” but I couldn’t find any more information on enabling the symlinking of input files. Is this associated with a specific type of staging or configuration?

Jonathan
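
[Editor's sketch] The cp versus ln -s choice quoted from the User Guide can be illustrated in plain Java. This is only a sketch of the general idea, not Swift's wrapper script: the class name StageInputSketch, the stage and runDemo methods, and the job-a/job-b directory names are all hypothetical, and nothing here says which Swift configuration option selects symlinking.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; this is not code from Swift.
final class StageInputSketch {

    // Places `input` into `workDir`, either as a full copy (like cp)
    // or as a symbolic link back to the original (like ln -s).
    static Path stage(Path input, Path workDir, boolean symlink) {
        Path target = workDir.resolve(input.getFileName());
        try {
            if (symlink) {
                Files.createSymbolicLink(target, input.toAbsolutePath());
            } else {
                Files.copy(input, target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    // Stages one input file into two job directories, once per method,
    // and reports whether each staged file is a symbolic link.
    static boolean[] runDemo() {
        try {
            Path base = Files.createTempDirectory("stage-sketch");
            Path input = base.resolve("unrolledParamFile.txt");
            Files.write(input, "a=1\n".getBytes(StandardCharsets.UTF_8));
            Path jobA = Files.createDirectory(base.resolve("job-a"));
            Path jobB = Files.createDirectory(base.resolve("job-b"));
            Path copied = stage(input, jobA, false);
            Path linked = stage(input, jobB, true);
            return new boolean[] {
                Files.isSymbolicLink(copied), Files.isSymbolicLink(linked)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("copy is symlink: " + result[0]);
        System.out.println("link is symlink: " + result[1]);
    }
}
```

With copying, every job directory holds its own bytes of the input file; with symlinking, each job directory holds only a pointer to the one original, which is why it avoids the duplicated copies mentioned above on a shared file system.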

On Dec 4, 2014, at 5:57 PM, Ozik, Jonathan <jozik at anl.gov> wrote:

I don’t see a definition in the user guide for the “staging: direct” option that’s included in the swift.conf file Yadu provided. I’m having a path name issue and I suspect it has something to do with the staging, but I’m not sure.

Suppose I pass a “-upf=filename.txt” command line argument to a swift script that includes the lines:

string upf_str = @arg("upf","unrolledParamFile.txt");
file params_file <single_file_mapper;file=upf_str>;

If I use the filename(params_file) command, would I get “filename.txt” with the default staging and the full path of the filename.txt file with the “direct” staging? Or is this a change between 0.95 RC5 and trunk?

Jonathan

On Dec 4, 2014, at 2:58 PM, Ozik, Jonathan <jozik at anl.gov> wrote:

Thank you all,

The job is queued up now. I’ll update on the results.

Jonathan

On Dec 4, 2014, at 1:33 PM, Michael Wilde <wilde at anl.gov> wrote:

We should (and will) add a getcwd( ) library function to eliminate this
particular need for java( ), though.

- Mike


On 12/4/14 1:23 PM, Yadu Nand Babuji wrote:
Hi Jonathan,

I rebuilt the trunk package with Mihael's fixes, and you can get it from
here:
http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

-Yadu

On 12/04/2014 01:01 PM, Mihael Hategan wrote:
Hi Jonathan,

I fixed this in GIT. Yadu, can you compile the latest GIT please?

Mihael

On Thu, 2014-12-04 at 18:33 +0000, Ozik, Jonathan wrote:
Hi Yadu,

I’ve tried running with trunk and am getting a strange Java error this time:
No method: getProperty in java.lang.System with parameter types[class java.lang.String]
swiftscript:java @ repast, line: 267

at org.griphyn.vdl.karajan.lib.swiftscript.Java.getMethod(Java.java:192)
at org.griphyn.vdl.karajan.lib.swiftscript.Java.function(Java.java:162)
at org.griphyn.vdl.karajan.lib.SwiftFunction.runBody(SwiftFunction.java:77)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:175)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.InternalFunction.run(InternalFunction.java:165)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.Sequential.run(Sequential.java:41)
at org.globus.cog.karajan.compiled.nodes.CompoundNode.runChild(CompoundNode.java:110)
at org.globus.cog.karajan.compiled.nodes.UParallel$1.run(UParallel.java:91)
at k.thr.LWThread.run(LWThread.java:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Execution failed:
Error attempting to use: java.lang.System
swiftscript:java @ repast, line: 267

I think this is being triggered by the call:
string s = strcat(java("java.lang.System","getProperty","user.dir"),"/");

Which worked just fine with 0.95 RC5.

Any thoughts?

Jonathan
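
[Editor's sketch] The error message describes a reflective lookup of a method named getProperty on java.lang.System that takes a single java.lang.String. System.getProperty(String) is a public static method in the standard Java API, so that lookup succeeds in plain Java, as the sketch below shows. The class name ReflectiveGetProperty and the callStatic helper are hypothetical, and this is not the code in Swift's Java.java, whose lookup logic is not shown in this thread.

```java
import java.lang.reflect.Method;

// Hypothetical illustration only; this is not Swift's implementation of java().
final class ReflectiveGetProperty {

    // Looks up a public static method that takes one String argument
    // and invokes it, returning the result as a String.
    static String callStatic(String className, String methodName, String arg) {
        try {
            Class<?> cls = Class.forName(className);
            // The parameter type is given explicitly as String.class,
            // matching "parameter types[class java.lang.String]" in the error.
            Method method = cls.getMethod(methodName, String.class);
            // A null receiver is used because the method is static.
            return (String) method.invoke(null, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No method: " + methodName + " in " + className, e);
        }
    }

    public static void main(String[] args) {
        // Mirrors the intent of the Swift call: working directory plus "/".
        String cwd = callStatic("java.lang.System", "getProperty", "user.dir") + "/";
        System.out.println(cwd);
    }
}
```

Since the same lookup works outside Swift, the failure looks specific to how that trunk build resolved the method rather than to the script line itself.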

On Dec 4, 2014, at 11:14 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

If your config file is named swift.conf and is in the current directory, it will be selected automatically and you needn't specify
the file on the command line. Otherwise, specify the config file using the -config option:
swift -config <path_to_config> <your_script.swift>

To resume from a restart log, say the restart.log in your run001 folder, specify it using the -resume option:
swift -resume run001/restart.log  ...
The restart log is from a 0.95 run, and I'm not sure whether it will work correctly with trunk.

There is no trunk module available on Midway, since we rebuild from source to keep up to date with changes in the codebase.

Generally you can always get the latest trunk build here (at most a week older than the last commit):
http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

Thanks,
Yadu

On 12/04/2014 10:48 AM, Jonathan Ozik wrote:
Thanks Yadu,

I have a few questions.
- How do I invoke swift and pass it the new swift.conf?
- What is the “restart” procedure?
- Is there a module I can load to use the latest swift trunk?

Jonathan

On Dec 3, 2014, at 7:03 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

I believe some of the timeout issues seen in your logs are fixed or less likely in trunk,
and I would recommend that you try a run with that. I've also converted your swift.properties to
the new swift.conf format. You can get a tested .conf file along with a small test case from here:

http://users.rcc.uchicago.edu/~yadunand/test_configs_package.tar.gz

Here are some changes I've made to the conf:
- Set lazyErrors: true and executionRetries: 0 so that long-running jobs are not retried.
- Set staging to direct, since you are running on the shared FS.
- Added worker logging and an app definition for debug.

You can get the latest trunk build from here: http://users.rcc.uchicago.edu/~yadunand/swift-trunk-latest.tar.gz

Thanks,
Yadu

On 12/03/2014 01:16 PM, Jonathan Ozik wrote:
Hi Yadu,

The tar.gz archive is here: https://www.dropbox.com/s/tt3ewapzaf0ygac/run001.tar.gz?dl=0
I’m also attaching the swift.properties file that I used below.

Thank you,

Jonathan

On Dec 3, 2014, at 11:04 AM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Jonathan,

The issue you are seeing sounds pretty close to what David reported a
while back.
Could you send us a tarball of your run directory from a failed run?

Could you also check whether you've set lowOverAllocation and
highOverAllocation in your sites definition?

Thanks,
Yadu

On 12/03/2014 10:50 AM, Ozik, Jonathan wrote:
Hi all,

I’m trying to run a large set of simulations on Midway using Swift 0.95-RC5.
768 of the 2187 tasks completed successfully and then I got the exception:

exception @ swift-int.k, line: 530
Caused by: Block task failed: Connection to worker lost
org.globus.cog.coaster.TimeoutException: Channel timed out. lastTime=141203-145449.325, now=141203-145649.844, channel=TCPChannel [type: server, contact: 1202-5410010-000072-000000]
at org.globus.cog.coaster.channels.AbstractCoasterChannel.checkTimeouts(AbstractCoasterChannel.java:133)
at org.globus.cog.coaster.channels.AbstractCoasterChannel$1.run(AbstractCoasterChannel.java:124)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

Progress: Wed, 03 Dec 2014 14:59:51+0000  Submitted:651  Failed:6  Finished successfully:768  Failed but can retry:762
Progress: Wed, 03 Dec 2014 14:59:52+0000  Submitted:651  Failed:44  Finished successfully:768  Failed but can retry:724

And the process seems to have stopped.

What log file would be helpful for diagnosing this?

Jonathan
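
[Editor's sketch] The lastTime and now values in the TimeoutException above look like timestamps in yyMMdd-HHmmss.SSS form. That format is an assumption, inferred from the run date of 3 Dec 2014 rather than taken from the Swift sources. Under that assumption, the gap between the two can be computed as follows; the class name ChannelGap and the gapMillis method are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; the timestamp format is an assumption.
final class ChannelGap {

    // Assumed layout: two-digit year, month, day, a dash,
    // then hours, minutes, seconds and milliseconds.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyMMdd-HHmmss.SSS");

    // Returns the number of milliseconds between the two timestamps.
    static long gapMillis(String lastTime, String now) {
        LocalDateTime last = LocalDateTime.parse(lastTime, FORMAT);
        LocalDateTime current = LocalDateTime.parse(now, FORMAT);
        return Duration.between(last, current).toMillis();
    }

    public static void main(String[] args) {
        long gap = gapMillis("141203-145449.325", "141203-145649.844");
        System.out.println(gap + " ms"); // prints "120519 ms"
    }
}
```

The gap comes to 120519 ms, that is, 2 minutes and 0.519 seconds. This is consistent with the channel being declared timed out after about two minutes without traffic, although the configured timeout value is not stated in the thread.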


_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user


--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago

