[Swift-devel] Problems in coaster block termination and restart

wilde at mcs.anl.gov
Sat Oct 2 09:58:47 CDT 2010


Mihael, another test which encounters a similar but different problem:

This time I set the app's maxwalltime artificially higher to get longer-running coaster blocks. With this setting, 200 passes of the test battery ran perfectly, doing over 17,000 coaster jobs.

But then, after the tests completed and the Swift iterate loop had sat idle for a while (maybe 30-60 minutes), I came back and started another test loop. When the iterate woke up, Swift was unable to run even the first coaster job.

You can see the logs for this in:

tp-login1.ci.uchicago.edu: /scratch/local/wilde/SwiftR/swift.local.8750

and I paste below the swift stdout that shows the problem (including a Java NPE).

It's possible that in both this run and the prior one I have aggravated a coaster issue by fiddling with the worker.pl timeouts that I needed to change from the earlier persistent & passive coasters configs.  I'll look into that side of it.

- Mike

17,889 jobs finished OK for 200 passes of the test script.

The run then sat idle for a while; then I started another test:

Progress:  Finished successfully:17889
SwiftScript trace: rserver: got dir, /scratch/local/wilde/SwiftR/requests.P28845/R0002412
Progress:  Submitted:1  Finished successfully:17889

Then came the timeouts and the NPE.

This was with coaster jobmanager local:local, so the swift log should show the coaster service activity.

- Mike

---
Progress:  Stage in:4  Submitting:1  Finished successfully:17883
Progress:  Checking status:1  Stage out:4  Finished successfully:17883
Progress:  Finished successfully:17889
[the previous Progress line repeats roughly 30 more times while the run sits idle]
SwiftScript trace: rserver: got dir, /scratch/local/wilde/SwiftR/requests.P28845/R0002412
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Command(610, SUBMITJOB): handling reply timeout; sendReqTime=101002-093633.473, sendTime=101002-093633.474, now=101002-093833.490
Command(610, SUBMITJOB)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Failed to shut down channel
java.lang.NullPointerException
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
        at org.globus.cog.karajan.workflow.service.channels.NullChannel.<init>(NullChannel.java:18)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
Exception in thread "Sender" java.lang.NullPointerException
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)
        at org.globus.cog.karajan.workflow.service.channels.NullChannel.<init>(NullChannel.java:22)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:85)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257)
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889
Progress:  Submitted:1  Finished successfully:17889

---
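A guess at what is going on with the NPE: the trace shows the AbstractKarajanChannel constructor calling configureHeartBeat() while a NullChannel is being built inside unregisterChannel/shutdownChannel, and something configureHeartBeat needs is evidently still null at that point, so even the attempt to shut the broken channel down blows up. Below is a minimal, self-contained Java sketch of that constructor-order failure pattern; the class and method names are invented for illustration and are not the real cog classes.

abstract class SketchAbstractChannel {
    private final SketchChannelContext context;

    SketchAbstractChannel(SketchChannelContext context) {
        this.context = context;
        // Doing real work from the constructor: if the argument is null,
        // construction fails part-way through.
        configureHeartBeat();
    }

    protected void configureHeartBeat() {
        // NPE here when context is null (e.g. a degenerate "null" channel
        // created while tearing down a channel that already timed out).
        int interval = context.getHeartBeatInterval();
        System.out.println("heartbeat every " + interval + "s");
    }
}

class SketchChannelContext {
    int getHeartBeatInterval() { return 60; }
}

class SketchNullChannel extends SketchAbstractChannel {
    SketchNullChannel() {
        super(null);   // null collaborator -> NPE in the super constructor
    }
}

public class NpeSketch {
    public static void main(String[] args) {
        new SketchNullChannel();   // reproduces the failure mode
    }
}

If something like that is happening in the real code, the shutdown path never completes, which would explain why the channel is left in a state where the next submit just times out again.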

----- wilde at mcs.anl.gov wrote:

> Mihael, can you look at the Swift run on TeraPort login1 at:
> 
> /scratch/local/wilde/SwiftR/swift.local.3162
> 
> This test ran about 20 iterations of the Swift-R test battery and then
> hung with 5 jobs in "active" state but not completing. Then Swift
> finally quit when a worker cut off an executing job (as I have retries
> off here).
> 
> You can see this in swift stdout in swift.stdouterr in that dir.
> 
> I *think* the run hanging has something to do with coaster block
> termination and restart.  The tc, sites.xml, and swift.properties (cf)
> files are all in that directory.  The command line used to start Swift was:
> 
> $SWIFTRBIN/../swift/bin/swift -config cf -tc.file tc -sites.file
> sites.xml $script -pipedir=$(pwd) >& swift.stdouterr </dev/null
> 
> $script is in that dir, rserver.swift.
> 
> The run completed about 2600 jobs and went through at least 2 rounds
> of coaster blocks before hanging.
> 
> I got about 7 emails from PBS on walltime exceeded.  I suspect my
> sites.xml coaster parameters could use some tuning; it's hard to
> determine the right time-block settings due to the dynamic and
> sporadic job submission rates.  Specifically, in these tests, no
> R-evaluation job runs for more than about 15 seconds, but the jobs get
> submitted in various bursts as the test proceeds; then the pattern
> repeats as the test is repeated.  I suspect the bursts of job
> concurrency range from 1 to 15 jobs, maybe a bit fewer at the moment.
> 
> This is pretty high priority (for the Swift R release for OpenMX), but I
> will try to work around it with manual coaster blocks.
> 
> I'm running at or close to the latest trunk revision.
> 
> Thanks,
> 
> - Mike
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



