[Swift-devel] Problems in coaster block termination and restart

wilde at mcs.anl.gov wilde at mcs.anl.gov
Sat Oct 2 07:43:40 CDT 2010


Mihael, can you look at the Swift run on TeraPort login1 at:

/scratch/local/wilde/SwiftR/swift.local.3162

This test ran about 20 iterations of the Swift-R test battery and then hung with 5 jobs in "active" state but not completing. Then Swift finally quit when a worker cut off an executing job (as I have retries off here).

You can see this in the Swift stdout, captured in swift.stdouterr in that dir.

I *think* the hang has something to do with coaster block termination and restart. The tc, sites.xml, and swift.properties (cf) files are all in that directory. The command line to start Swift was:

$SWIFTRBIN/../swift/bin/swift -config cf -tc.file tc -sites.file sites.xml $script -pipedir=$(pwd) >& swift.stdouterr </dev/null

$script is in that dir, rserver.swift.

Ran about 2600 jobs before hanging; it went through at least 2 rounds of coaster blocks before hanging.

I got about 7 emails from PBS about walltime being exceeded. I suspect my sites.xml coaster parameters could use some tuning; it's hard to determine the right time-block settings due to the dynamic and sporadic job submission rates. Specifically, in these tests, no R-evaluation job will run for more than about 15 seconds, but they get submitted in various bursts as the test proceeds; then the pattern repeats as the test is repeated. I suspect the bursts of job concurrency range from 1 to 15 jobs; maybe a bit less at the moment.

This is pretty high prio (for the Swift R release for OpenMX). But I will try to work around it with manual coaster blocks.

I'm running at or close to the latest trunk revision.

Thanks,

- Mike



