[Swift-devel] Problems in coaster block termination and restart
wilde at mcs.anl.gov
Sat Oct 2 07:43:40 CDT 2010
Mihael, can you look at the Swift run on TeraPort login1 at:
/scratch/local/wilde/SwiftR/swift.local.3162
This test ran about 20 iterations of the Swift-R test battery and then hung with 5 jobs in the "active" state that never completed. Swift finally quit when a worker cut off an executing job (I have retries turned off here).
You can see this in the Swift stdout, captured in swift.stdouterr in that directory.
I *think* the hang has something to do with coaster block termination and restart. The tc, sites.xml, and swift.properties (cf) files are all in that directory. The command line used to start Swift was:
$SWIFTRBIN/../swift/bin/swift -config cf -tc.file tc -sites.file sites.xml $script -pipedir=$(pwd) >& swift.stdouterr </dev/null
$script is rserver.swift, which is also in that directory.
Ran about 2600 jobs before hanging; it went through at least 2 rounds of coaster blocks before hanging.
I got about 7 emails from PBS on walltime exceeded. I suspect my sites.xml coaster parameters could use some tuning; it's hard to determine the right time-block settings because the job submission rate is dynamic and sporadic. Specifically, in these tests, no R-evaluation job will run for more than about 15 seconds, but jobs get submitted in various bursts as the test proceeds; then the pattern repeats as the test is repeated. I suspect the bursts of job concurrency range from 1 to 15 jobs; maybe a bit less at the moment.
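To make the sizing question concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the coaster block allocator actually works; the function names, the 600-second block walltime, and the 60-second reserve are made-up assumptions for illustration. Only the roughly 15-second job length and the burst size of 15 come from the tests described above.

```python
import math


def jobs_per_slot(block_walltime_s, job_walltime_s, reserve_s=0.0):
    """Back-to-back jobs one worker slot can finish inside one block.

    reserve_s is a hypothetical safety margin kept free at the end of the
    block so that no job is still running when the walltime limit hits.
    """
    usable = block_walltime_s - reserve_s
    if usable <= 0 or job_walltime_s <= 0:
        return 0
    return int(usable // job_walltime_s)


def slots_for_burst(burst_jobs, block_walltime_s, job_walltime_s, reserve_s=0.0):
    """Worker slots needed for a burst to finish within a single block."""
    per_slot = jobs_per_slot(block_walltime_s, job_walltime_s, reserve_s)
    if per_slot == 0:
        raise ValueError("block is too short to run even one job")
    return math.ceil(burst_jobs / per_slot)


if __name__ == "__main__":
    # Assumed values: 600 s block, 60 s reserve (both hypothetical);
    # 15 s jobs and a burst of 15 jobs are taken from the test description.
    print(slots_for_burst(burst_jobs=15, block_walltime_s=600,
                          job_walltime_s=15, reserve_s=60))
```

With these assumed numbers a single slot covers the whole burst; shrinking the block to 60 seconds with a 30-second reserve would need 8 slots.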
This is pretty high prio (for the Swift R release for OpenMX). But I will try to work around it with manual coaster blocks.
I'm running at or close to the latest trunk revision.
Thanks,
- Mike