[Swift-devel] coaster termination problems cause large runs to hang

Michael Wilde wilde at mcs.anl.gov
Fri Jun 24 18:54:34 CDT 2011


Mihael,

Papia is running large sweeps of the DSSAT land use model on PADS, and the runs are failing, apparently when the coasters time out. Her script attempts about 120K model invocations, each taking about 60 seconds to run. She gets between 30K and 60K of these done before it fails.

Can you look at the example below, on the CI network in /home/papia/dssat/run01 (which I will copy to ~wilde/dssat.run01 on the CI net)?

The swift.out file shows the run progressing nicely until the first coaster worker timeout occurs.

The run was started with ./RunSweep.sh:
time swift -tc.file tc -sites.file sites.xml -config cf  RunDssat.swift >& swift.out

The run ID is 20110624-1333-r17fczk0.
Swift is 0.92.1.

Thanks,

Mike


login2$ head swift.out
Swift svn swift-r4371 cog-r3096

RunID: 20110624-1333-r17fczk0
Progress:
Progress:  uninitialized:2
Progress:  Selecting site:36  Stage in:53  Submitting:1  Submitted:10
Progress:  Selecting site:36  Stage in:8  Submitting:2  Submitted:54
Progress:  Selecting site:36  Submitted:64
Progress:  Selecting site:36  Submitted:64
Progress:  Selecting site:36  Submitted:63  Active:1
login2$ ls -l *zk0.log
-rw-r--r-- 1 papia ci-users 161039247 Jun 24 17:25 RunDssat-20110624-1333-r17fczk0.log
login2$ pwd
/home/papia/dssat/run01
login2$ 
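
To locate the point in swift.out where progress stalls, the "Progress:" lines shown above can be reduced to per-state counts. The following is only a minimal Python sketch, assuming every progress line has the "State name:count" layout seen in the head output. The function name parse_progress and the regular expression are illustrative assumptions and are not part of Swift.

```python
import re

# Assumed layout, taken from the head output above:
#   Progress:  Selecting site:36  Submitted:63  Active:1
# Each pair is a state name (letters and spaces), a colon, and an integer.
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return {state: count} for a 'Progress:' line, or None for any other line.

    Hypothetical helper for illustration; not part of Swift.
    """
    prefix = "Progress:"
    if not line.startswith(prefix):
        return None
    rest = line[len(prefix):]
    return {name.strip(): int(count) for name, count in PAIR.findall(rest)}


def progress_snapshots(lines):
    """Yield one {state: count} dict per 'Progress:' line, in file order."""
    for line in lines:
        snapshot = parse_progress(line.rstrip("\n"))
        if snapshot is not None:
            yield snapshot
```

A possible use is to open swift.out, pass the file object to progress_snapshots, and look for the first snapshot after which the counts stop changing.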


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



