[Swift-devel] coaster termination problems cause large runs to hang

Mihael Hategan hategan at mcs.anl.gov
Mon Jun 27 13:40:23 CDT 2011


[hategan@login ~]$ cd /home/papia/dssat/run01
-bash: cd: /home/papia/dssat/run01: Permission denied


On Fri, 2011-06-24 at 18:54 -0500, Michael Wilde wrote:
> Mihael,
> 
> Papia is running large sweeps of the DSSAT land use model on PADS, and the runs appear to fail when the coasters time out. Her script attempts about 120K model invocations, each taking about 60 seconds to run. She gets between 30K and 60K of these done before it fails.
> 
> Can you look at the example below, on the CI network in /home/papia/dssat/run01
>  (which I will copy to ~wilde/dssat.run01 on the CI net)?
> 
> The swift.out file shows the run progressing nicely until the first coaster worker timeout occurs.
> 
> The run was started with ./RunSweep.sh:
> time swift -tc.file tc -sites.file sites.xml -config cf  RunDssat.swift >& swift.out
> 
> The run ID is 20110624-1333-r17fczk0.
> Swift is 0.92.1.
> 
> Thanks,
> 
> Mike
> 
> 
> login2$ head swift.out
> Swift svn swift-r4371 cog-r3096
> 
> RunID: 20110624-1333-r17fczk0
> Progress:
> Progress:  uninitialized:2
> Progress:  Selecting site:36  Stage in:53  Submitting:1  Submitted:10
> Progress:  Selecting site:36  Stage in:8  Submitting:2  Submitted:54
> Progress:  Selecting site:36  Submitted:64
> Progress:  Selecting site:36  Submitted:64
> Progress:  Selecting site:36  Submitted:63  Active:1
> login2$ ls -l *zk0.log
> -rw-r--r-- 1 papia ci-users 161039247 Jun 24 17:25 RunDssat-20110624-1333-r17fczk0.log
> login2$ pwd
> /home/papia/dssat/run01
> login2$ 
> 
> 