[Swift-devel] coaster termination problems cause large runs to hang
Mihael Hategan
hategan at mcs.anl.gov
Mon Jun 27 13:40:23 CDT 2011
[hategan at login ~]$ cd /home/papia/dssat/run01
-bash: cd: /home/papia/dssat/run01: Permission denied
On Fri, 2011-06-24 at 18:54 -0500, Michael Wilde wrote:
> Mihael,
>
> Papia is running large sweeps of the DSSAT land use model on PADS and is getting failures, apparently when the coasters time out. Her script attempts about 120K model invocations, each taking about 60 seconds to run. She gets between 30K and 60K of these done before it fails.
>
> Can you look at the example below, on the CI network in /home/papia/dssat/run01
> (which I will copy to ~wilde/dssat.run01 on the CI net)?
>
> The swift.out file shows the run progressing nicely until the first coaster worker timeout occurs.
>
> The run was started with ./RunSweep.sh:
> time swift -tc.file tc -sites.file sites.xml -config cf RunDssat.swift >& swift.out
>
> The run ID is 20110624-1333-r17fczk0.
> Swift is 0.92.1.
>
> Thanks,
>
> Mike
>
>
> login2$ head swift.out
> Swift svn swift-r4371 cog-r3096
>
> RunID: 20110624-1333-r17fczk0
> Progress:
> Progress: uninitialized:2
> Progress: Selecting site:36 Stage in:53 Submitting:1 Submitted:10
> Progress: Selecting site:36 Stage in:8 Submitting:2 Submitted:54
> Progress: Selecting site:36 Submitted:64
> Progress: Selecting site:36 Submitted:64
> Progress: Selecting site:36 Submitted:63 Active:1
> login2$ ls -l *zk0.log
> -rw-r--r-- 1 papia ci-users 161039247 Jun 24 17:25 RunDssat-20110624-1333-r17fczk0.log
> login2$ pwd
> /home/papia/dssat/run01
> login2$
>
>
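For triaging a swift.out like the one quoted above without opening the 160 MB log, a minimal Python sketch follows. It assumes the "Progress:" line format is exactly what the head output shows (state names, possibly containing spaces, each followed by a colon and a count). The names parse_progress and longest_stall are hypothetical helpers, not part of Swift. A long run of identical progress lines is only a hint that the run has stopped advancing, not proof of a coaster timeout.

```python
import re

# One "State name:count" pair; state names may contain spaces ("Stage in").
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")

def parse_progress(line):
    """Return {state: count} for a 'Progress:' line, or None for any other line."""
    line = line.lstrip("> ").strip()      # tolerate mail quoting prefixes
    if not line.startswith("Progress:"):
        return None
    rest = line[len("Progress:"):]
    return {name.strip(): int(count) for name, count in PAIR.findall(rest)}

def longest_stall(lines):
    """Length of the longest run of consecutive identical progress snapshots."""
    best = run = 0
    prev = None
    for line in lines:
        snap = parse_progress(line)
        if not snap:                      # skip other lines and the bare "Progress:"
            continue
        run = run + 1 if snap == prev else 1
        best = max(best, run)
        prev = snap
    return best

# Sample taken from the head output quoted above.
SAMPLE = [
    "RunID: 20110624-1333-r17fczk0",
    "Progress:",
    "Progress: uninitialized:2",
    "Progress: Selecting site:36 Stage in:53 Submitting:1 Submitted:10",
    "Progress: Selecting site:36 Stage in:8 Submitting:2 Submitted:54",
    "Progress: Selecting site:36 Submitted:64",
    "Progress: Selecting site:36 Submitted:64",
    "Progress: Selecting site:36 Submitted:63 Active:1",
]

if __name__ == "__main__":
    for line in SAMPLE:
        snap = parse_progress(line)
        if snap:
            print(snap)
    print("longest run of identical snapshots:", longest_stall(SAMPLE))
```

To run it against a real file, something like longest_stall(open("swift.out")) should work, since the function only iterates over lines.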