[Swift-devel] coaster termination problems cause large runs to hang

Michael Wilde wilde at mcs.anl.gov
Wed Jun 29 09:15:49 CDT 2011


Mihael, did you have a chance to assess the log of Papia's failing DSSAT run?

I think it shut down after approx 30K of the 120K app invocations.

Looked like the run went bad when coaster workers shut down at the end of their maxtime slot.

Papia, in the meantime, can you change the following parameters and re-run?

* in cf, from:

execution.retries=0
lazy.errors=false

* to:

execution.retries=2
lazy.errors=true

* in sites.xml, from:
    <profile namespace="globus" key="maxWallTime">00:02:00</profile>
* to:
    <profile namespace="globus" key="maxWallTime">00:10:00</profile>

The cf change will make Swift retry any app that fails, e.g., because it ran longer than its maxWallTime estimate. Swift will also continue to execute runnable app calls even if other app calls have failed.
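
To make the intended effect of those two settings concrete, here is a rough model in Python. It is only an illustration of the behavior described above, not Swift's actual code; the function name run_all and its arguments are made up for this sketch.

```python
def run_all(tasks, retries=2, lazy_errors=True):
    """Toy model of execution.retries and lazy.errors (hypothetical helper).

    tasks maps a name to a zero-argument callable standing in for an app call.
    Returns (results, failures).
    """
    results, failures = {}, {}
    for name, task in tasks.items():
        # One initial attempt plus up to `retries` retries.
        for _attempt in range(retries + 1):
            try:
                results[name] = task()
                break
            except Exception as exc:
                last_error = exc
        else:
            # Every attempt failed: record the error.
            failures[name] = last_error
            if not lazy_errors:
                # lazy.errors=false: stop at the first failed app.
                break
            # lazy.errors=true: keep going with the remaining app calls.
    return results, failures
```

With retries=0 and lazy_errors=False (the old settings), the first failed app ends the whole loop. With retries=2 and lazy_errors=True, a failed app gets two more attempts and the remaining apps still run, with failures collected for reporting at the end.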

The sites change gives a longer time estimate for all app calls on the PADS site, so that Swift won't start an app on any coaster that has less than that amount of time (10 mins vs 2 mins) remaining in its run time allocation.
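
The rule described above can be sketched as follows. This is a simplified Python model under my own assumptions, not the coaster scheduler itself; parse_hms and can_start are hypothetical names, and the 300-second figure is just an example value.

```python
def parse_hms(text):
    """Convert an HH:MM:SS string such as the maxWallTime value to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def can_start(remaining_seconds, max_wall_time):
    """Return True if a worker with remaining_seconds left in its
    allocation has room for an app with the given maxWallTime."""
    return remaining_seconds >= parse_hms(max_wall_time)


# A worker with 5 minutes left in its slot:
remaining = 300
print(can_start(remaining, "00:02:00"))  # old setting
print(can_start(remaining, "00:10:00"))  # new setting
```

Under this model, a worker with 5 minutes left would still accept an app at the old 00:02:00 setting but not at 00:10:00, so with the longer estimate a 60-second app that occasionally runs long is less likely to be cut off when the worker's slot expires.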

Finally, Papia, could you try this on both PADS and Beagle, first on 0.92.1 and then on the latest trunk (as of this morning)?

Thanks,

- Mike



----- Original Message -----
> >  (which I will copy to ~wilde/dssat.run01 on the CI net)?
> 
> bri$ pwd
> /home/wilde
> bri$ ls -ld dssat.run01/
> drwxr-xr-x 2 wilde ci-users 4096 Jun 24 19:01 dssat.run01//
> bri$
> 
> 
> ----- Original Message -----
> > [hategan at login ~]$ cd /home/papia/dssat/run01
> > -bash: cd: /home/papia/dssat/run01: Permission denied
> >
> >
> > On Fri, 2011-06-24 at 18:54 -0500, Michael Wilde wrote:
> > > Mihael,
> > >
> > > Papia is running large sweeps of the DSSAT land use model on PADS,
> > > and getting failures, it seems, when the coasters time out. Her
> > > script is attempting about 120K model invocations, each taking
> > > about 60 seconds to run. She gets between 30K and 60K of these done
> > > before it fails.
> > >
> > > Can you look at the example below, on the CI network in
> > > /home/papia/dssat/run01
> > >  (which I will copy to ~wilde/dssat.run01 on the CI net)?
> > >
> > > The swift.out file shows the run progressing nicely until the
> > > first coaster worker timeout occurs.
> > >
> > > The run was started with ./RunSweep.sh:
> > > time swift -tc.file tc -sites.file sites.xml -config cf RunDssat.swift >& swift.out
> > >
> > > The run id is RunID: 20110624-1333-r17fczk0
> > > Swift is 0.92.1.
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > >
> > > login2$ head swift.out
> > > Swift svn swift-r4371 cog-r3096
> > >
> > > RunID: 20110624-1333-r17fczk0
> > > Progress:
> > > Progress: uninitialized:2
> > > Progress: Selecting site:36 Stage in:53 Submitting:1 Submitted:10
> > > Progress: Selecting site:36 Stage in:8 Submitting:2 Submitted:54
> > > Progress: Selecting site:36 Submitted:64
> > > Progress: Selecting site:36 Submitted:64
> > > Progress: Selecting site:36 Submitted:63 Active:1
> > > login2$ ls -l *zk0.log
> > > -rw-r--r-- 1 papia ci-users 161039247 Jun 24 17:25 RunDssat-20110624-1333-r17fczk0.log
> > > login2$ pwd
> > > /home/papia/dssat/run01
> > > login2$
> > >
> > >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



