[Swift-devel] coaster termination problems cause large runs to hang

Mihael Hategan hategan at mcs.anl.gov
Wed Jun 29 11:17:16 CDT 2011


Yes, but I can't tell why the service isn't shutting down.

I do see, however, that qdel doesn't work properly there.

As to why jobs are failing in the first place, I don't know. What's the
error on stdout?

On Wed, 2011-06-29 at 09:15 -0500, Michael Wilde wrote:
> Mihael, did you have a chance to assess the log of Papia's failing DSSAT run?
> 
> I think it shut down after approx 30K of 120K app invocations.
> 
> Looked like the run went bad when coaster workers shut down at the end of their maxtime slot.
> 
> Papia, in the meantime, can you change the following parameters and re-run?
> 
> * in cf, from:
> 
> execution.retries=0
> lazy.errors=false
> 
> * to:
> 
> execution.retries=2
> lazy.errors=true
> 
> * in sites.xml, from:
>     <profile namespace="globus" key="maxWallTime">00:02:00</profile>
> * to:
>     <profile namespace="globus" key="maxWallTime">00:10:00</profile>
> 
> The cf change will make Swift retry any app that fails, e.g., because it may have run longer than its maxWallTime estimate. And Swift will continue to execute runnable app calls even if other app calls have failed.
> 
> The sites change gives a longer time estimate for all app calls on the PADS site, so that Swift won't start an app on any coaster that has less than that amount of time (10 mins vs 2 mins) remaining in its run time allocation.
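
The scheduling rule described above can be illustrated with a minimal
sketch. This is not Swift's coaster code; the function names
parse_walltime and can_start are hypothetical, and the only assumption
is the rule as stated: an app is started on a worker only if the worker
has at least maxWallTime left in its allocation.

```python
def parse_walltime(hms: str) -> int:
    """Convert an HH:MM:SS string such as "00:10:00" to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def can_start(app_max_walltime: str, worker_seconds_left: int) -> bool:
    """Hypothetical check: start the app only if the worker's remaining
    allocation covers the app's maxWallTime estimate."""
    return worker_seconds_left >= parse_walltime(app_max_walltime)


# A worker with 5 minutes (300 s) left in its allocation:
print(can_start("00:02:00", 300))  # old setting: app would be started
print(can_start("00:10:00", 300))  # new setting: app would not be started
```

With the 2-minute estimate, an app can be placed on a worker that is
close to the end of its slot; with the 10-minute estimate the same
worker is skipped.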
> 
> Finally, Papia, could you try this on both PADS and Beagle, first on 0.92.1 and then on the latest trunk (as of this morning)?
> 
> Thanks,
> 
> - Mike
> 
> 
> 
> ----- Original Message -----
> > >  (which I will copy to ~wilde/dssat.run01 on the CI net)?
> > 
> > bri$ pwd
> > /home/wilde
> > bri$ ls -ld dssat.run01/
> > drwxr-xr-x 2 wilde ci-users 4096 Jun 24 19:01 dssat.run01//
> > bri$
> > 
> > 
> > ----- Original Message -----
> > > [hategan at login ~]$ cd /home/papia/dssat/run01
> > > -bash: cd: /home/papia/dssat/run01: Permission denied
> > >
> > >
> > > On Fri, 2011-06-24 at 18:54 -0500, Michael Wilde wrote:
> > > > Mihael,
> > > >
> > > > Papia is running large sweeps of the DSSAT land use model on PADS,
> > > > and getting failures, it seems, when the coasters time out. Her
> > > > script is attempting about 120K model invocations, each taking
> > > > about 60 seconds to run. She gets between 30K and 60K of these
> > > > done before it fails.
> > > >
> > > > Can you look at the example below, on the CI network in
> > > > /home/papia/dssat/run01
> > > >  (which I will copy to ~wilde/dssat.run01 on the CI net)?
> > > >
> > > > The swift.out file shows the run progressing nicely until the
> > > > first coaster worker timeout occurs.
> > > >
> > > > The run was started with ./RunSweep.sh:
> > > > time swift -tc.file tc -sites.file sites.xml -config cf RunDssat.swift >& swift.out
> > > >
> > > > The run id is RunID: 20110624-1333-r17fczk0
> > > > Swift is 0.92.1.
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > > >
> > > > login2$ head swift.out
> > > > Swift svn swift-r4371 cog-r3096
> > > >
> > > > RunID: 20110624-1333-r17fczk0
> > > > Progress:
> > > > Progress: uninitialized:2
> > > > Progress: Selecting site:36 Stage in:53 Submitting:1 Submitted:10
> > > > Progress: Selecting site:36 Stage in:8 Submitting:2 Submitted:54
> > > > Progress: Selecting site:36 Submitted:64
> > > > Progress: Selecting site:36 Submitted:64
> > > > Progress: Selecting site:36 Submitted:63 Active:1
> > > > login2$ ls -l *zk0.log
> > > > -rw-r--r-- 1 papia ci-users 161039247 Jun 24 17:25 RunDssat-20110624-1333-r17fczk0.log
> > > > login2$ pwd
> > > > /home/papia/dssat/run01
> > > > login2$
> > > >
> > > >
> > 
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> 




