[Swift-devel] CASP jobs hang - seems to be in coaster scheduling

Mihael Hategan hategan at mcs.anl.gov
Thu Jul 1 10:49:15 CDT 2010


On Thu, 2010-07-01 at 10:23 -0500, Michael Wilde wrote:
> Sorry, false alarm - please ignore the request below.
> 
> The problem was indeed simply that the requested maxwalltime was larger than the maxtime of any available coaster slot.
> 
> Can this be detected and a clear error message issued, as well as ending the run?

I thought it was. I can double-check.
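
For reference, the kind of up-front check being asked for could look roughly like the sketch below. This is illustrative only and is not Swift's actual coaster code: the function name, the exception class, and the assumption that both the app's maxwalltime and the slot maxtime values are already expressed in seconds are all hypothetical.

```python
class WalltimeTooLargeError(Exception):
    """Hypothetical error: no coaster slot is long enough for the app."""


def check_walltime_fits(app_name, max_walltime_s, slot_maxtimes_s):
    """Fail fast if an app's maxwalltime cannot fit in any coaster slot.

    app_name        -- label used only in the error message
    max_walltime_s  -- requested maxwalltime, assumed to be in seconds
    slot_maxtimes_s -- maxtime of each available slot, assumed in seconds

    Returns the longest slot maxtime when the app fits.
    """
    if not slot_maxtimes_s:
        raise WalltimeTooLargeError(
            f"{app_name}: no coaster slots are configured"
        )

    longest = max(slot_maxtimes_s)
    if max_walltime_s > longest:
        # Ending the run here avoids jobs that sit queued forever.
        raise WalltimeTooLargeError(
            f"{app_name}: maxwalltime {max_walltime_s}s exceeds the longest "
            f"coaster slot maxtime {longest}s; the job can never be scheduled"
        )
    return longest
```

Running a check like this once per app before any jobs are submitted would turn the silent hang into an immediate, explicit failure.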


> 
> - Mike
> 
> ----- wilde at mcs.anl.gov wrote:
> 
> > [Mihael: help urgently needed on this if possible]
> > 
> > Aashish, I see the runs you submitted around 3-4AM this morning in
> > /home/aashish/CASP/{T0608,T0610,T0611}
> > 
> > Each of them shows a problem similar to what we saw last night
> > with T0608: the script submits 300 jobs to the pads coaster pool, and
> > none of them run.
> > 
> > In some of these scripts, the first round of 300 jobs (boostThreader)
> > works fine, but the later round of 300 loops jobs gets "stuck".
> > 
> > Mihael, can you set aside some time as soon as possible this morning
> > to look at these? These need to be submitted to CASP by 2PM CDT today,
> > so attention to the problem is rather urgent.
> > 
> > The scripts are all coming from /home/aashish/RapLoops
> > The swift release is from /home/wilde/swift/src/stable/...
> > 
> > In the above directories, you will find all source for scripts,
> > mappers, tc, and sites, as well as all logs. In some of the Tnnnn
> > directories (each one is a protein target for the CASP competition)
> > you will see multiple runs, each with an outN file log of stdout/err
> > and then a run directory for that run with all relevant files.
> > 
> > This *looks* like the familiar problem of trying to run an app whose
> > maxwalltime won't fit into any available coaster slot, but the times in
> > tc and sites.xml don't seem to explain that behavior.
> > 
> > This script has been running well since May; "slight" changes were
> > made to work around the unavailability of GPFS on PADS this week, but
> > we still can't figure out why these scripts are hanging in this
> > manner.
> > 
> > - Mike
> > 
> 




