[Swift-devel] CASP jobs hang - seems to be in coaster scheduling

wilde at mcs.anl.gov wilde at mcs.anl.gov
Thu Jul 1 08:23:37 CDT 2010


[Mihael: help urgently needed on this if possible]

Aashish, I see the runs you submitted around 3-4AM this morning in /home/aashish/CASP/{T0608,T0610,T0611}

Each of them shows a problem similar to the one we saw last night with T0608: the script submits 300 jobs to the pads coaster pool, and none of them run.

In some of these scripts, the first round of 300 jobs (boostThreader) works fine, but the later round of 300 loops jobs gets "stuck".

Mihael, can you set aside some time as soon as possible this morning to look at these? These need to be submitted to CASP by 2PM CDT today, so attention to the problem is rather urgent.

The scripts are all coming from /home/aashish/RapLoops
The swift release is from /home/wilde/swift/src/stable/...

In the above directories, you will find all source for scripts, mappers, tc, and sites, as well as all logs. In some of the Tnnnn directories (each one is a protein target for the CASP competition) you will see multiple runs, each with an outN log file of stdout/err and a run directory for that run with all relevant files.

This *looks* like the familiar problem of trying to run an app whose maxwalltime won't fit into any available coaster slot, but the times in tc and sites.xml don't seem to explain that behavior.
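
To make that failure mode concrete, here is a rough sketch of the check I have in mind. It is not the actual coaster scheduler code; the function names, the job names as dictionary keys, and all the numbers are made up for illustration, and times are assumed to be already converted to seconds. The idea is only that a job whose requested walltime exceeds the remaining time of every slot never gets dispatched and sits in the queue.

```python
# Hypothetical sketch of the "maxwalltime does not fit any slot" condition.
# This is NOT Swift's coaster scheduler; names and numbers are illustrative.

def fits_in_slot(job_walltime_s, slot_remaining_s):
    """True if a job asking for job_walltime_s seconds fits in the slot."""
    return job_walltime_s <= slot_remaining_s


def stuck_jobs(jobs, slots):
    """Return the names of jobs whose walltime fits in none of the slots.

    jobs  -- dict mapping job name to requested walltime in seconds
    slots -- list of remaining times (seconds) of the available slots
    """
    return [
        name
        for name, walltime_s in jobs.items()
        if not any(fits_in_slot(walltime_s, remaining) for remaining in slots)
    ]


if __name__ == "__main__":
    # Made-up values: one job asks for 30 minutes, the other for 2 hours,
    # and both available slots have 1 hour left.
    jobs = {"boostThreader": 1800, "loops": 7200}
    slots = [3600, 3600]
    print(stuck_jobs(jobs, slots))
```

If the values from tc and sites.xml are plugged into a check like this and nothing comes back as stuck, then the hang is probably caused by something else.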

This script has been running well since May; "slight" changes were made to work around the unavailability of GPFS on PADS this week, but we still can't figure out why these scripts are hanging in this manner.

- Mike



