[Swift-devel] Job ended mysteriously amidst coaster shutdowns

Michael Wilde wilde at mcs.anl.gov
Mon Mar 22 12:30:02 CDT 2010


Hi Mihael,

Can you look at the Swift run in this dir:
  /home/wilde/protests/run.boostthread.6573
(sites.xml, tc, swift.properties, and work directory are all under that dir)

It looks similar to the problem from last week, where coaster block shutdown is causing other running jobs to fail.

What may have happened here is that a block hit its time limit while an app was running (in this case the summary job, after 300 simulation jobs). It's also possible that the summary job itself failed but with a zero exit code, causing Swift to think it was done and to then start looking for its files. I'm looking into that, and will try a swift restart on this run.

There were 32 one-core workers in this pool; about 24 were active for the core of the run.

I have this pool on PADS set to use <scratch> to avoid overheating GPFS, which makes debugging a bit harder.

But can you take a quick look and see if this looks to you like a coaster block shutdown problem?

Am I correct in assuming that when a coaster block hits its time limit in the middle of running an app(), it should restart the app? And that it should not confuse this with an app() termination?

Thanks,

Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



