[Swift-devel] Re: Job ended mysteriously amidst coaster shutdowns

Michael Wilde wilde at mcs.anl.gov
Mon Mar 22 12:58:50 CDT 2010


----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Which run?

> -rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24
> BoostThreader-20100322-1044-yuu1ihp4.log

But hold off on looking for a moment; I've just added "-resume" to re-execute the summary job (that's what these more recent logs were), and I think I'm seeing the same failure on a fresh coaster, so it's most likely a problem in my script that cropped up at this larger scale.
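
To test the "failed but with a zero exit code" theory from my earlier message, a check along these lines could wrap the summary step. This is only a sketch under stated assumptions: `run_and_verify`, `missing_outputs`, and the file name `summary.out` are hypothetical names for illustration, not part of Swift or of my actual script. It runs a command and treats a zero exit code as a failure when an expected output file is absent or empty.

```python
import subprocess
import sys
from pathlib import Path


def missing_outputs(expected):
    """Return the expected output paths that are absent or empty."""
    return [p for p in map(Path, expected)
            if not p.is_file() or p.stat().st_size == 0]


def run_and_verify(cmd, expected):
    """Run cmd; return its exit code, or 1 if it exits 0 without its outputs."""
    rc = subprocess.run(cmd).returncode
    if rc != 0:
        # A genuine non-zero exit is passed through unchanged.
        return rc
    missing = missing_outputs(expected)
    for p in missing:
        print(f"exit code 0 but output missing or empty: {p}", file=sys.stderr)
    return 1 if missing else 0


if __name__ == "__main__":
    # Hypothetical usage: first argument is the expected output file,
    # the remaining arguments are the command to run.
    sys.exit(run_and_verify(sys.argv[2:], [sys.argv[1]]))
```

Invoked as, say, `python verify.py summary.out ./summarize.sh` (both names assumed), the wrapper would return a non-zero code whenever the step silently produced nothing, so the failure would be reported at the step itself rather than later as a missing file.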

Sorry for the likely false alarm.

- Mike


> 
> [hategan at login1 run.boostthread.6573]$ ls -al *.log
> -rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24
> BoostThreader-20100322-1044-yuu1ihp4.log
> -rw-r--r-- 1 wilde ci-users   17889 Mar 22 12:47
> BoostThreader-20100322-1247-brlv8d44.log
> -rw-r--r-- 1 wilde ci-users 1197335 Mar 22 12:48
> BoostThreader-20100322-1248-m01fv1k3.log
> -rw-r--r-- 1 wilde ci-users     228 Mar 22 12:48 swift.log
> 
> 
> If it's -1044, then no. It fails because it can't find a file at some
> point:
>  File not found:
> /home/wilde/protests/run.boostthread.6573/BoostThreader-20100322-1044-yuu1ihp4/shared/Results.Models/T0411D1.07.pdb
> 
> If you have eager errors turned on, then a failure in the middle of
> the run will cause swift and running jobs to abort.
> 
> On Mon, 2010-03-22 at 12:30 -0500, Michael Wilde wrote:
> > Hi Mihael,
> > 
> > Can you look at the Swift run in this dir:
> >   /home/wilde/protests/run.boostthread.6573
> > (sites.xml, tc, swift.properties, and work directory are all under
> > that dir)
> > 
> > It looks similar to the problem from last week, where coaster block
> > shutdown is causing other running jobs to fail.
> > 
> > What may have happened here is that a block hit its time limit while
> > an app was running (in this case the summary job, after 300 simulation
> > jobs). It's also possible that the summary job itself failed but with a
> > zero exit code, causing Swift to think it was done and to then start
> > looking for its files. I'm looking into that, and will try a swift
> > restart on this run.
> > 
> > There were 32 one-core workers in this pool; about 24 were active
> > for the core of the run.
> > 
> > I have this pool on PADS set to use <scratch> to avoid overheating
> > GPFS, which makes debugging a bit harder.
> > 
> > But can you do a quick look and see if this looks to you like a
> > coaster block shutdown problem?
> > 
> > Am I correct in assuming that when a coaster block hits its time
> > limit in the middle of running an app(), it should restart the app?
> > And that it should not confuse this with an app() termination?
> > 
> > Thanks,
> > 
> > Mike
> > 
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



