[Swift-devel] Re: Job ended mysteriously amidst coaster shutdowns
Michael Wilde
wilde at mcs.anl.gov
Mon Mar 22 12:58:50 CDT 2010
----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> Which run?
> -rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24
> BoostThreader-20100322-1044-yuu1ihp4.log
But hold off on looking for a moment; I've just added "-resume" to re-execute the summary job (that's what these more recent logs were), and I think I'm seeing the same failure on a fresh coaster, so it's most likely a problem in my script that cropped up at this larger scale.
Sorry for the likely false alarm.
- Mike
>
> [hategan at login1 run.boostthread.6573]$ ls -al *.log
> -rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24
> BoostThreader-20100322-1044-yuu1ihp4.log
> -rw-r--r-- 1 wilde ci-users 17889 Mar 22 12:47
> BoostThreader-20100322-1247-brlv8d44.log
> -rw-r--r-- 1 wilde ci-users 1197335 Mar 22 12:48
> BoostThreader-20100322-1248-m01fv1k3.log
> -rw-r--r-- 1 wilde ci-users 228 Mar 22 12:48 swift.log
>
>
> If it's -1044, then no. It fails because it can't find a file at some
> point:
> File not
> found:
> /home/wilde/protests/run.boostthread.6573/BoostThreader-20100322-1044-yuu1ihp4/shared/Results.Models/T0411D1.07.pdb
>
> If you have eager errors turned on, then a failure in the middle of
> the
> run will cause swift and running jobs to abort.
>
> On Mon, 2010-03-22 at 12:30 -0500, Michael Wilde wrote:
> > Hi Mihael,
> >
> > Can you look at the Swift run in this dir:
> > /home/wilde/protests/run.boostthread.6573
> > (sites.xml, tc, swift.properties, and work directory are all under
> that dir)
> >
> > It looks similar to the problem from last week, where coaster block
> shutdown is causing other running jobs to fail.
> >
> > What may have happened here is that a block hit its time limit while
> an app was running (in this case the summary job, after 300 simulation
> jobs). It's also possible that the summary job itself failed but with a
> zero exit code, causing Swift to think it was done and to then start
> looking for its files. I'm looking into that, and will try a swift
> restart on this run.
> >
> > There were 32 one-core workers in this pool; about 24 were active
> for the core of the run.
> >
> > I have this pool on PADS set to use <scratch> to avoid overheating
> GPFS, which makes debugging a bit harder.
> >
> > But can you do a quick look and see if this looks to you like a
> coaster block shutdown problem?
> >
> > Am I correct in assuming that when a coaster block hits its time
> limit in the middle of running an app(), it should restart the app?
> And that it should not confuse this with an app() termination?
> >
> > Thanks,
> >
> > Mike
> >
> >
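
One scenario raised in the quoted message is an app that fails but still exits with code zero, so Swift treats it as done and then cannot find its output files. A minimal sketch of a guard against that, assuming the app can be launched through a small Python wrapper: the function name run_and_verify and its arguments are hypothetical and are not part of Swift or the coaster code. It runs the command and turns a "successful" exit with missing or empty outputs into a non-zero exit code.

```python
import subprocess
import sys
from pathlib import Path


def run_and_verify(cmd, expected_outputs):
    """Run cmd and return an exit code that also reflects missing outputs.

    Hypothetical helper: returns the command's own exit code if it is
    non-zero, 1 if the command exited 0 but an expected output file is
    missing or empty, and 0 otherwise.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # The app already reported failure; pass its code through unchanged.
        return result.returncode

    # The app claimed success, so check that every expected file exists
    # and is non-empty before reporting success to the caller.
    missing = [
        str(p) for p in expected_outputs
        if not Path(p).is_file() or Path(p).stat().st_size == 0
    ]
    if missing:
        print("exit code 0 but missing or empty outputs: " + ", ".join(missing),
              file=sys.stderr)
        return 1
    return 0
```

A wrapper script could call sys.exit(run_and_verify(cmd, outputs)) so that the caller sees a failure at the app itself rather than a later "File not found" during stage-out. Whether this matches what actually happened in this run is not established here.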
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory