[Swift-devel] Re: Job ended mysteriously amidst coaster shutdowns

Mihael Hategan hategan at mcs.anl.gov
Mon Mar 22 12:54:21 CDT 2010


Which run?

[hategan@login1 run.boostthread.6573]$ ls -al *.log
-rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24 BoostThreader-20100322-1044-yuu1ihp4.log
-rw-r--r-- 1 wilde ci-users   17889 Mar 22 12:47 BoostThreader-20100322-1247-brlv8d44.log
-rw-r--r-- 1 wilde ci-users 1197335 Mar 22 12:48 BoostThreader-20100322-1248-m01fv1k3.log
-rw-r--r-- 1 wilde ci-users     228 Mar 22 12:48 swift.log


If it's -1044, then no. It fails because it can't find a file at some
point:
 File not found: /home/wilde/protests/run.boostthread.6573/BoostThreader-20100322-1044-yuu1ihp4/shared/Results.Models/T0411D1.07.pdb
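
To check which of the logs in a run directory report this kind of failure, something like the following could be used. It is only a rough sketch: the "File not found:" message text is taken from the excerpt above, but the rest of the log layout is an assumption, and the function names (missing_files, scan_run_dir) are made up for illustration and are not part of Swift.

```python
import re
from pathlib import Path

# Matches the message quoted above. Allowing whitespace (including a line
# break) between "not" and "found" is an assumption about how logs may wrap.
MISSING = re.compile(r"File not\s+found:\s*(\S+)")


def missing_files(log_text):
    """Return the distinct paths reported as not found, in first-seen order."""
    seen = []
    for match in MISSING.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def scan_run_dir(run_dir):
    """Map each *.log file name in run_dir to the missing paths it reports."""
    report = {}
    for log in sorted(Path(run_dir).glob("*.log")):
        found = missing_files(log.read_text(errors="replace"))
        if found:
            report[log.name] = found
    return report


if __name__ == "__main__":
    # Scan the current directory; logs with no such message are omitted.
    for name, paths in scan_run_dir(".").items():
        print(name)
        for path in paths:
            print("   ", path)
```

Run from inside the run directory, this prints each log that contains at least one such message, followed by the paths it names.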

If you have eager errors turned on, then a failure in the middle of the
run will cause Swift and any running jobs to abort.

On Mon, 2010-03-22 at 12:30 -0500, Michael Wilde wrote:
> Hi Mihael,
> 
> Can you look at the Swift run in this dir:
>   /home/wilde/protests/run.boostthread.6573
> (sites.xml, tc, swift.properties, and work directory are all under that dir)
> 
> It looks similar to the problem from last week, where coaster block shutdown is causing other running jobs to fail.
> 
> What may have happened here is that a block hit its time limit while an app was running (in this case the summary job, after 300 simulation jobs). It's also possible that the summary job itself failed but with a zero exit code, causing Swift to think it was done and to then start looking for its files. I'm looking into that, and will try a swift restart on this run.
> 
> There were 32 one-core workers in this pool; about 24 were active for the core of the run.
> 
> I have this pool on PADS set to use <scratch> to avoid overheating GPFS, which makes debugging a bit harder.
> 
> But can you do a quick look and see if this looks to you like a coaster block shutdown problem?
> 
> Am I correct in assuming that when a coaster block hits its time limit in the middle of running an app(), it should restart the app? And that it should not confuse this with an app() termination?
> 
> Thanks,
> 
> Mike
> 
> 



