[Swift-devel] Re: Job ended mysteriously amidst coaster shutdowns
Mihael Hategan
hategan at mcs.anl.gov
Mon Mar 22 12:54:21 CDT 2010
Which run?
[hategan@login1 run.boostthread.6573]$ ls -al *.log
-rw-r--r-- 1 wilde ci-users 8532855 Mar 22 11:24 BoostThreader-20100322-1044-yuu1ihp4.log
-rw-r--r-- 1 wilde ci-users 17889 Mar 22 12:47 BoostThreader-20100322-1247-brlv8d44.log
-rw-r--r-- 1 wilde ci-users 1197335 Mar 22 12:48 BoostThreader-20100322-1248-m01fv1k3.log
-rw-r--r-- 1 wilde ci-users 228 Mar 22 12:48 swift.log
If it's -1044, then no. It fails because it can't find a file at some
point:
File not found: /home/wilde/protests/run.boostthread.6573/BoostThreader-20100322-1044-yuu1ihp4/shared/Results.Models/T0411D1.07.pdb
If you have eager errors turned on, then a failure in the middle of the
run will cause swift and running jobs to abort.
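For reference, a minimal sketch of the setting involved, assuming the run uses the lazy.errors property in swift.properties (which I believe defaults to false, i.e. eager error reporting). I have not checked the value in the swift.properties under this run directory, so treat this as illustrative only:

```
# swift.properties (illustrative fragment, not copied from this run)
# false: report the first app failure immediately and abort the run (eager)
# true:  keep running independent jobs and report failures at the end (lazy)
lazy.errors=true
```

With lazy errors on, a single missing file should no longer take down jobs that do not depend on it.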
On Mon, 2010-03-22 at 12:30 -0500, Michael Wilde wrote:
> Hi Mihael,
>
> Can you look at the Swift run in this dir:
> /home/wilde/protests/run.boostthread.6573
> (sites.xml, tc, swift.properties, and work directory are all under that dir)
>
> It looks similar to the problem from last week, where coaster block shutdown is causing other running jobs to fail.
>
> What may have happened here is that a block hit its time limit while an app was running (in this case the summary job, after 300 simulation jobs). It's also possible that the summary job itself failed but with a zero exit code, causing Swift to think it was done and to then start looking for its files. I'm looking into that, and will try a swift restart on this run.
>
> There were 32 one-core workers in this pool; about 24 were active for the core of the run.
>
> I have this pool on PADS set to use <scratch> to avoid overheating GPFS, which makes debugging a bit harder.
>
> But can you do a quick look and see if this looks to you like a coaster block shutdown problem?
>
> Am I correct in assuming that when a coaster block hits its time limit in the middle of running an app(), it should restart the app? And that it should not confuse this with an app() termination?
>
> Thanks,
>
> Mike
>
>