[Swift-devel] Re: swift-falkon problem
Ben Clifford
benc at hawaga.org.uk
Tue Mar 18 15:57:37 CDT 2008
I picked the first failed job in the log oyu sent. Job id 2qbcdypi.
I assume that your submit host and the various machines involved have
properly synchronised clocks, but I have not checked this beyond seeing
that the machine I am logged into has the same time as my laptop. I have
labelled the times taken from different system clocks with lettered clock
domains just in case they are different.
For this job, its running in thread 0-1-88.
The karajan level job submission goes through these states (in clock
domain A)
23:14:08,196-0600 Submitting
23:14:08,204-0600 Submitted
23:14:14,121-0600 Active
23:14:14,121-0600 Completed
Note that the last two - Active and Completed - are the same (within a
millisecond)
At 23:14:14,189-0600 Swift checks the job status and finds the success
file is not found. (This timestamp is in clock domain A)
So now I look at for the status file myself on the fd filesystem:
$ ls --full-time
/home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
-rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
/home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
(this is in clock domain B)
And see that the file does exist but is a full 5 seconds after the job was
reported as successful by provider-deef.
So now we can look in the info/ directory (next to the status directory)
and get run time stamps or the jobs.
According to the info log, the job begins running at: (in clock domain B
again) at:
00:14:14.065373000-0500
which corresponds within about 60ms of the time that provider-deef
reported the job as active.
However, the execution according to the wrapper log shows that the job did
not finish executing until
00:14:19.233438000-0500
(which is when the status file is approximately timestamped).
My off-the-cuff hypothesis is, based on the above, that soemwhere in
provider-deef or below, the execution system is reporting a job as
completed as soon as it starts executing, rather than when it actually
finishes executing; and that successes with small numbers of jobs have
been a race condition that would disappear if those small jobs took a
substantially longer time to execute (eg if they had a sleep 30s in them).
--
More information about the Swift-devel
mailing list