[Swift-devel] Re: swift-falkon problem

Ben Clifford benc at hawaga.org.uk
Tue Mar 18 15:57:37 CDT 2008


I picked the first failed job in the log you sent. Job id 2qbcdypi.

I assume that your submit host and the various machines involved have 
properly synchronised clocks, but I have not checked this beyond seeing 
that the machine I am logged into has the same time as my laptop. I have 
labelled the times taken from different system clocks with lettered clock 
domains just in case they are different.

This job is running in thread 0-1-88.
The karajan level job submission goes through these states (in clock 
domain A)
23:14:08,196-0600 Submitting
23:14:08,204-0600 Submitted
23:14:14,121-0600 Active
23:14:14,121-0600 Completed

Note that the last two - Active and Completed - have the same timestamp 
(to the millisecond).
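
To make the gaps between those states explicit, here is a minimal Python 
sketch that parses the four timestamps above and prints the interval 
between consecutive states. The names STATES, parse and gaps are mine, 
for illustration only; they are not part of Swift or karajan.

```python
from datetime import datetime

# State timestamps copied from the message (all in clock domain A).
STATES = [
    ("Submitting", "23:14:08,196-0600"),
    ("Submitted", "23:14:08,204-0600"),
    ("Active", "23:14:14,121-0600"),
    ("Completed", "23:14:14,121-0600"),
]


def parse(ts):
    # %f accepts the three-digit millisecond field; %z accepts -0600.
    # No date is given in the log lines, so strptime uses its default date.
    return datetime.strptime(ts, "%H:%M:%S,%f%z")


def gaps(states):
    """Return (from_state, to_state, seconds) for consecutive states."""
    parsed = [(name, parse(ts)) for name, ts in states]
    return [
        (a, b, (tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(parsed, parsed[1:])
    ]


for a, b, secs in gaps(STATES):
    print(f"{a} -> {b}: {secs:.3f}s")
```

The Active -> Completed interval comes out as zero, which is the 
suspicious part.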

At 23:14:14,189-0600 Swift checks the job status and finds the success 
file is not found. (This timestamp is in clock domain A)

So now I look for the status file myself on the fd filesystem:

$ ls --full-time /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
-rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success

(this is in clock domain B)

I see that the file does exist, but its timestamp is a full 5 seconds 
after the job was reported as successful by provider-deef.

So now we can look in the info/ directory (next to the status directory) 
and get run timestamps for the jobs.

According to the info log, the job begins running (in clock domain B 
again) at:

00:14:14.065373000-0500

which is within about 60ms of the time that provider-deef reported the 
job as active.
However, the wrapper log shows that the job did not finish executing 
until

00:14:19.233438000-0500

(which is approximately when the status file is timestamped).
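
Since the two clock domains use different UTC offsets, the comparison is 
easier to check once everything is on one timeline. The sketch below 
does that arithmetic in Python. Two things in it are assumptions on my 
part: the dates for clock domain A (the log lines quoted above carry no 
date, so I take 2008-03-17 from the run directory name), and that the 
clocks are synchronised, which as noted I have not verified. The 
variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

TZ_A = timezone(timedelta(hours=-6))  # clock domain A, logged as -0600
TZ_B = timezone(timedelta(hours=-5))  # clock domain B, logged as -0500

# Clock domain A (submit side). The date is assumed from the run
# directory name amps1-20080317-2314-...
active_a = datetime(2008, 3, 17, 23, 14, 14, 121000, tzinfo=TZ_A)
completed_a = datetime(2008, 3, 17, 23, 14, 14, 121000, tzinfo=TZ_A)

# Clock domain B (execution side). Nanosecond values are truncated to
# microseconds because datetime stores no finer resolution.
started_b = datetime(2008, 3, 18, 0, 14, 14, 65373, tzinfo=TZ_B)
finished_b = datetime(2008, 3, 18, 0, 14, 19, 233438, tzinfo=TZ_B)
status_file_b = datetime(2008, 3, 18, 0, 14, 19, 202382, tzinfo=TZ_B)

# Aware datetimes subtract correctly across different UTC offsets.
start_vs_active = (started_b - active_a).total_seconds()
file_vs_completed = (status_file_b - completed_a).total_seconds()
run_time = (finished_b - started_b).total_seconds()

print(f"job start relative to Active:      {start_vs_active:+.3f}s")
print(f"status file relative to Completed: {file_vs_completed:+.3f}s")
print(f"job run time per the log:          {run_time:.3f}s")
```

If the clocks really are in sync, this puts the job start about 56ms 
before Active, and the status file about 5.08 seconds after Completed.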

My off-the-cuff hypothesis, based on the above, is that somewhere in 
provider-deef or below, the execution system is reporting a job as 
completed as soon as it starts executing, rather than when it actually 
finishes executing; and that the successes seen with small numbers of 
jobs have been the lucky outcome of a race condition, and would 
disappear if those small jobs took a substantially longer time to 
execute (eg if they had a sleep 30s in them).
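
A toy model of that hypothesis, in Python, for illustration only. It 
assumes that Completed is reported at the moment the job starts, that 
the success file appears when the job really finishes, and that Swift 
checks for the file a fixed delay after Completed. The 68ms delay is the 
gap between Completed (23:14:14,121) and the status check (23:14:14,189) 
above; the function name and the 10ms job are made up.

```python
def success_file_seen(job_seconds, check_delay_seconds):
    """True if the success file already exists when the check runs.

    Model: 'Completed' is (wrongly) reported at job start, the file is
    written job_seconds after start, and the check happens
    check_delay_seconds after 'Completed'.
    """
    return job_seconds <= check_delay_seconds


CHECK_DELAY = 0.068  # seconds between Completed and the status check

# A hypothetical very short job, the observed ~5.17s job, and a job
# containing a 30 second sleep.
for job_seconds in (0.010, 5.168, 30.0):
    outcome = "found" if success_file_seen(job_seconds, CHECK_DELAY) else "missing"
    print(f"job of {job_seconds:6.3f}s: success file {outcome}")
```

In this model only jobs shorter than the check delay appear to succeed, 
which is why a sleep 30 job should fail every time if the hypothesis is 
right.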

-- 


