[Swift-devel] Re: swift-falkon problem

Ben Clifford benc at hawaga.org.uk
Tue Mar 18 15:57:37 CDT 2008

I picked the first failed job in the log you sent. Job id 2qbcdypi.

I assume that your submit host and the various machines involved have 
properly synchronised clocks, but I have not checked this beyond seeing 
that the machine I am logged into has the same time as my laptop. I have 
labelled the times taken from different system clocks with lettered clock 
domains just in case they are different.

This job is running in thread 0-1-88.
The karajan level job submission goes through these states (in clock 
domain A)
23:14:08,196-0600 Submitting
23:14:08,204-0600 Submitted
23:14:14,121-0600 Active
23:14:14,121-0600 Completed

Note that the last two - Active and Completed - are the same (within a 

At 23:14:14,189-0600 Swift checks the job status and finds the success 
file is not found. (This timestamp is in clock domain A)

So now I look for the status file myself on the fd filesystem:

$ ls --full-time 

-rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 

(this is in clock domain B)

And I see that the file does exist, but it is timestamped a full 5 seconds 
after the job was reported as successful by provider-deef.
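
The two timestamps carry different UTC offsets (-0600 and -0500), so the 
comparison is easier to check once both are normalised. A minimal sketch, 
assuming the clock domain A log line falls on 2008-03-17 (the log shows only 
the time of day, so the date is my assumption) and truncating the ls 
nanoseconds to microseconds:

```python
from datetime import datetime

# Clock domain A: "Completed" as reported by the karajan-level log.
# The date 2008-03-17 is an assumption; the log line has only a time of day.
completed = datetime.strptime(
    "2008-03-17 23:14:14,121-0600", "%Y-%m-%d %H:%M:%S,%f%z"
)

# Clock domain B: mtime of the status file from `ls --full-time`,
# truncated from nanoseconds to microseconds (strptime's %f takes 6 digits).
status_file = datetime.strptime(
    "2008-03-18 00:14:19.202382-0500", "%Y-%m-%d %H:%M:%S.%f%z"
)

# Subtracting timezone-aware datetimes compares them in absolute time.
gap = (status_file - completed).total_seconds()
print(f"status file written {gap:.3f}s after Completed was reported")
```

This only means something if clock domains A and B really are synchronised.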

So now we can look in the info/ directory (next to the status directory) 
and get run time stamps for the jobs.

According to the info log, the job begins running (in clock domain B 
again) at:


which corresponds within about 60ms of the time that provider-deef 
reported the job as active.
However, the execution according to the wrapper log shows that the job did 
not finish executing until


(which is when the status file is approximately timestamped).

My off-the-cuff hypothesis, based on the above, is that somewhere in 
provider-deef or below, the execution system is reporting a job as 
completed as soon as it starts executing, rather than when it actually 
finishes executing; and that the successes seen with small numbers of jobs 
have been the winning side of a race condition, and would disappear if 
those small jobs took a substantially longer time to execute (e.g. if they 
had a sleep 30s in them).
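
To make the suspected race concrete, here is a small sketch of the timing 
under that hypothesis. It is not Swift or provider-deef code; the function 
name and the example run times are hypothetical. The only measured number is 
the roughly 68 ms between the Completed report (23:14:14,121) and the status 
check (23:14:14,189), both in clock domain A.

```python
# Delay between "Completed" being reported and Swift checking for the
# success file, taken from the log above: 14,189 minus 14,121 seconds.
CHECK_DELAY = 0.068

def status_check_passes(job_runtime, check_delay=CHECK_DELAY):
    """Return True if the success file exists when the check runs.

    Assumes (per the hypothesis) that "Completed" is reported at the
    moment the job starts, so the check happens check_delay seconds
    into a job that needs job_runtime seconds to write its file.
    """
    return job_runtime <= check_delay

# Hypothetical run times: a trivial job, one like 2qbcdypi (about 5s),
# and a job with a 30s sleep in it.
for runtime in (0.01, 5.0, 30.0):
    outcome = "success" if status_check_passes(runtime) else "failure"
    print(f"job taking {runtime:>5.2f}s -> {outcome}")
```

If the hypothesis is right, only jobs that finish within that short window 
would ever appear to succeed.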

