[Swift-devel] Re: swift-falkon problem

Wed Mar 19 10:45:50 CDT 2008

Following up on Ben's request from the msg below:

 > My off-the-cuff hypothesis is, based on the above, that soemwhere in
 > provider-deef or below, the execution system is reporting a job as
 > completed as soon as it starts executing, rather than when it actually
 > finishes executing; and that successes with small numbers of jobs have
 > been a race condition that would disappear if those small jobs took a
 > substantially longer time to execute (eg if they had a sleep 30s in 
them).
 >

I tested the following:

run314: 100 jobs, 10 workers: all finished OK

run315: 100 jobs, 110 workers: ~80% failed

run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK

These are in ~benc/swift-logs/wilde. The workdirs are preserved on 
bblogin/sico - I did not copy them because you need access to the msec 
timestamps anyways.

I can run these several times each to get more data before we assess the 
hypothesis, but didnt have time yet. Let me know if thats needed.

I'm cautiously leaning a bit more to the NFS-race theory. I would like 
to test with scp data transfer.  Am also trying to get gridftp compiled 
there with help from Raj.  Build is failing with gpt problems, I think I 
need Ben or Charles on this.

- Mike

On 3/18/08 3:57 PM, Ben Clifford wrote:
> I picked the first failed job in the log oyu sent. Job id 2qbcdypi.
> 
> I assume that your submit host and the various machines involved have 
> properly synchronised clocks, but I have not checked this beyond seeing 
> that the machine I am logged into has the same time as my laptop. I have 
> labelled the times taken from different system clocks with lettered clock 
> domains just in case they are different.
> 
> For this job, its running in thread 0-1-88.
> The karajan level job submission goes through these states (in clock 
> domain A)
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
> 
> Note that the last two - Active and Completed - are the same (within a 
> millisecond)
> 
> At 23:14:14,189-0600 Swift checks the job status and finds the success 
> file is not found. (This timestamp is in clock domain A)
> 
> So now I look at for the status file myself on the fd filesystem:
> 
> $ ls --full-time 
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
> 
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> 
> (this is in clock domain B)
> 
> And see that the file does exist but is a full 5 seconds after the job was 
> reported as successful by provider-deef.
> 
> So now we can look in the info/ directory (next to the status directory) 
> and get run time stamps or the jobs.
> 
> According to the info log, the job begins running at: (in clock domain B 
> again) at:
> 
> 00:14:14.065373000-0500
> 
> which corresponds within about 60ms of the time that provider-deef 
> reported the job as active.
> However, the execution according to the wrapper log shows that the job did 
> not finish executing until
> 
> 00:14:19.233438000-0500
> 
> (which is when the status file is approximately timestamped).
> 
> My off-the-cuff hypothesis is, based on the above, that soemwhere in 
> provider-deef or below, the execution system is reporting a job as 
> completed as soon as it starts executing, rather than when it actually 
> finishes executing; and that successes with small numbers of jobs have 
> been a race condition that would disappear if those small jobs took a 
> substantially longer time to execute (eg if they had a sleep 30s in them).
>