[Swift-devel] Re: swift-falkon problem
Michael Wilde
wilde at mcs.anl.gov
Wed Mar 19 10:45:50 CDT 2008
Following up on Ben's request from the msg below:
> My off-the-cuff hypothesis is, based on the above, that soemwhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in
them).
>
I tested the following:
run314: 100 jobs, 10 workers: all finished OK
run315: 100 jobs, 110 workers: ~80% failed
run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK
These are in ~benc/swift-logs/wilde. The workdirs are preserved on
bblogin/sico - I did not copy them because you need access to the msec
timestamps anyways.
I can run these several times each to get more data before we assess the
hypothesis, but didnt have time yet. Let me know if thats needed.
I'm cautiously leaning a bit more to the NFS-race theory. I would like
to test with scp data transfer. Am also trying to get gridftp compiled
there with help from Raj. Build is failing with gpt problems, I think I
need Ben or Charles on this.
- Mike
On 3/18/08 3:57 PM, Ben Clifford wrote:
> I picked the first failed job in the log oyu sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, its running in thread 0-1-88.
> The karajan level job submission goes through these states (in clock
> domain A)
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)
>
> At 23:14:14,189-0600 Swift checks the job status and finds the success
> file is not found. (This timestamp is in clock domain A)
>
> So now I look at for the status file myself on the fd filesystem:
>
> $ ls --full-time
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> (this is in clock domain B)
>
> And see that the file does exist but is a full 5 seconds after the job was
> reported as successful by provider-deef.
>
> So now we can look in the info/ directory (next to the status directory)
> and get run time stamps or the jobs.
>
> According to the info log, the job begins running at: (in clock domain B
> again) at:
>
> 00:14:14.065373000-0500
>
> which corresponds within about 60ms of the time that provider-deef
> reported the job as active.
> However, the execution according to the wrapper log shows that the job did
> not finish executing until
>
> 00:14:19.233438000-0500
>
> (which is when the status file is approximately timestamped).
>
> My off-the-cuff hypothesis is, based on the above, that soemwhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).
>
More information about the Swift-devel
mailing list