[Swift-devel] Re: swift-falkon problem
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Mar 18 16:36:21 CDT 2008
I would say it is impossible for Falkon to send a successful exit code
at the start of the execution (unless it's a bug that I have never seen
before). It could certainly send a failed exit code before the task
even starts under certain conditions, but if an exit code of 0 is
received at Swift, I would say that the task executed on the remote
resource and an exit code of 0 was propagated back to Swift. Could NFS
latency, in which one node creates a file/dir and another node needs
some time (in this case, 5 sec) before it actually sees the file,
explain what Mike is seeing? If this is a likely explanation, then the
race condition is that the exit code goes from worker to Falkon service
to Swift faster than NFS can update its file/dir list, and when Swift
checks for the file or dir (probably within tens of milliseconds of the
job completion), it can't find the file/dir. Are there any
counterarguments that would rule out this hypothesis? Just another
hypothesis which might be worth investigating.
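
If the NFS-latency hypothesis holds, one way to test it would be to have
the checker poll for the status file for a few seconds instead of
checking once. The sketch below is only an illustration in Python, not
what Swift or Falkon actually do; the function name wait_for_file and
the timeout/interval values are assumptions made up for this example.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.5):
    """Poll until `path` is visible or `timeout` seconds have elapsed.

    Hypothetical helper: returns True as soon as the file can be seen
    from this node, False if it never shows up within the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Give the remote filesystem time to propagate the new entry.
        time.sleep(interval)
```

If a success file that is missing on the first check reliably appears
within such a window, that would point towards filesystem visibility
delay; if the job is genuinely still running, the delay would instead
track the job's run time.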
Ioan
Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan level job submission goes through these states (in clock
> domain A)
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)
>
> At 23:14:14,189-0600 Swift checks the job status and finds that the
> success file does not exist. (This timestamp is in clock domain A)
>
> So now I look for the status file myself on the fd filesystem:
>
> $ ls --full-time
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> (this is in clock domain B)
>
> And see that the file does exist but is timestamped a full 5 seconds
> after the job was reported as successful by provider-deef.
>
> So now we can look in the info/ directory (next to the status directory)
> and get run time stamps for the jobs.
>
> According to the info log, the job begins running (in clock domain B
> again) at:
>
> 00:14:14.065373000-0500
>
> which is within about 60ms of the time that provider-deef
> reported the job as active.
> However, the execution according to the wrapper log shows that the job did
> not finish executing until
>
> 00:14:19.233438000-0500
>
> (which is when the status file is approximately timestamped).
>
> My off-the-cuff hypothesis is, based on the above, that somewhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).
>
>
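
To make the gaps between the quoted timestamps easier to compare, they
can be put on a single timeline. The following is a rough sketch, not
part of any Swift tooling. It assumes the two clock domains are actually
synchronised, and it assumes the domain A entries are dated 2008-03-17
(taken from the run directory name, since the log lines quoted above
carry no date). The nanosecond values are truncated to microseconds.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f%z"


def parse(ts):
    """Parse a timestamp that carries a numeric UTC offset."""
    return datetime.strptime(ts, FMT)


def gap_seconds(earlier, later):
    """Seconds from `earlier` to `later`; negative if the order is reversed."""
    return (later - earlier).total_seconds()


# Clock domain A (date is an assumption, see above)
completed = parse("2008-03-17 23:14:14.121000-0600")     # Active/Completed
swift_check = parse("2008-03-17 23:14:14.189000-0600")   # success file not found

# Clock domain B
info_start = parse("2008-03-18 00:14:14.065373-0500")    # job begins running
status_file = parse("2008-03-18 00:14:19.202382-0500")   # success file mtime
wrapper_end = parse("2008-03-18 00:14:19.233438-0500")   # job finishes

print("start -> Completed report:", gap_seconds(info_start, completed))
print("Completed report -> Swift check:", gap_seconds(completed, swift_check))
print("Completed report -> status file:", gap_seconds(completed, status_file))
print("start -> wrapper end:", gap_seconds(info_start, wrapper_end))
```

Under those assumptions the status file is timestamped roughly 5.08
seconds after the Completed report, and the Completed report comes
roughly 56 ms after the job start recorded in the info log.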
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================