[Swift-devel] Re: swift-falkon problem

Ioan Raicu iraicu at cs.uchicago.edu
Tue Mar 18 16:36:21 CDT 2008


I would say it is impossible for Falkon to send a successful exit code at 
the start of execution (unless it's a bug that I have never seen 
before). It could certainly send a failed exit code before the task 
even starts under certain conditions, but if an exit code of 0 is 
received at Swift, I would say that the task executed on the remote 
resource and an exit code of 0 was propagated back to Swift.  Could NFS 
latency, in which one node creates a file/dir and another node needs 
some time (in this case, 5 sec) before it actually sees the file, 
explain what Mike is seeing?  If this is a likely explanation, then the 
race condition is that the exit code goes from worker to Falkon service 
to Swift faster than NFS can update its file/dir list, so when Swift 
checks for the file or dir (probably within tens of milliseconds of the 
job completion), it can't find it.  Are there any counterarguments that 
would rule this hypothesis out?  It's just another hypothesis that 
might be worth investigating.
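
To make the NFS hypothesis concrete, here is a minimal sketch of a check 
that tolerates a visibility delay by polling for the file instead of 
looking once. This is not Swift's actual status check: the function name 
wait_for_file, the timeout, and the poll interval are all assumptions 
made up for illustration.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.1):
    """Poll until `path` is visible or `timeout` seconds have passed.

    Hypothetical helper, not part of Swift or Falkon.
    Returns True if the file became visible, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If a check like this, with a timeout comfortably above 5 sec, made the 
failures go away, that would be consistent with the file simply becoming 
visible late. By itself it would not tell a slow NFS apart from a job 
that is reported as completed too early.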

Ioan


Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have 
> properly synchronised clocks, but I have not checked this beyond seeing 
> that the machine I am logged into has the same time as my laptop. I have 
> labelled the times taken from different system clocks with lettered clock 
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan level job submission goes through these states (in clock 
> domain A)
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - have the same timestamp 
> (within a millisecond).
>
> At 23:14:14,189-0600 Swift checks the job status and finds that the 
> success file does not exist. (This timestamp is in clock domain A)
>
> So now I look for the status file myself on the fd filesystem:
>
> $ ls --full-time /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> (this is in clock domain B)
>
> And I see that the file does exist, but it is timestamped a full 5 
> seconds after the job was reported as successful by provider-deef.
>
> So now we can look in the info/ directory (next to the status directory) 
> and get run timestamps for the jobs.
>
> According to the info log, the job begins running (in clock domain B 
> again) at:
>
> 00:14:14.065373000-0500
>
> which is within about 60ms of the time that provider-deef reported the 
> job as active.
> However, the execution according to the wrapper log shows that the job did 
> not finish executing until
>
> 00:14:19.233438000-0500
>
> (which is when the status file is approximately timestamped).
>
> My off-the-cuff hypothesis is, based on the above, that somewhere in 
> provider-deef or below, the execution system is reporting a job as 
> completed as soon as it starts executing, rather than when it actually 
> finishes executing; and that successes with small numbers of jobs have 
> been a race condition that would disappear if those small jobs took a 
> substantially longer time to execute (e.g. if they had a sleep 30s in them).
>
>   
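
For reference, the timestamps quoted above can be put on one timeline 
with a few lines of Python. This is only a sketch under two assumptions: 
that clock domains A and B really are synchronised, and that the Swift 
log lines are from 2008-03-17 (the excerpt shows only times; the date is 
inferred from the run directory name). Nanosecond values are truncated 
to microseconds, and all variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC_MINUS_6 = timezone(timedelta(hours=-6))  # offset in the Swift log (domain A)
UTC_MINUS_5 = timezone(timedelta(hours=-5))  # offset on the filesystem (domain B)

# Clock domain A: states reported through provider-deef, and Swift's check.
# The date 2008-03-17 is an assumption; the log excerpt shows only times.
active = datetime(2008, 3, 17, 23, 14, 14, 121000, tzinfo=UTC_MINUS_6)
completed = active  # Active and Completed share the same millisecond
status_check = datetime(2008, 3, 17, 23, 14, 14, 189000, tzinfo=UTC_MINUS_6)

# Clock domain B: info log and status file, truncated to microseconds.
job_start = datetime(2008, 3, 18, 0, 14, 14, 65373, tzinfo=UTC_MINUS_5)
job_end = datetime(2008, 3, 18, 0, 14, 19, 233438, tzinfo=UTC_MINUS_5)
success_file = datetime(2008, 3, 18, 0, 14, 19, 202382, tzinfo=UTC_MINUS_5)

# Subtracting timezone-aware datetimes accounts for the differing offsets.
start_to_active = (active - job_start).total_seconds()
completed_to_job_end = (job_end - completed).total_seconds()
check_to_file = (success_file - status_check).total_seconds()

print(f"job start -> reported Active:       {start_to_active:.6f} s")
print(f"reported Completed -> job end:      {completed_to_job_end:.6f} s")
print(f"status check -> success file mtime: {check_to_file:.6f} s")
```

Under those assumptions the job is reported Active about 56 ms after it 
starts, and it is reported Completed a little over 5.1 s before the 
wrapper log says it finished.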

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================