[Swift-devel] Re: swift-falkon problem

Ioan Raicu iraicu at cs.uchicago.edu
Thu Mar 20 17:44:18 CDT 2008


If GRAM handles the staging in and out of data, then it's true.  Falkon, in
the way that Swift is using it now, does not do any data staging, so I
don't see how Falkon can do any further checking on the existence of
files on behalf of jobs.  What file would it check for?  This would
surely involve modifying the API in the Falkon provider code, for Swift
to tell Falkon what file it needs to verify.

If Falkon were to handle the data management, then you are right, Falkon 
would do all this checking, but currently it just treats Swift jobs as 
black boxes, and knows nothing about files or directories that need to 
exist.  Furthermore, the Falkon service could run anywhere (provided that
firewalls and NATs permit it), which further complicates any kind of
checking for files on some remote file system.

Why could Swift not have a retry mechanism?  Given that it received a
successful exit code, it could be more persistent in looking for the
success or failure file, and if the file doesn't exist, try again after
some small amount of sleep.  This would certainly hide (and potentially
solve) the race condition, with a persistent enough retry mechanism,
wouldn't it?
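
To make the retry idea concrete, here is a minimal sketch in Python (not
Swift's actual code).  The function name, the two marker paths, and the
attempts/delay defaults are all hypothetical; I am only assuming that a job
leaves behind either a success file or a failure file that the submit host
can eventually see.

```python
import os
import time


def wait_for_status_file(success_path, failure_path, attempts=10, delay=0.5):
    """Poll for a success or failure marker after a job reports exit code 0.

    Returns "success" or "failure" as soon as one marker is visible, or
    None if neither shows up after `attempts` checks.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        # Check the success marker first; a job is assumed to write only one.
        if os.path.exists(success_path):
            return "success"
        if os.path.exists(failure_path):
            return "failure"
        # Neither file is visible yet, so sleep before looking again
        # (but do not sleep after the final check).
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

A caller would treat a None result as a real failure only after the whole
retry budget is spent.  Note that this only hides the race if the file
eventually becomes visible to the host doing the polling.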

Ioan

Mihael Hategan wrote:
> On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote:
>   
>> On Wed, 19 Mar 2008, Michael Wilde wrote:
>>
>>     
>>> My (likely outdated) understanding of the NFS protocol was that it's supposed to
>>> guarantee close-to-open coherence.  Meaning that if two clients want to access
>>> a file sequentially, and the writing client closes the file before the reading
>>> client opens the file, then NFS was supposed to ensure that the reader
>>> correctly saw the existence and content of the file.
>>>       
>> Right.
>>
>> Linux NFS (but this is going back half a decade) had some problem there (I 
>> think that caused problems for GRAM2 somewhere, for example) though I do 
>> not remember the details; and it was also half a decade ago so has a good 
>> chance of being different now.
>>     
>
> I seem to remember what looked like an oddity at the time, that the GRAM
> PBS script was writing a file on the worker node and insisted that the
> script (and the job) be "done" only when the file was visible on the
> head node.
>
>   
>> A quick google did not find anything that immediately applied.
>>
>> I've also still not entirely ruled out a race somewhere in the 
>> falkon->provider-deef->swift stack reporting this.
>>
>>     
>>> If others agree that this should still be the case, then it's worth
>>> looking at our code to make sure that this is the case.  If it wasn't,
>>> you'd think that more things would break, but perhaps Falkon exacerbates 
>>> any problems in that area due to its low latency.
>>>       
>> Indeed, the combination of falkon and local filesystem access is probably 
>> getting the time between touching the status file on one node and reading 
>> it on another down pretty low compared to other submission and file access 
>> protocols.
>>
>>     
>>> The race as far as I know is between the worker writing and moving result,
>>> info, and success status files, and the swift host seeing these, correct?
>>>       
>> That's what your logs look like today. But yesterday had different timings 
>> that suggested a different problem.
>>
>> More runs of the kind that failed would be useful, along with the 
>> corresponding falkon logs that Ioan listed in a mail in this thread.
>>
>>     
>
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

