<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
If GRAM handles the staging in and out of data, then that's true. Falkon,
in the way that Swift is using it now, does not do any data staging, so
I don't see how Falkon can do any further checking on the existence of
files on behalf of jobs. What file would it check for? This would
surely involve modifying the API in the Falkon provider code so that
Swift can tell Falkon which file it needs to verify. <br>
<br>
If Falkon were to handle the data management, then you are right,
Falkon would do all this checking, but currently it just treats Swift
jobs as black boxes, and knows nothing about files or directories that
need to exist. Furthermore, the Falkon service could run anywhere
(provided that firewalls and NATs permit), which further complicates any
kind of checking for files on some remote file system. <br>
<br>
Why could Swift not have a retry mechanism? Given that it received a
successful exit code, it could be more persistent in looking for the
success or failure file, and if the file doesn't exist, try again after
a short sleep... With a persistent enough retry mechanism, this would
certainly hide (and potentially solve) the race condition, wouldn't
it?<br>
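<br>
A minimal sketch of the kind of retry loop I have in mind, written in
Python rather than in Swift's actual code. The function name, the file
paths, and the default attempt count and delay are all made up for
illustration and would need tuning:<br>
<pre>
```python
import os
import time


def wait_for_status_file(success_path, failure_path, attempts=5, delay=0.5):
    """Poll for a job's status file, sleeping between attempts.

    Returns "success" or "failure" as soon as the matching file is
    visible, or None if neither file shows up within `attempts` checks.
    All names and defaults here are hypothetical.
    """
    for attempt in range(attempts):
        if os.path.exists(success_path):
            return "success"
        if os.path.exists(failure_path):
            return "failure"
        # Neither file is visible yet; wait before checking again,
        # but do not sleep after the final check.
        if attempt != attempts - 1:
            time.sleep(delay)
    return None
```
</pre>
Because the number of attempts is bounded, a status file that never
appears would still be reported as a failure (the None case), just
later than it is today.<br>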
<br>
Ioan<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1206050843.4091.9.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Wed, 19 Mar 2008, Michael Wilde wrote:
</pre>
<blockquote type="cite">
<pre wrap="">My (likely outdated) understanding of the NFS protocol was that it's supposed to
guarantee close-to-open coherence. Meaning that if two clients want to access
a file sequentially, and the writing client closes the file before the reading
client opens the file, then NFS was supposed to ensure that the reader
correctly saw the existence and content of the file.
</pre>
</blockquote>
<pre wrap="">Right.
Linux NFS (but this is going back half a decade) had some problem there (I
think that caused problems for GRAM2 somewhere, for example) though I do
not remember the details; and it was also half a decade ago so has a good
chance of being different now.
</pre>
</blockquote>
<pre wrap=""><!---->
I seem to remember what looked like an oddity at the time, that the GRAM
PBS script was writing a file on the worker node and insisted that the
script (and the job) be "done" only when the file was visible on the
head node.
</pre>
<blockquote type="cite">
<pre wrap="">A quick google did not find anything that immediately applied.
I've also still not entirely ruled out a race somewhere in the
falkon->provider-deef->swift stack reporting this.
</pre>
<blockquote type="cite">
<pre wrap="">If others agree that this should still be the case, then it's worth
looking at our code to make sure that this is the case. If it wasn't,
you'd think that more things would break, but perhaps Falkon exacerbates
any problems in that area due to its low latency.
</pre>
</blockquote>
<pre wrap="">Indeed, the combination of falkon and local filesystem access is probably
getting the time between touching the status file on one node and reading
it on another down pretty low compared to other submission and file access
protocols.
</pre>
<blockquote type="cite">
<pre wrap="">The race as far as I know is between the worker writing and moving result,
info, and success status files, and the swift host seeing these, correct?
</pre>
</blockquote>
<pre wrap="">That's what your logs look like today. But yesterday had different timings
that suggested a different problem.
More runs of the kind that failed would be useful, along with the
corresponding falkon logs that Ioan listed in a mail in this thread.
</pre>
</blockquote>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dev.globus.org/wiki/Incubator/Falkon">http://dev.globus.org/wiki/Incubator/Falkon</a>
<a class="moz-txt-link-freetext" href="http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page">http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page</a>
===================================================
===================================================
</pre>
</body>
</html>