[Swift-devel] Re: swift-falkon problem
Mihael Hategan
hategan at mcs.anl.gov
Fri Mar 21 03:18:16 CDT 2008
On Thu, 2008-03-20 at 17:44 -0500, Ioan Raicu wrote:
> If GRAM handles the staging in and out of data, then it's true.
No, it's true because that's what GRAM scripts do.
> Falkon in the way that Swift is using it now does not do any data
> staging, so I don't see how Falkon can do any further checking on the
> existence of files, on behalf of jobs. What file would it check for?
Pretty much the file that GRAM checks for: one that it creates after the
executable completes. If the filesystem preserves temporal ordering on
file availability, then this will guarantee that any files created by
the job will be visible.
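
To make that concrete, here is a minimal sketch of the idea in Python. It is not the actual GRAM script code; the function names (run_and_mark, wait_for_marker), the marker path, and the timeout values are made up for illustration. The worker creates the marker only after the executable exits, and the side that reports completion waits until the marker is visible locally.

```python
import os
import subprocess
import time


def run_and_mark(cmd, marker_path):
    """Worker side: run the executable, then create the marker as the last step."""
    rc = subprocess.call(cmd)
    # Write to a temporary name and rename, so a reader never sees a
    # half-written marker. The marker is created after the job has exited,
    # so it is newer than anything the job wrote.
    tmp_path = marker_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(rc))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, marker_path)
    return rc


def wait_for_marker(marker_path, timeout=60.0, interval=0.5):
    """Reporting side: call the job done only once the marker is visible here."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(marker_path):
            with open(marker_path) as f:
                return int(f.read().strip())
        time.sleep(interval)
    raise TimeoutError("marker not visible: " + marker_path)
```

This only helps under the assumption stated above: if the filesystem does not preserve ordering of file availability, seeing the marker says nothing about the job's other files.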
> This would surely involve modifying the API in the falkon provider
> code, for Swift to tell Falkon what file it needs to verify.
>
> If Falkon were to handle the data management, then you are right,
> Falkon would do all this checking, but currently it just treats Swift
> jobs as black boxes, and knows nothing about files or directories that
> need to exist. Furthermore, the Falkon service could run anywhere
> (given that firewalls and NATs permit), which further complicates any
> kind of checking for files on some remote file system.
>
> Why could Swift not have a retry mechanism, given that it received a
> successful exit code, be more persistent in looking for the success or
> failure file, and if it doesn't exist, to try it again after some
> small amount of sleep... this would certainly hide (and potentially
> solve) the race condition, with a persistent enough retry mechanism,
> wouldn't it?
>
> Ioan
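
For reference, the retry being proposed here could look roughly like the sketch below. It is written in Python for illustration and is not Swift's actual code; wait_for_status, the two status file paths, and the attempt and delay values are all hypothetical.

```python
import os
import time


def wait_for_status(success_path, failure_path,
                    attempts=10, delay=0.1, backoff=2.0):
    """After a successful exit code, keep looking for a status file.

    Returns "success" or "failure" as soon as the matching file is
    visible, or "unknown" if neither shows up within the given number
    of attempts.
    """
    for _ in range(attempts):
        if os.path.exists(success_path):
            return "success"
        if os.path.exists(failure_path):
            return "failure"
        # Neither file is visible yet: sleep, then look again,
        # waiting a little longer each time.
        time.sleep(delay)
        delay *= backoff
    return "unknown"
```

Note that this hides the race rather than removing it: if the files take longer to appear than the total retry time, the caller still gets "unknown" and has to decide what that means.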
>
> Mihael Hategan wrote:
> > On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote:
> >
> > > On Wed, 19 Mar 2008, Michael Wilde wrote:
> > >
> > >
> > > > > My (likely outdated) understanding of the NFS protocol was that it's supposed to
> > > > guarantee close-to-open coherence. Meaning that if two clients want to access
> > > > a file sequentially, and the writing client closes the file before the reading
> > > > client opens the file, then NFS was supposed to ensure that the reader
> > > > correctly saw the existence and content of the file.
> > > >
> > > Right.
> > >
> > > Linux NFS (but this is going back half a decade) had some problem there (I
> > > think that caused problems for GRAM2 somewhere, for example) though I do
> > > not remember the details; and it was also half a decade ago so has a good
> > > chance of being different now.
> > >
> >
> > I seem to remember what looked like an oddity at the time, that the GRAM
> > PBS script was writing a file on the worker node and insisted that the
> > script (and the job) be "done" only when the file was visible on the
> > head node.
> >
> >
> > > A quick google did not find anything that immediately applied.
> > >
> > > I've also still not entirely ruled out a race somewhere in the
> > > falkon->provider-deef->swift stack reporting this.
> > >
> > >
> > > > > If others agree that this should still be the case, then it's worth
> > > > > looking at our code to make sure that this is the case. If it wasn't,
> > > > you'd think that more things would break, but perhaps Falkon exacerbates
> > > > any problems in that area due to its low latency.
> > > >
> > > Indeed, the combination of falkon and local filesystem access is probably
> > > getting the time between touching the status file on one node and reading
> > > it on another down pretty low compared to other submission and file access
> > > protocols.
> > >
> > >
> > > > The race as far as I know is between the worker writing and moving result,
> > > > info, and success status files, and the swift host seeing these, correct?
> > > >
> > > That's what your logs look like today. But yesterday had different timings
> > > that suggested a different problem.
> > >
> > > More runs of the kind that failed would be useful, along with the
> > > corresponding falkon logs that Ioan listed in a mail in this thread.
> > >
> > >
> >
>
> --
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
>