[Swift-devel] Re: swift-falkon problem

Mihael Hategan hategan at mcs.anl.gov
Wed Mar 19 15:48:57 CDT 2008


If there is a race condition, we need to find it and address it, because
it applies to more than just the status file.

Mihael

On Wed, 2008-03-19 at 12:09 -0500, Ioan Raicu wrote:
> Mike,
> The errors occur after the jobs execute, when Swift looks for 
> the empty success file for each job (in a unique dir per job), 
> right?  If Swift were to rely solely on the exit code that Falkon 
> returned for each job, would that not solve your immediate problem?
> That is not to say that a race condition might not happen elsewhere, but 
> at least it would not happen in such a simple scenario, where you have 
> no data dependencies.  For example, if job A (running on compute node X) 
> outputs data 1, and job B (running on compute node Y) reads data 1, and 
> Swift submits job B within milliseconds of job A's completion, it's 
> likely that job B will not find data 1 to read.  So, relying solely on 
> Falkon exit codes could let you run your 100 jobs with no data 
> dependencies among each other just fine, but it would push the race 
> condition to workflows that do have data dependencies between jobs.
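> 
> (For illustration only, a minimal sketch in Java of how a consumer of 
> a shared-filesystem output could tolerate that visibility lag by 
> polling briefly for the file instead of assuming it appears instantly. 
> The class and helper names are hypothetical, not actual Swift or 
> Falkon code.)
> 
>   import java.io.File;
> 
>   // Hypothetical helper: poll until a shared-filesystem file becomes
>   // visible or the timeout (in milliseconds) expires.
>   class FileWait {
>       static boolean waitForFile(File f, long timeoutMillis)
>               throws InterruptedException {
>           long deadline = System.currentTimeMillis() + timeoutMillis;
>           while (!f.exists() && System.currentTimeMillis() < deadline) {
>               Thread.sleep(250);  // back off briefly before re-checking
>           }
>           return f.exists();
>       }
>   }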
> 
> Ben, Mihael, is this feasible, to use the Falkon exit code solely to 
> determine the success or failure of a job?
> 
> Ioan
> 
> Michael Wilde wrote:
> > Following up on Ben's request from the msg below:
> >
> > > My off-the-cuff hypothesis is, based on the above, that somewhere in
> > > provider-deef or below, the execution system is reporting a job as
> > > completed as soon as it starts executing, rather than when it actually
> > > finishes executing; and that successes with small numbers of jobs have
> > > been a race condition that would disappear if those small jobs took a
> > > substantially longer time to execute (eg if they had a sleep 30s in 
> > them).
> > >
> >
> > I tested the following:
> >
> > run314: 100 jobs, 10 workers: all finished OK
> >
> > run315: 100 jobs, 110 workers: ~80% failed
> >
> > run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK
> >
> > These are in ~benc/swift-logs/wilde. The workdirs are preserved on 
> > bblogin/sico - I did not copy them because you need access to the msec 
> > timestamps anyway.
> >
> > I can run these several times each to get more data before we assess 
> > the hypothesis, but I haven't had time yet. Let me know if that's needed.
> >
> > I'm cautiously leaning a bit more toward the NFS-race theory. I would 
> > like to test with scp data transfer.  I am also trying to get gridftp 
> > compiled there with help from Raj.  The build is failing with gpt 
> > problems; I think I need Ben or Charles on this.
> >
> > - Mike
> >
> >
> > On 3/18/08 3:57 PM, Ben Clifford wrote:
> >> I picked the first failed job in the log you sent. Job id 2qbcdypi.
> >>
> >> I assume that your submit host and the various machines involved have 
> >> properly synchronised clocks, but I have not checked this beyond 
> >> seeing that the machine I am logged into has the same time as my 
> >> laptop. I have labelled the times taken from different system clocks 
> >> with lettered clock domains just in case they are different.
> >>
> >> For this job, it's running in thread 0-1-88.
> >> The Karajan-level job submission goes through these states (in clock 
> >> domain A):
> >> 23:14:08,196-0600 Submitting
> >> 23:14:08,204-0600 Submitted
> >> 23:14:14,121-0600 Active
> >> 23:14:14,121-0600 Completed
> >>
> >> Note that the last two - Active and Completed - are the same (within 
> >> a millisecond)
> >>
> >> At 23:14:14,189-0600 Swift checks the job status and finds that the 
> >> success file does not exist. (This timestamp is in clock domain A.)
> >>
> >> So now I look for the status file myself on the fd filesystem:
> >>
> >> $ ls --full-time 
> >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
> >>
> >> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 
> >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
> >>
> >>
> >> (this is in clock domain B)
> >>
> >> I see that the file does exist, but its timestamp is a full 5 seconds 
> >> after the job was reported as successful by provider-deef.
> >>
> >> So now we can look in the info/ directory (next to the status 
> >> directory) and get run timestamps for the jobs.
> >>
> >> According to the info log, the job begins running (in clock domain B 
> >> again) at:
> >>
> >> 00:14:14.065373000-0500
> >>
> >> which is within about 60 ms of the time that provider-deef reported 
> >> the job as active.
> >> However, the wrapper log shows that the job did not finish executing 
> >> until
> >>
> >> 00:14:19.233438000-0500
> >>
> >> (which is when the status file is approximately timestamped).
> >>
> >> My off-the-cuff hypothesis is, based on the above, that somewhere in 
> >> provider-deef or below, the execution system is reporting a job as 
> >> completed as soon as it starts executing, rather than when it 
> >> actually finishes executing; and that successes with small numbers of 
> >> jobs have been a race condition that would disappear if those small 
> >> jobs took a substantially longer time to execute (eg if they had a 
> >> sleep 30s in them).
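> >>
> >> (For illustration only, and not the actual provider-deef code: a 
> >> hypothetical Java sketch of the distinction being described, where 
> >> Completed is reported only after the process has really exited. 
> >> notifyStatus() and JobStatus are made-up placeholders.)
> >>
> >>   // Report Active when the process starts, but Completed only after
> >>   // Process.waitFor() returns, i.e. after the job actually finishes.
> >>   void runAndReport(java.util.List<String> command) throws Exception {
> >>       Process p = new ProcessBuilder(command).start();
> >>       notifyStatus(JobStatus.ACTIVE);      // started executing
> >>       int exitCode = p.waitFor();          // block until it exits
> >>       notifyStatus(exitCode == 0 ? JobStatus.COMPLETED
> >>                                  : JobStatus.FAILED);
> >>   }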
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> 



