[Swift-devel] Re: swift-falkon problem

Ioan Raicu iraicu at cs.uchicago.edu
Wed Mar 19 12:09:25 CDT 2008


Mike,
The errors occur after the jobs execute, when Swift looks for the empty 
success file for each job (in a unique dir per job), right?  If Swift 
were to rely solely on the exit code that Falkon returned for each job, 
would that not solve your immediate problem?  That is not to say that a 
race condition might not happen elsewhere, but at least it would not 
happen in such a simple scenario, where you have no data dependencies.  
For example, if job A (running on compute node X) outputs data 1, and 
job B (running on compute node Y) reads data 1, and Swift submits job B 
within milliseconds of job A's completion, it is likely that job B will 
not find data 1 to read.  So, relying solely on Falkon exit codes would 
let you run your 100 jobs, which have no data dependencies among each 
other, just fine, and would push the race condition to workflows that do 
have data dependencies between jobs.
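To make the two options concrete, here is a minimal sketch (hypothetical 
Java, not the actual Swift/provider-deef code; the class and method names, 
the 500 ms poll interval, and the timeout are made up for illustration) of 
deciding success from the Falkon exit code versus polling for the per-job 
success file, with retries to tolerate the delay before the file becomes 
visible on the submit host over NFS:

    import java.io.File;

    public class SuccessCheck {

        // (1) Trust the exit code that the execution service (Falkon)
        //     reports for the job.
        static boolean successByExitCode(int exitCode) {
            return exitCode == 0;
        }

        // (2) Look for the empty per-job success file, retrying for a
        //     while so that NFS attribute-cache lag does not turn into
        //     a false failure.
        static boolean successByStatusFile(File marker, long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (marker.exists()) {
                    return true;      // marker is visible on this host
                }
                Thread.sleep(500);    // back off before re-checking
            }
            return false;             // never appeared within the window
        }

        // Tiny demo: pass the path of a success file as the argument.
        public static void main(String[] args) throws InterruptedException {
            File marker = new File(args[0]);
            System.out.println("marker visible: "
                    + successByStatusFile(marker, 10000L));
        }
    }

Either variant would get your 100 dependency-free jobs through; the retry 
variant also tolerates the visibility lag, but only up to whatever timeout 
you pick, so it hides the race rather than removing it.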

Ben, Mihael, is it feasible to use the Falkon exit code alone to 
determine the success or failure of a job?

Ioan

Michael Wilde wrote:
> Following up on Ben's request from the msg below:
>
> > My off-the-cuff hypothesis, based on the above, is that somewhere in
> > provider-deef or below, the execution system is reporting a job as
> > completed as soon as it starts executing, rather than when it actually
> > finishes executing; and that successes with small numbers of jobs have
> > been a race condition that would disappear if those small jobs took a
> > substantially longer time to execute (e.g. if they had a sleep 30s in
> > them).
> >
>
> I tested the following:
>
> run314: 100 jobs, 10 workers: all finished OK
>
> run315: 100 jobs, 110 workers: ~80% failed
>
> run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK
>
> These are in ~benc/swift-logs/wilde. The workdirs are preserved on 
> bblogin/sico - I did not copy them because you need access to the msec 
> timestamps anyway.
>
> I can run these several times each to get more data before we assess 
> the hypothesis, but didn't have time yet. Let me know if that's needed.
>
> I'm cautiously leaning a bit more toward the NFS-race theory. I would 
> like to test with scp data transfer.  I am also trying to get gridftp 
> compiled there with help from Raj.  The build is failing with gpt 
> problems; I think I need Ben or Charles on this.
>
> - Mike
>
>
> On 3/18/08 3:57 PM, Ben Clifford wrote:
>> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>>
>> I assume that your submit host and the various machines involved have 
>> properly synchronised clocks, but I have not checked this beyond 
>> seeing that the machine I am logged into has the same time as my 
>> laptop. I have labelled the times taken from different system clocks 
>> with lettered clock domains just in case they are different.
>>
>> For this job, it's running in thread 0-1-88.
>> The karajan-level job submission goes through these states (in clock 
>> domain A):
>> 23:14:08,196-0600 Submitting
>> 23:14:08,204-0600 Submitted
>> 23:14:14,121-0600 Active
>> 23:14:14,121-0600 Completed
>>
>> Note that the last two - Active and Completed - occur at the same time 
>> (within a millisecond).
>>
>> At 23:14:14,189-0600 Swift checks the job status and finds that the 
>> success file is not present. (This timestamp is in clock domain A.)
>>
>> So now I look for the status file myself on the fd filesystem:
>>
>> $ ls --full-time 
>> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
>>
>> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 
>> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
>>
>>
>> (this is in clock domain B)
>>
>> And I see that the file does exist, but it is a full 5 seconds after 
>> the job was reported as successful by provider-deef. (Clock domain A is 
>> -0600 and clock domain B is -0500, so 23:14:14,121 -0600 corresponds to 
>> 00:14:14.121 -0500, about 5.1 seconds before the file's mtime of 
>> 00:14:19.202 -0500.)
>>
>> So now we can look in the info/ directory (next to the status 
>> directory) and get run time stamps for the jobs.
>>
>> According to the info log, the job begins running (in clock 
>> domain B again) at:
>>
>> 00:14:14.065373000-0500
>>
>> which is within about 60 ms of the time that provider-deef 
>> reported the job as active.
>> However, the wrapper log shows that the job did not finish 
>> executing until
>>
>> 00:14:19.233438000-0500
>>
>> (which is approximately when the status file is timestamped).
>>
>> My off-the-cuff hypothesis, based on the above, is that somewhere in 
>> provider-deef or below, the execution system is reporting a job as 
>> completed as soon as it starts executing, rather than when it 
>> actually finishes executing; and that successes with small numbers of 
>> jobs have been a race condition that would disappear if those small 
>> jobs took a substantially longer time to execute (e.g. if they had a 
>> sleep 30s in them).
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================




