[Swift-devel] Re: swift-falkon problem

Mihael Hategan hategan at mcs.anl.gov
Wed Mar 19 03:25:50 CDT 2008


On Tue, 2008-03-18 at 16:36 -0500, Ioan Raicu wrote:
> I would say that it is impossible for Falkon to send a successful exit code 
> at the start of the execution (unless it's a bug that I have never seen 
> before)... 

:)
Like any new bug?

> Ioan
> 
> 
> Ben Clifford wrote:
> > I picked the first failed job in the log you sent. Job id 2qbcdypi.
> >
> > I assume that your submit host and the various machines involved have 
> > properly synchronised clocks, but I have not checked this beyond seeing 
> > that the machine I am logged into has the same time as my laptop. I have 
> > labelled the times taken from different system clocks with lettered clock 
> > domains just in case they are different.
> >
> > For this job, it's running in thread 0-1-88.
> > The karajan level job submission goes through these states (in clock 
> > domain A)
> > 23:14:08,196-0600 Submitting
> > 23:14:08,204-0600 Submitted
> > 23:14:14,121-0600 Active
> > 23:14:14,121-0600 Completed
> >
> > Note that the last two - Active and Completed - have the same timestamp 
> > (within a millisecond).
> >
> > At 23:14:14,189-0600 Swift checks the job status and finds the success 
> > file is not found. (This timestamp is in clock domain A)
> >
> > So now I look for the status file myself on the fd filesystem:
> >
> > $ ls --full-time 
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success 
> >
> > -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> >
> > (this is in clock domain B)
> >
> > And see that the file does exist but is timestamped a full 5 seconds after 
> > the job was reported as successful by provider-deef.
> >
> > So now we can look in the info/ directory (next to the status directory) 
> > and get run timestamps for the jobs.
> >
> > According to the info log, the job begins running (in clock domain B 
> > again) at:
> >
> > 00:14:14.065373000-0500
> >
> > which is within about 60ms of the time that provider-deef reported the 
> > job as active.
> > However, the wrapper log shows that the job did not finish executing 
> > until
> >
> > 00:14:19.233438000-0500
> >
> > (which is when the status file is approximately timestamped).
> >
> > My off-the-cuff hypothesis is, based on the above, that somewhere in 
> > provider-deef or below, the execution system is reporting a job as 
> > completed as soon as it starts executing, rather than when it actually 
> > finishes executing; and that successes with small numbers of jobs have 
> > been a race condition that would disappear if those small jobs took a 
> > substantially longer time to execute (e.g. if they had a sleep 30s in them).
> >
> >   
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> 
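
For anyone who wants to re-check the arithmetic in Ben's analysis, here is a
minimal sketch that puts the quoted timestamps on a common UTC timeline and
prints the gaps between them. Several things in it are assumptions rather than
facts from the logs: the helper name to_utc is made up for illustration, the
date for the clock domain A entries (2008-03-17) is inferred from the run
directory name, the log's comma before the milliseconds is written as a dot,
and the nanosecond value from ls is truncated to microseconds. The comparison
across clock domains A and B only means something if the two clocks really
are synchronised.

```python
from datetime import datetime, timedelta, timezone


def to_utc(date, clock, utc_offset_hours):
    """Hypothetical helper: parse a local timestamp and convert it to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f")
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


# Clock domain A (karajan / Swift log, offset -0600).
# The date is an assumption taken from the run directory name.
reported_completed = to_utc("2008-03-17", "23:14:14.121", -6)
status_check = to_utc("2008-03-17", "23:14:14.189", -6)

# Clock domain B (info log, wrapper log and ls output, offset -0500).
job_start = to_utc("2008-03-18", "00:14:14.065373", -5)
status_file = to_utc("2008-03-18", "00:14:19.202382", -5)  # ns truncated
job_end = to_utc("2008-03-18", "00:14:19.233438", -5)

# Gaps between the events, in seconds.
start_to_reported = (reported_completed - job_start).total_seconds()
reported_to_file = (status_file - reported_completed).total_seconds()
check_to_file = (status_file - status_check).total_seconds()
run_time = (job_end - job_start).total_seconds()

print(f"job start -> reported Active/Completed: {start_to_reported:.3f}s")
print(f"reported Completed -> success file:     {reported_to_file:.3f}s")
print(f"Swift status check -> success file:     {check_to_file:.3f}s")
print(f"job start -> job end (wrapper log):     {run_time:.3f}s")
```

If the clocks agree, the Completed report lands tens of milliseconds after the
job starts and about five seconds before the success file appears, which is
the ordering the hypothesis above depends on.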



