[Swift-devel] fast-failing jobs

Ioan Raicu iraicu at cs.uchicago.edu
Fri Apr 11 09:11:42 CDT 2008


We addressed this in Falkon by suspending bad nodes (within Falkon).
As for solving the problem in general, here is an idea.

The retry counter would be kept on a per-site basis.  Let's assume the max
retry is set to 3, and we have 4 sites, of which 3 are broken (they fail
fast, in seconds per job) and only 1 is good (computing for minutes per
job).  Assuming we have 10 jobs in total, then within 1 minute all 10 jobs
will have failed 3 times at each broken site, and the only site left that
could potentially run these 10 jobs is the 4th site, which works at a few
minutes per job.  Note that the 3 bad sites aren't penalized in any way:
if there are jobs that have not yet run and failed there, they will still
be tried there...

This sounds like it would fix your problem, but I am not sure how easy it
is to keep track of retries per site, and to fail a job only once it has
failed the maximum number of times at all sites!
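
Here is a minimal sketch of what that bookkeeping might look like.  The
class and method names (PerSiteRetryTracker, recordFailure, etc.) are
hypothetical and just illustrate the idea; this is not the actual Swift
scheduler code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-(job, site) retry bookkeeping.
class PerSiteRetryTracker {
    private final int maxRetriesPerSite;  // e.g. 3, as in the example above
    private final Map<String, Map<String, Integer>> fails = new HashMap<>();

    PerSiteRetryTracker(int maxRetriesPerSite) {
        this.maxRetriesPerSite = maxRetriesPerSite;
    }

    // Record one failure of the given job at the given site.
    void recordFailure(String jobId, String site) {
        fails.computeIfAbsent(jobId, j -> new HashMap<>())
             .merge(site, 1, Integer::sum);
    }

    // A site stays eligible for this job while it has not yet failed
    // the job maxRetriesPerSite times.
    boolean siteEligible(String jobId, String site) {
        Map<String, Integer> perSite = fails.get(jobId);
        int count = (perSite == null) ? 0 : perSite.getOrDefault(site, 0);
        return count < maxRetriesPerSite;
    }

    // The job only fails for good once every known site has used up
    // its per-site retry budget for that job.
    boolean jobPermanentlyFailed(String jobId, List<String> allSites) {
        for (String s : allSites) {
            if (siteEligible(jobId, s)) {
                return false;
            }
        }
        return true;
    }
}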

Ioan

Ben Clifford wrote:
> Bug 101 discusses a class of site-selection failures that looks like this:
>
> Two (or more) sites:
>   site G works
>   site F fails all jobs submitted to it, very rapidly.
>
> Submit 10 non-trivial jobs for scheduling. At present, the minimum number 
> of simultaneous jobs that will be sent to a site is 2. Two jobs go to site 
> G and occupy it (for, e.g., 20 minutes); two jobs go to site F and fail 
> (within, e.g., 10 seconds). Two more jobs go to site F and fail (within, 
> e.g., 10 seconds). All jobs apart from the two that went to site G are 
> repeatedly submitted to site F and fail, exhausting all their retries and 
> causing a workflow failure.
>
> One approach to stopping this is to slow down submission to poorly scoring 
> sites. However, in this case, the delay between submissions would need to 
> be on the scale of minutes to tens of minutes to avoid this failure mode.
>
> However, the delay needs to be on roughly the same scale as the length of 
> a job, which varies widely depending on usage (some people put through 
> half-hour jobs, others put through jobs that are a few seconds long). That 
> seems difficult to determine at startup.
>
> It seems undesirable to block a site from execution entirely based on poor 
> performance because much can change over the duration of a long run 
> (working sites break and non-working sites unbreak).
>
> Related to the need for job-execution-length information here is something 
> we've talked about in the past: jobs should be unselected/relaunched at a 
> different site if they take 'too long', where 'too long' is determined 
> perhaps by some statistical analysis of other jobs that have executed 
> successfully.
>
>   
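
For the delay-based approach Ben describes above (slowing down submission
to poorly scoring sites), one option would be to scale the delay by the
job runtimes actually observed during the run, rather than picking a
constant at startup.  A minimal sketch, again with hypothetical names
rather than the real scheduler interfaces:

// Illustrative runtime-scaled back-off for a badly scoring site;
// names are hypothetical, not Swift's real scheduler API.
class SiteBackoff {
    private double meanRuntimeMs = 0.0;  // running mean of successful job runtimes
    private long successes = 0;

    // Update the running mean each time a job finishes successfully anywhere.
    void recordSuccess(long runtimeMs) {
        successes++;
        meanRuntimeMs += (runtimeMs - meanRuntimeMs) / successes;
    }

    // Delay before the next submission to a site, growing with the number
    // of consecutive failures there and scaled by the observed mean job
    // runtime, so a site that fails in seconds cannot burn through retries
    // faster than a good site can finish real work.
    long submissionDelayMs(int consecutiveFailuresAtSite) {
        if (consecutiveFailuresAtSite == 0 || successes == 0) {
            return 0L;
        }
        int scale = Math.min(consecutiveFailuresAtSite, 10);  // cap the growth
        return (long) (scale * meanRuntimeMs);
    }
}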

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================




