[Swift-devel] excessive rate throttling for apparently temporally-restricted failures

Ioan Raicu iraicu at cs.uchicago.edu
Sun Oct 28 10:02:29 CDT 2007


This is the same thing that happened to the MolDyn workflow when we were 
hitting the "stale NFS handle" error: possibly thousands of jobs would 
fail within a minute (due to a single bad node), but when jobs started 
getting through again (10K+ more), the score remained low. 
I fixed this in Falkon by hiding some of the known errors from Swift 
and retrying the failed tasks when they were due to the stale NFS handle 
error. I think Mihael outlined in an email a while back how to disable 
the task submission throttling due to a bad score, assuming that you 
have a single site to submit to anyway. 
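
As a rough illustration of the "hide known errors and retry" idea, here 
is a minimal Python sketch. It is not Falkon's actual code: the names 
TaskFailed, TRANSIENT_MARKERS, is_transient and run_with_retry, the 
substring match on the error message, and the retry limit are all 
assumptions made for the example.

```python
# Hypothetical sketch; none of these names come from Falkon or Swift.

# Assumed: a transient error is recognised by a substring of its message.
TRANSIENT_MARKERS = ("stale nfs",)


class TaskFailed(Exception):
    """Raised by a task when it fails; the message describes the cause."""


def is_transient(message):
    """Return True if the message matches a known transient error."""
    lowered = message.lower()
    return any(marker in lowered for marker in TRANSIENT_MARKERS)


def run_with_retry(task, max_retries=3):
    """Run task(), retrying quietly on known transient errors.

    Unrecognised errors, or transient errors that persist beyond
    max_retries retries, are re-raised so the caller (for example a
    scheduler that keeps a site score) still sees them.
    """
    attempts = 0
    while True:
        try:
            return task()
        except TaskFailed as err:
            attempts += 1
            if not is_transient(str(err)) or attempts > max_retries:
                raise
```

With this shape, a burst of stale-handle failures is absorbed by the 
retry layer and never reaches whatever is keeping the score, while any 
other failure is still reported as before.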

A while back I argued that it might be worthwhile to augment the 
site score with a weighted score, in which each job completion or failure 
is multiplied by the time the job took to complete or fail. 
We could also change the ratio from -5:1 (failed : successful) to 
something more balanced (-1:1), so that neither successful nor 
failed jobs are favored.
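
To make the comparison concrete, here is a small Python sketch of the 
two scoring ideas. It is only an illustration under stated assumptions: 
the function names, the (succeeded, duration) event format and the plain 
summation are made up for the example and are not how the Swift 
scheduler actually computes its score. Only the -5:1 and -1:1 ratios 
come from the discussion above.

```python
# Hypothetical sketch; not the Swift scheduler's real scoring code.
# An event is a (succeeded, duration_seconds) pair.


def count_score(events, failure_weight=5, success_weight=1):
    """Each job counts once: -failure_weight or +success_weight."""
    return sum(success_weight if ok else -failure_weight
               for ok, _ in events)


def time_weighted_score(events, failure_weight=1, success_weight=1):
    """Each job counts in proportion to how long it ran."""
    return sum((success_weight if ok else -failure_weight) * duration
               for ok, duration in events)


# 1000 jobs that fail after 1 second each (a bad node), followed by
# 100 jobs that succeed after 60 seconds each.
events = [(False, 1)] * 1000 + [(True, 60)] * 100

print(count_score(events))          # -4900
print(time_weighted_score(events))  # 5000
```

In this made-up case the count-based score at -5:1 stays deeply 
negative after the burst, while the time-weighted score at -1:1 
recovers, because the fast failures carry little weight compared to the 
long successful runs.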

Ioan

Ben Clifford wrote:
> I've been running the same workflow a few times with a high level of 
> clustering. I've noticed that when there are no errors, the code will have 
> up to perhaps 40 jobs running on a site; but if there is a spike of errors 
> restricted in time to a minute or so, but damaging quite a large number of 
> jobs, then the scheduler score for that site gets hit so hard that it 
> never builds up to a reasonable value again and a very low rate is used 
> for the rest of the workflow.
>
> Alternatively, aborting the workflow when this happens resets the 
> scheduler score back to 0 for a fresh start and is likely to get a bunch 
> of work done. It seems undesirable that 'kill workflow and restart to 
> clear out the scheduler scores' is the correct action to take.
>
> I'm not particularly in a position to do rate limit / scheduler hacking at 
> the moment, but I did turn on scheduler score logging in the default log 
> config.
>
> If you're looking at job submission rates in the future, this may be 
> useful information to have.
>
>   

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================