[Swift-devel] excessive rate throttling for apparently temporally-restricted failures
Ioan Raicu
iraicu at cs.uchicago.edu
Sun Oct 28 10:02:29 CDT 2007
This is the same thing that happened to the MolDyn workflow when we
were hitting the "stale NFS handle" error: possibly thousands of jobs
would fail within a minute (due to a single bad node), and then, even
after jobs started getting through again (10K+ more), the score
remained low. I fixed this in Falkon by hiding some of the known errors
from Swift and retrying the failed tasks when they were due to the
stale NFS handle error. I think Mihael outlined in an email a while
back how to disable the task submission throttling due to a bad score,
assuming that you have a single site to submit to anyway.
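
To make the "hide known errors and retry" idea concrete, here is a
minimal sketch in Python. It is not Falkon's actual code: the names
TRANSIENT_ERRORS, is_transient and run_with_retries, the error strings
matched, and the retry limit are all assumptions for illustration only.

```python
# Hypothetical sketch of retrying known transient failures so that they
# never reach the scheduler's site score. Not Falkon's implementation.

# Assumed substrings that identify a transient, node-local failure.
TRANSIENT_ERRORS = ("stale nfs file handle", "stale nfs handle")


def is_transient(message):
    """Return True if the error message matches a known transient error."""
    msg = message.lower()
    return any(pattern in msg for pattern in TRANSIENT_ERRORS)


def run_with_retries(task, max_retries=3):
    """Call task(); retry it when it fails with a known transient error.

    Returns (ok, attempts, error). Only a failure that is not transient,
    or that survives max_retries retries, is reported to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return True, attempts, None
        except Exception as exc:
            if is_transient(str(exc)) and attempts <= max_retries:
                continue  # hide this failure and try again
            return False, attempts, str(exc)
```

With something like this in the execution layer, a burst of stale NFS
handle failures shows up to the layer above only as slower tasks, not
as failed ones, so the site score is not penalized for them.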
A while back I argued that it might be worthwhile to augment the site
score with a weighted score, in which each job completion or failure is
also multiplied by the time it took for the job to complete or fail.
Also, we could change the ratio from -5:1 (failed : successful) to
something more balanced (-1 : 1), where we are not favoring successful
or failed jobs.
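
A rough sketch of the weighting I have in mind, in Python. The simple
additive update rule, the function names and the example durations are
assumptions for illustration; this is not how the Swift scheduler
actually computes its score. Only the -5:1 and -1:1 ratios come from
the discussion above.

```python
# Hypothetical additive site score, used only to compare the weightings.

def update_score(score, succeeded, duration_s,
                 success_weight=1.0, failure_weight=1.0,
                 time_weighted=True):
    """Return the new site score after one job finishes.

    If time_weighted is True, the reward or penalty is multiplied by the
    time (in seconds) the job took to complete or fail.
    """
    factor = duration_s if time_weighted else 1.0
    if succeeded:
        return score + success_weight * factor
    return score - failure_weight * factor


def replay(jobs, **kwargs):
    """Apply update_score to a list of (succeeded, duration_s) pairs."""
    score = 0.0
    for succeeded, duration_s in jobs:
        score = update_score(score, succeeded, duration_s, **kwargs)
    return score


# A burst of 1000 jobs failing in 0.5 s each, then 100 jobs that each
# run successfully for 60 s (made-up numbers).
jobs = [(False, 0.5)] * 1000 + [(True, 60.0)] * 100

# Count-based score with the -5:1 ratio.
count_based = replay(jobs, failure_weight=5.0, time_weighted=False)

# Time-weighted score with the balanced -1:1 ratio.
time_based = replay(jobs)

print(count_based, time_based)  # -4900.0 5500.0
```

In this toy model the fast failures cost very little once weighted by
time, so a short burst from one bad node does not outweigh the much
larger amount of useful work done afterwards.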
Ioan
Ben Clifford wrote:
> I've been running the same workflow a few times with a high level of
> clustering. I've noticed that when there are no errors, the code will have
> up to perhaps 40 jobs running on a site; but if there is a spike of errors
> restricted in time to a minute or so, but damaging quite a large number of
> jobs, then the scheduler score for that site gets hit so hard that it
> never builds up to a reasonable value again and a very low rate is used
> for the rest of the workflow.
>
> Alternatively, aborting the workflow when this happens resets the
> scheduler score back to 0 for a fresh start and is likely to get a bunch
> of work done. It seems undesirable that 'kill workflow and restart to
> clear out the scheduler scores' is the correct action to take.
>
> I'm not particularly in a position to do rate limit / scheduler hacking at
> the moment, but I did turn on scheduler score logging in the default log
> config.
>
> If you're looking at job submission rates in the future, this may be
> useful information to have.
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================