[Swift-devel] excessive rate throttling for apparently temporally-restricted failures
Ben Clifford
benc at hawaga.org.uk
Sun Oct 28 05:20:05 CDT 2007
I've been running the same workflow a few times with a high level of
clustering. I've noticed that when there are no errors, Swift will have
up to perhaps 40 jobs running on a site. But if there is a spike of errors
lasting only a minute or so, yet damaging quite a large number of jobs,
the scheduler score for that site gets hit so hard that it never builds
back up to a reasonable value, and a very low submission rate is used for
the rest of the workflow.
By contrast, aborting the workflow when this happens resets the
scheduler score to 0 for a fresh start, and the restarted run is likely
to get a bunch of work done. It seems undesirable that 'kill the workflow
and restart it to clear out the scheduler scores' is the correct action
to take.
I'm not particularly in a position to do rate limit / scheduler hacking at
the moment, but I did turn on scheduler score logging in the default log
config.
If you're looking at job submission rates in future, this may be useful
information to have.