[Swift-devel] excessive rate throttling for apparently temporally-restricted failures

Ben Clifford benc at hawaga.org.uk
Sun Oct 28 05:20:05 CDT 2007


I've been running the same workflow a few times with a high level of 
clustering. When there are no errors, the code will have up to perhaps 40 
jobs running on a site. But if there is a spike of errors that lasts only 
a minute or so yet damages quite a large number of jobs, the scheduler 
score for that site gets hit so hard that it never builds back up to a 
reasonable value, and a very low rate is used for the rest of the 
workflow.

Alternatively, aborting the workflow when this happens resets the 
scheduler score back to 0 for a fresh start and is likely to get a bunch 
of work done. It seems undesirable that 'kill workflow and restart to 
clear out the scheduler scores' is the correct action to take.
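
To make the failure mode concrete, here is a toy sketch in Python. It is 
not the Swift scheduler code: the per-job gain and penalty, the score cap, 
and the mapping from score to job limit are all made-up assumptions, 
chosen only so that a healthy site settles at about 40 jobs. It compares 
a run that carries on after a short burst of failures with one where the 
score is reset to 0 just after the burst, as an abort and restart would do.

```python
# Toy model only: the constants and the score-to-limit mapping below are
# assumptions made for illustration, not taken from the Swift scheduler.
SUCCESS_GAIN = 0.1      # hypothetical score increase per successful job
FAILURE_PENALTY = 1.0   # hypothetical score decrease per failed job
MAX_SCORE = 20.0        # hypothetical upper bound on the score
MAX_JOBS = 40           # roughly the "up to perhaps 40 jobs" seen in practice


def job_limit(score):
    """Map a site score to a concurrent-job limit (hypothetical mapping)."""
    return max(1, min(MAX_JOBS, int(2 + 2 * score)))


def simulate(steps, failure_window, reset_at=None):
    """Return a list of (step, jobs_run, score_after_step) tuples.

    Jobs run during failure_window = (start, end) all fail; every other job
    succeeds. If reset_at is given, the score is set back to 0 at that step,
    mimicking an abort and restart of the workflow.
    """
    score = 0.0
    history = []
    for t in range(steps):
        if t == reset_at:
            score = 0.0
        jobs = job_limit(score)
        failing = failure_window[0] <= t < failure_window[1]
        for _ in range(jobs):
            delta = -FAILURE_PENALTY if failing else SUCCESS_GAIN
            score = min(MAX_SCORE, score + delta)
        history.append((t, jobs, round(score, 2)))
    return history


if __name__ == "__main__":
    stuck = simulate(100, failure_window=(40, 42))
    restarted = simulate(100, failure_window=(40, 42), reset_at=42)
    for t in (39, 41, 60, 99):
        print(t, "jobs without restart:", stuck[t][1],
              "jobs with restart:", restarted[t][1])
```

In this toy, the trouble is the feedback loop: a low score allows only a 
few jobs, so there are only a few successes available to rebuild the 
score. Whether the real scheduler gets stuck for the same reason is an 
assumption that the score logging should help confirm or rule out.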

I'm not particularly in a position to do rate limit / scheduler hacking at 
the moment, but I did turn on scheduler score logging in the default log 
config.

If you're looking at job submission rates in future, this may be useful 
information to have.
