[Swift-devel] scheduler changes to deal with fast-failing sites
Ben Clifford
benc at hawaga.org.uk
Wed Jun 25 08:46:10 CDT 2008
I played around a bit with how the scheduler in Swift (actually in
Karajan) deals with low scoring sites.
Previously a site would always take at least 2 jobs at once, even in the
case of a very poorly scoring site. This causes bug 101, where a site that
rapidly fails jobs can eat up all the retries for all the jobs in a
SwiftScript program, and cause the SwiftScript program as a whole to fail.
The attached patch changes that behaviour.
The scoring of well-performing sites is basically the same. Instead of a
base of 2 jobs, with more being added according to tscore * jobThrottle,
a base of 1 job is now used. This should not cause much change in
behaviour for well-performing sites.
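To illustrate the arithmetic only (this is a rough sketch, not code from the
patch): the function name allowed_jobs and its parameters are hypothetical,
and tscore is taken as a given input, since how it is computed is not
described here.

```python
def allowed_jobs(tscore, job_throttle, base=1):
    """Hypothetical helper: a fixed base plus tscore * jobThrottle."""
    return base + tscore * job_throttle


# For a well-performing site the two bases differ by only one job.
old_limit = allowed_jobs(10.0, 2.0, base=2)  # previous behaviour, base of 2
new_limit = allowed_jobs(10.0, 2.0, base=1)  # patched behaviour, base of 1

# With tscore at zero the old base still let 2 jobs through at once.
old_floor = allowed_jobs(0.0, 2.0, base=2)
new_floor = allowed_jobs(0.0, 2.0, base=1)
```

The difference only matters at the low end, where the old base of 2 kept
feeding jobs to a site regardless of how badly it scored.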
However, the score can now go below 1 for poorly performing sites. In that
case, a delay is enforced between submissions to a particular site. The
length of that delay increases exponentially as the site score decreases.
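One way such a backoff could look is sketched below. This is an assumption
for illustration, not the formula used in the patch: the function name
submission_delay, the base delay, the growth factor of 2 and the cap are
all made up.

```python
def submission_delay(score, base_delay=1.0, growth=2.0, max_delay=3600.0):
    """Seconds to wait between submissions to one site.

    Assumed behaviour (not taken from the patch): no delay while the
    score is at least 1; below that, the delay grows exponentially as
    the score drops, up to a cap.
    """
    if score >= 1.0:
        return 0.0
    # Each full point the score falls below 1 multiplies the delay by `growth`.
    return min(max_delay, base_delay * growth ** (1.0 - score))


# Delays for a site whose score keeps dropping.
delays = [submission_delay(s) for s in (1.0, 0.0, -1.0, -2.0)]
```

The cap is there only so that a very low score does not produce an
effectively infinite wait in the sketch.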
A few quick tests with provider-wonky, running locally on my laptop,
suggest that this does a fairly good job of rapidly eliminating poorly
performing sites.
I'd be interested if anyone (especially Xi) tries this in a real life
multi-site situation.
In combination with replication to deal with slow-fail sites, I hope that
this makes multi-site usage of Swift work much better.
The patch, which applies against cog r2056, is at
http://www.ci.uchicago.edu/~benc/backoff-less-than-zero-1.patch