[Swift-devel] scheduler changes to deal with fast-failing sites

Wed Jun 25 08:46:10 CDT 2008

I played around a bit with how the scheduler in Swift (actually in 
Karajan) deals with low scoring sites.

Previously a site would always take at least 2 jobs at once, even in the 
case of a very poorly scoring site. This causes bug 101, where a site that 
rapidly fails jobs can eat up all the retries for all the jobs in a 
SwiftScript program and cause the SwiftScript program as a whole to fail.

The attached patch changes that behaviour.

The scoring of well-performing sites is basically the same. Instead of a 
base of 2 jobs, with more being added according to tscore * jobThrottle, 
a base of 1 job is now used. This should not cause much change in 
behaviour for well-performing sites.
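
As a rough illustration of the change in base, here is a small Python 
sketch. It assumes the job limit is simply the base plus 
tscore * jobThrottle; the function name and the sample values are made up 
for this mail and are not taken from the Karajan source.

```python
def job_limit(tscore, job_throttle, base=1):
    """Hypothetical per-site job limit: a fixed base plus a score-driven term."""
    return base + tscore * job_throttle


# Sample values are illustrative only, not real Swift settings.
old_limit = job_limit(tscore=5.0, job_throttle=4.0, base=2)  # previous base of 2
new_limit = job_limit(tscore=5.0, job_throttle=4.0, base=1)  # patched base of 1
print(old_limit, new_limit)
```

Under that assumption the two limits differ only by the constant base, so 
a site with a large tscore sees almost no difference.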

However, the score can now go below 1 for poorly performing sites. In that 
case, a delay is enforced between submissions to a particular site. The 
length of that delay increases exponentially as the site score decreases.
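
To show the shape of that backoff, here is a minimal Python sketch. The 
real formula and constants are in the patch; the doubling rule, the 
base_delay parameter and the function name below are assumptions chosen 
only to illustrate a delay that grows exponentially as the score falls.

```python
def submit_delay(score, base_delay=1.0):
    """Hypothetical delay (seconds) before the next submission to a site."""
    if score >= 1.0:
        # Well-performing site: no enforced delay between submissions.
        return 0.0
    # Assumed form: the delay doubles for each unit the score drops below 1.
    return base_delay * 2.0 ** (1.0 - score)


# Illustrative scores only; print the delay for each.
for score in (2.0, 0.0, -1.0, -3.0):
    print(score, submit_delay(score))
```

With a rule of this kind a fast-failing site quickly ends up waiting much 
longer between submissions than it takes to fail a job, so it stops 
consuming retries.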

A few quick tests with provider-wonky, running locally on my laptop, 
suggest that this does a fairly good job of rapidly eliminating poorly 
performing sites.

I'd be interested to hear from anyone (especially Xi) who tries this in a 
real-life multi-site situation.

In combination with replication to deal with slow-fail sites, I hope that 
this makes multi-site usage of Swift work much better.

The patch, which applies against cog r2056, is at
http://www.ci.uchicago.edu/~benc/backoff-less-than-zero-1.patch

--