[Swift-devel] fast-failing jobs
Ben Clifford
benc at hawaga.org.uk
Fri Apr 11 08:43:28 CDT 2008
bug 101 discusses a class of site-selection failures that look like this:
two (or more) sites:
site G works
site F fails all jobs submitted to it, very rapidly.
Submit 10 non-trivial jobs for scheduling. At present, the minimum number
of simultaneous jobs that will be sent to a site is 2. Two jobs go to site
G and occupy it (for, say, 20 minutes); two jobs go to site F and fail
(within, say, 10 seconds). Two more jobs go to site F and fail, again
within about 10 seconds. All jobs apart from the two that went to site G
are repeatedly submitted to site F and fail, exhausting their retries and
causing a workflow failure.
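As a rough back-of-the-envelope illustration (a toy Python model with
made-up numbers for parallelism, retry count and job length, not Swift's
actual scheduler parameters), a job that keeps landing on site F burns
through all of its retries long before a slot on site G frees up:

    # Toy model of the bug 101 failure mode: two sites, one of which fails
    # every job almost instantly. All constants below are illustrative
    # assumptions, not real Swift settings.
    JOBS = 10            # non-trivial jobs submitted
    RETRIES = 3          # retries allowed per job
    SLOTS_PER_SITE = 2   # minimum simultaneous jobs sent to a site

    GOOD_JOB_TIME = 20 * 60   # site G holds a job for ~20 minutes
    FAIL_TIME = 10            # site F fails a job within ~10 seconds

    # Site G is occupied by its first two jobs for the whole run, so every
    # other job cycles through site F, losing one retry every ~10 seconds.
    jobs_stuck_on_F = JOBS - SLOTS_PER_SITE
    time_to_exhaust_retries = (RETRIES + 1) * FAIL_TIME
    print("%d jobs exhaust retries after ~%ds each; first slot on G frees after %ds"
          % (jobs_stuck_on_F, time_to_exhaust_retries, GOOD_JOB_TIME))
    # -> 8 jobs exhaust retries after ~40s each; first slot on G frees after 1200s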
One approach to stopping this is to slow down submission to poorly scoring
sites. However, in the case above the delay between submissions would need
to be on the scale of minutes to tens of minutes to avoid exhausting the
retries.
That delay needs to be on roughly the same scale as the length of a job,
which varies widely with usage (some people put through half-hour jobs,
others put through jobs that are a few seconds long), and that seems
difficult to determine at startup.
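A minimal sketch of that idea (hypothetical names and formula, assuming
the scheduler keeps a per-site score near 1.0 for healthy sites and a
running estimate of successful job duration; this is not Swift's actual
scoring code):

    # Hypothetical throttle: the wait before the next submission to a site
    # grows as its score drops, scaled by the observed job duration so the
    # same policy fits 10-second jobs and half-hour jobs.
    def submission_delay(site_score, observed_job_seconds):
        """Seconds to wait before sending another job to this site."""
        if site_score >= 1.0:
            return 0.0
        penalty = (1.0 - site_score) * observed_job_seconds
        # Cap the delay at a couple of job lengths so a site that recovers
        # is probed again within roughly one or two jobs' time.
        return min(penalty, 2.0 * observed_job_seconds)

    print(submission_delay(0.1, 1800))  # badly scoring site, half-hour jobs -> 1620.0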
It seems undesirable to block a site from execution entirely based on poor
performance because much can change over the duration of a long run
(working sites break and non-working sites unbreak).
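One way to get that behaviour without writing a site off permanently,
sketched here as an assumption rather than a proposal for the actual
scoring code, is to let a bad score decay back toward a neutral value over
time so the site is eventually probed again:

    # Exponential decay of a site's score toward neutral, so a site that
    # failed everything an hour ago gets retried later in the run.
    # Half-life and neutral value are illustrative assumptions.
    NEUTRAL_SCORE = 1.0

    def decayed_score(score, seconds_since_last_result, half_life=1800.0):
        """Move the score halfway back to NEUTRAL_SCORE every half_life seconds."""
        weight = 0.5 ** (seconds_since_last_result / half_life)
        return NEUTRAL_SCORE + (score - NEUTRAL_SCORE) * weight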
Related to the need for job execution length information here is something
we've talked about in the past: unselecting a job and relaunching it at a
different site if it takes 'too long', where 'too long' is determined,
perhaps, by some statistical analysis of other jobs that have executed
successfully.
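For instance (a hedged sketch, not an agreed policy; the threshold rule
below is an assumption), 'too long' could be defined relative to the
median and spread of walltimes already seen for successful jobs:

    import statistics

    def too_long_threshold(successful_durations, factor=3.0):
        """Walltime in seconds beyond which a running job becomes a candidate
        for relaunching at another site; None until there is enough data."""
        if len(successful_durations) < 5:
            return None
        med = statistics.median(successful_durations)
        spread = statistics.median(abs(d - med) for d in successful_durations)
        return med + factor * max(spread, 1.0)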