[Swift-devel] fast-failing jobs
Ben Clifford
benc at hawaga.org.uk
Fri Apr 11 08:43:28 CDT 2008
bug 101 discusses a class of site-selection failures that look like this:
two (or more) sites:
site G works
site F fails all jobs submitted to it, very rapidly.
Submit 10 non-trivial jobs for scheduling. At present, the minimum number
of simultaneous jobs that will be sent to a site is 2. Two jobs go to site
G and occupy it (for, say, 20 minutes); two jobs go to site F and fail
(within, say, 10 seconds). Two more jobs go to site F and fail, again
within about 10 seconds. All jobs apart from the two that went to site G
are repeatedly submitted to site F and fail, exhausting their retries and
causing a workflow failure.
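As a rough back-of-the-envelope illustration (a toy Python model with
made-up numbers for parallelism, retry count and job length, not Swift's
actual scheduler parameters), a job that keeps landing on site F burns
through all of its retries long before a slot on site G frees up:

    # Toy model of the bug 101 failure mode: two sites, one of which fails
    # every job almost instantly. All constants below are illustrative
    # assumptions, not real Swift settings.
    JOBS = 10            # non-trivial jobs submitted
    RETRIES = 3          # retries allowed per job
    SLOTS_PER_SITE = 2   # minimum simultaneous jobs sent to a site

    GOOD_JOB_TIME = 20 * 60   # site G holds a job for ~20 minutes
    FAIL_TIME = 10            # site F fails a job within ~10 seconds

    # Site G is occupied by its first two jobs for the whole run, so every
    # other job cycles through site F, losing one retry every ~10 seconds.
    jobs_stuck_on_F = JOBS - SLOTS_PER_SITE
    time_to_exhaust_retries = (RETRIES + 1) * FAIL_TIME
    print("%d jobs exhaust retries after ~%ds each; first slot on G frees after %ds"
          % (jobs_stuck_on_F, time_to_exhaust_retries, GOOD_JOB_TIME))
    # -> 8 jobs exhaust retries after ~40s each; first slot on G frees after 1200s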
One approach to stopping this is to slow down submission to poorly scoring
sites. However, in the case above the delay between submissions would need
to be on the scale of minutes to tens of minutes to avoid exhausting the
retries.
That delay needs to be on roughly the same scale as the length of a job,
which varies widely with usage (some people put through half-hour jobs,
others put through jobs that are a few seconds long), and that seems
difficult to determine at startup.
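A minimal sketch of that idea (hypothetical names and formula, assuming
the scheduler keeps a per-site score near 1.0 for healthy sites and a
running estimate of successful job duration; this is not Swift's actual
scoring code):

    # Hypothetical throttle: the wait before the next submission to a site
    # grows as its score drops, scaled by the observed job duration so the
    # same policy fits 10-second jobs and half-hour jobs.
    def submission_delay(site_score, observed_job_seconds):
        """Seconds to wait before sending another job to this site."""
        if site_score >= 1.0:
            return 0.0
        penalty = (1.0 - site_score) * observed_job_seconds
        # Cap the delay at a couple of job lengths so a site that recovers
        # is probed again within roughly one or two jobs' time.
        return min(penalty, 2.0 * observed_job_seconds)

    print(submission_delay(0.1, 1800))  # badly scoring site, half-hour jobs -> 1620.0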
It seems undesirable to block a site from execution entirely based on poor
performance because much can change over the duration of a long run
(working sites break and non-working sites unbreak).
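One way to get that behaviour without writing a site off permanently,
sketched here as an assumption rather than a proposal for the actual
scoring code, is to let a bad score decay back toward a neutral value over
time so the site is eventually probed again:

    # Exponential decay of a site's score toward neutral, so a site that
    # failed everything an hour ago gets retried later in the run.
    # Half-life and neutral value are illustrative assumptions.
    NEUTRAL_SCORE = 1.0

    def decayed_score(score, seconds_since_last_result, half_life=1800.0):
        """Move the score halfway back to NEUTRAL_SCORE every half_life seconds."""
        weight = 0.5 ** (seconds_since_last_result / half_life)
        return NEUTRAL_SCORE + (score - NEUTRAL_SCORE) * weight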
Related to the need for job execution length information here is something
we've talked about in the past: unselecting a job and relaunching it at a
different site if it takes 'too long', where 'too long' is determined,
perhaps, by some statistical analysis of other jobs that have executed
successfully.
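For instance (a hedged sketch, not an agreed policy; the threshold rule
below is an assumption), 'too long' could be defined relative to the
median and spread of walltimes already seen for successful jobs:

    import statistics

    def too_long_threshold(successful_durations, factor=3.0):
        """Walltime in seconds beyond which a running job becomes a candidate
        for relaunching at another site; None until there is enough data."""
        if len(successful_durations) < 5:
            return None
        med = statistics.median(successful_durations)
        spread = statistics.median(abs(d - med) for d in successful_durations)
        return med + factor * max(spread, 1.0)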