[Swift-devel] fast-failing jobs
Mihael Hategan
hategan at mcs.anl.gov
Sat Apr 12 13:08:43 CDT 2008
On Fri, 2008-04-11 at 13:43 +0000, Ben Clifford wrote:
> bug 101 discusses a class of site-selection failures that look like this:
>
> two (or more) sites:
> site G works
> site F fails all jobs submitted to it, very rapidly.
>
> Submit 10 non-trivial jobs for scheduling. At present, the minimum number
> of simultaneous jobs that will be sent to a site is 2. Two jobs go to site
> G and occupy it (for e.g. 20 minutes); two jobs go to site F and fail
> (within e.g. 10 seconds). Two more jobs go to site F and fail (within e.g.
> 10 seconds). All jobs apart from the two that went to site G are
> repeatedly submitted to site F and fail, exhausting all their retries and
> causing a workflow failure.
>
> One approach to stopping this is to slow down submission to poorly scoring
> sites. However, in this case, the delay between submissions would need to
> be on the scale of minutes to tens of minutes to avoid this.
>
> However, the delay needs to be on roughly the same scale as the length of
> a job, which varies widely depending on usage (some people are putting
> through half hour jobs, some people put through jobs that are a few
> seconds long).
That's pretty much what a low score does if there's throttling based on
score. Perhaps our solution is to have a low job throttle and a higher
score range (i.e. T=1000 instead of 100).
Alternatively, we could enforce a submission rate (jobs/s) based on score.
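A rough sketch of what a score-based submission rate might look like, in
Python rather than in the actual scheduler code. Everything here is an
assumption made for illustration: the names allowed_rate and
SiteRateLimiter, the exponential mapping from score to rate, and the
default limits (a floor of one job per 10 minutes, to match the "minutes to
tens of minutes" scale mentioned above) are not taken from Swift.

```python
import math


def allowed_rate(score, base_rate=1.0, min_rate=1.0 / 600, max_rate=10.0):
    """Jobs per second allowed for a site with the given score.

    Hypothetical mapping: base_rate applies at score 0, the rate grows
    exponentially with the score, and it is clamped to [min_rate, max_rate].
    """
    rate = base_rate * math.exp(score)
    return min(max_rate, max(min_rate, rate))


class SiteRateLimiter:
    """Tracks, per site, the earliest time the next job may be submitted."""

    def __init__(self):
        self.next_allowed = {}  # site name -> time of next permitted submission

    def try_submit(self, site, score, now):
        """Return True and reserve a slot if the site may take a job at `now`."""
        if now < self.next_allowed.get(site, 0.0):
            return False
        # Space submissions 1/rate seconds apart, using the current score.
        self.next_allowed[site] = now + 1.0 / allowed_rate(score)
        return True
```

With these made-up defaults, a site whose score has dropped to -10 would be
offered at most one job every 10 minutes, regardless of how quickly its jobs
fail, while a site at score 0 would still be offered one job per second. The
floor would still have to be chosen relative to typical job length, which is
the hard part raised above.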
> That seems difficult to determine at startup.
That, again, is in the nature of things. A good score approximation is
difficult to determine at startup.