[Swift-devel] fast-failing jobs

Mihael Hategan hategan at mcs.anl.gov
Sat Apr 12 13:08:43 CDT 2008


On Fri, 2008-04-11 at 13:43 +0000, Ben Clifford wrote:
> bug 101 discusses a class of site-selection failures that look like this:
> 
> two (or more) sites:
>   site G works
>   site F fails all jobs submitted to it, very rapidly.
> 
> Submit 10 non-trivial jobs for scheduling. At present, the minimum number 
> of simultaneous jobs that will be sent to a site is 2. Two jobs go to site 
> G and occupy it (for, e.g., 20 minutes); two jobs go to site F and fail 
> (within, e.g., 10 seconds). Two more jobs go to site F and fail (again 
> within about 10 seconds). All jobs apart from the two that went to site G 
> are repeatedly submitted to site F and fail, exhausting all their retries 
> and causing a workflow failure.
> 
> One approach to stopping this is to slow down submission to poorly scoring 
> sites. However, in this case, the delay between submissions would need to 
> be on the scale of minutes to tens of minutes to avoid this.
> 
> More generally, the delay needs to be on roughly the same scale as the 
> length of a job, which varies widely depending on usage (some people are 
> putting through half-hour jobs, some people put through jobs that are a 
> few seconds long).
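
To put rough numbers on the scenario quoted above, here is a small sketch.
The figure of 3 attempts per job is an assumption (the bug report does not
give the retry count), and the function name is made up for illustration;
the other numbers come from the example.

```python
import math

def seconds_to_exhaust_retries(waiting_jobs, attempts_per_job, slots, fail_seconds):
    """Time for a fast-failing site to burn every attempt of every waiting job,
    assuming the site is always kept full and each attempt fails after
    fail_seconds."""
    total_attempts = waiting_jobs * attempts_per_job
    rounds = math.ceil(total_attempts / slots)  # 'slots' attempts fail per round
    return rounds * fail_seconds

# 10 jobs, 2 of them running on site G, so 8 are left for site F.
# attempts_per_job=3 is an assumed value, not taken from the bug report.
burn_time = seconds_to_exhaust_retries(waiting_jobs=8, attempts_per_job=3,
                                       slots=2, fail_seconds=10)
good_job_seconds = 20 * 60  # the jobs on site G run for about 20 minutes

print(burn_time, good_job_seconds)
```

Under these assumptions site F uses up all the retries in a couple of
minutes, well before site G frees a slot, which is why the delay has to be
on the scale of the job length.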

Slowing down submission is pretty much what a low score already does if
there's throttling based on score. Perhaps our solution is to have a low
job throttle and a higher score range (i.e. T=1000 instead of 100).

Alternatively, we could enforce a submission rate (jobs/s) based on score.
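
A minimal sketch of what a score-based submission rate might look like.
This is not the current scheduler code: the function name, the doubling
rule and the default values are all assumptions chosen for illustration.
The 1200 s cap corresponds to the 20-minute job in the example.

```python
def submission_interval(score, base_interval=1.0, max_interval=1200.0):
    """Seconds to wait between submissions to a site.

    Hypothetical mapping: sites with a non-negative score are submitted to
    at the base rate; for each point the score drops below zero the wait
    doubles, up to max_interval.
    """
    if score >= 0:
        return base_interval
    exponent = min(-score, 64)  # cap the exponent to avoid float overflow
    return min(max_interval, base_interval * 2.0 ** exponent)

for s in (2, 0, -3, -20):
    print(s, submission_interval(s))
```

The submission rate in jobs/s is then just 1 / submission_interval(score).
With a mapping like this a site that keeps failing backs off quickly to
one submission per max_interval, but picking max_interval still requires
knowing the job length.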

> That seems difficult to determine at startup.

That, again, is in the nature of things. A good score approximation is
difficult to determine at startup.





