[Swift-devel] runaway jobs

Sat Feb 7 08:36:53 CST 2009

On Sat, 2009-02-07 at 11:22 +0000, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Mihael Hategan wrote:
> 
> > I committed a bunch of stuff to deal with that. The idea is to kill a
> > job if it's over 2*walltime and allow swift to re-schedule it.
> 
> I think this will interact poorly with clustering, due to the very 
> inaccurate times at which clustered jobs go into Active and Completed 
> states. Many clustered jobs will exceed their wall time in large clusters 
> (for example, clusters that contain more than 2 jobs and where the 
> maxwalltime is a tight bound).
> 
> A job with walltime w and actual runtime (w-e) is clustered with 3 similar 
> tasks, giving a cluster that will run with actual time 4(w-e) ~= 4w; so then 
> all four of the clustered jobs will be presented to the replication 
> manager layer as running for about 4w (> 2w).
> 
> As to what actually happens when you try to cancel a clustered task at the 
> moment, I'm unsure - perhaps it does nothing, in which case the runaway job 
> handling happens to have no adverse effects.
> 
> It should be relatively straightforward to disable this mechanism when 
> clustering is enabled; so that you can use either this or clusters but not 
> both.
> 
> But this and replication would be nice to use with clustering.

It can be updated so that it is enabled only when clustering is off, or
when clustering is on and the job in question is itself a cluster. That
should fix it.
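
A minimal sketch of that rule, written in Python for illustration only and not taken from the Swift sources. The names Job, should_kill and clustering_enabled are hypothetical, and the sketch assumes that a cluster's walltime is the sum of its members' walltimes, which the thread does not state.

```python
from dataclasses import dataclass


@dataclass
class Job:
    walltime: float          # declared maximum walltime, in seconds
    observed_runtime: float  # runtime as seen by the layer that kills jobs
    is_cluster: bool = False # True if this job is itself a cluster


def should_kill(job, clustering_enabled, factor=2.0):
    """Return True if the job counts as a runaway and should be killed.

    The check applies only when clustering is off, or when clustering is
    on and the job is itself a cluster (hypothetical guard, per the thread).
    """
    if clustering_enabled and not job.is_cluster:
        return False
    return job.observed_runtime > factor * job.walltime


# Example from the thread: four jobs, each with walltime w and runtime w - e.
w, e = 100.0, 5.0
cluster_runtime = 4 * (w - e)  # every member appears to run this long

member = Job(walltime=w, observed_runtime=cluster_runtime)
# Assumption: the cluster's walltime is the sum of its members' walltimes.
cluster = Job(walltime=4 * w, observed_runtime=cluster_runtime, is_cluster=True)

print(should_kill(member, clustering_enabled=False))  # unguarded 2*walltime check
print(should_kill(member, clustering_enabled=True))   # guarded: member is skipped
print(should_kill(cluster, clustering_enabled=True))  # cluster is within 2x its walltime
```

With the guard in place, a well-behaved member of a cluster is never judged against its own walltime, while the cluster as a whole can still be killed if it runs past twice its combined walltime.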