[Swift-devel] runaway jobs

Sat Feb 7 05:22:54 CST 2009

On Fri, 6 Feb 2009, Mihael Hategan wrote:

> I committed a bunch of stuff to deal with that. The idea is to kill a
> job if it's over 2*walltime and allow swift to re-schedule it.

I think this will interact poorly with clustering, because the times at
which clustered jobs are reported as entering the Active and Completed
states are very inaccurate. In large clusters, many clustered jobs will
appear to exceed 2*walltime (for example, in clusters that contain more
than 2 jobs and where the maxwalltime is a tight bound).

A job with walltime w and actual runtime (w-e) is clustered with 3 similar
tasks, giving a cluster whose actual runtime is 4(w-e) ~= 4w for small e;
so all four of the clustered jobs will be presented to the replication
manager layer as running for about 4w (> 2w).
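
The arithmetic above can be checked with a small sketch. This is only an
illustration: the function names are made up, and it assumes the jobs in a
cluster run sequentially and that each member is reported as running for
the cluster's whole duration. It is not Swift or karajan code.

```python
def cluster_runtime(n_jobs, walltime, slack):
    """Total runtime of a cluster of n_jobs run one after another,
    each taking (walltime - slack)."""
    return n_jobs * (walltime - slack)


def would_be_killed(n_jobs, walltime, slack, factor=2.0):
    """True if a member job, reported as running for the whole cluster's
    duration, exceeds factor * walltime (the kill threshold)."""
    return cluster_runtime(n_jobs, walltime, slack) > factor * walltime


w = 600  # per-job walltime in seconds (illustrative value)
e = 10   # slack between walltime and actual runtime (illustrative value)
for n in (1, 2, 3, 4):
    print(n, cluster_runtime(n, w, e), would_be_killed(n, w, e))
```

With a tight bound (small e), the threshold is crossed as soon as the
cluster holds more than 2 jobs.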

I'm unsure what actually happens at the moment when you try to cancel a
clustered task. Perhaps it does nothing, in which case the runaway job
handling would happen to have no adverse effects.

It should be relatively straightforward to disable this mechanism when
clustering is enabled, so that you can use either this or clusters but not
both.

But it would be nice to be able to use this and replication together with
clustering.

For that to happen, perhaps there needs to be some better communication
between the clustering code and the replication code. For example, it
could be that clusters as a whole are subject to walltime control, with
walltime control on the individual clustered jobs suppressed; and likewise
for replication.
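
A minimal sketch of that idea, assuming a cluster's walltime is the sum of
its members' walltimes. The Job and Cluster classes and kill_deadline are
hypothetical names for illustration only; they are not the Swift or
karajan API.

```python
class Job:
    """Hypothetical stand-in for a single job with a declared walltime."""

    def __init__(self, name, walltime):
        self.name = name
        self.walltime = walltime
        self.clustered = False  # set to True once the job joins a cluster


class Cluster:
    """Hypothetical stand-in for a cluster of jobs run one after another."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        for job in self.jobs:
            job.clustered = True

    @property
    def walltime(self):
        # Assumption: the cluster's walltime is the sum of its members'.
        return sum(job.walltime for job in self.jobs)


def kill_deadline(unit, factor=2.0):
    """Elapsed time after which unit should be killed, or None when
    walltime control is suppressed for it (a job inside a cluster)."""
    if isinstance(unit, Job) and unit.clustered:
        return None
    return factor * unit.walltime


jobs = [Job("j%d" % i, 600) for i in range(4)]
cluster = Cluster(jobs)
print(kill_deadline(cluster))           # control applies to the cluster
print(kill_deadline(jobs[0]))           # suppressed for a clustered job
print(kill_deadline(Job("solo", 600)))  # unclustered job is still controlled
```

The same switch could, in principle, decide whether replication is applied
to the cluster or to its individual jobs.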

The replication stuff works mostly at the karajan Task level so that might 
not be an excessively arduous task.

--