[Swift-devel] runaway jobs
Mihael Hategan
hategan at mcs.anl.gov
Sat Feb 7 08:36:53 CST 2009
On Sat, 2009-02-07 at 11:22 +0000, Ben Clifford wrote:
> On Fri, 6 Feb 2009, Mihael Hategan wrote:
>
> > I committed a bunch of stuff to deal with that. The idea is to kill a
> > job if it's over 2*walltime and allow swift to re-schedule it.
>
> I think this will interact poorly with clustering, due to the very
> inaccurate times at which clustered jobs go into Active and Completed
> states. Many clustered jobs will exceed their wall time in large clusters
> (for example, clusters that contain more than 2 jobs and where the
> maxwalltime is a tight bound).
>
> A job with walltime w and actual runtime (w-e) is clustered with 3 similar
> tasks, giving a cluster that will run with actual time 4(w-e) ~= 4w; so then
> all four of the clustered jobs will be presented to the replication
> manager layer as having run for about 4w (> 2w).
>
> As to what actually happens when you try to cancel a clustered task at the
> moment, I'm unsure - perhaps it does nothing, in which case the runaway job
> handling happens to have no adverse effects.
>
> It should be relatively straightforward to disable this mechanism when
> clustering is enabled; so that you can use either this or clusters but not
> both.
>
> But this and replication would be nice to use with clustering.
It can be updated so that it is enabled only when clustering is off, or
when clustering is on and the job in question is itself a cluster. That
should fix it.
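
A rough sketch of that condition, to make the intent concrete. This is not the committed Swift code; the function names, the parameters, and the assumption that a cluster's walltime is the sum of its members' walltimes are all made up for illustration.

```python
def observed_runtime_in_cluster(walltime, slack, cluster_size):
    """Time a member job appears to run when its Active/Completed
    states are only reported for the cluster as a whole (assumption)."""
    return cluster_size * (walltime - slack)


def is_runaway(elapsed, walltime, factor=2.0):
    """The kill rule from the thread: over factor * walltime."""
    return elapsed > factor * walltime


def should_enforce_walltime_kill(clustering_enabled, job_is_cluster):
    """Proposed guard: enforce only when clustering is off, or when
    the job being checked is the cluster itself."""
    return (not clustering_enabled) or job_is_cluster


# Ben's example: four jobs, walltime w = 600s, each finishing 10s early.
w, e, n = 600, 10, 4
elapsed = observed_runtime_in_cluster(w, e, n)

# Without the guard, each member looks like a runaway job.
member_flagged = is_runaway(elapsed, w)

# With the guard, members are skipped and only the cluster is checked,
# against a hypothetical cluster walltime of n * w.
member_checked = should_enforce_walltime_kill(True, False)
cluster_flagged = (should_enforce_walltime_kill(True, True)
                   and is_runaway(elapsed, n * w))
```

With these numbers a member appears to run for 2360s against a 1200s limit, so it would be killed without the guard, while the cluster as a whole stays well under its own 2*walltime limit.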