[Swift-devel] Handling failures with job directory creation

Mihael Hategan hategan at mcs.anl.gov
Tue Sep 23 20:37:33 CDT 2014


Right. It's a known problem.

There is currently a quality measure for nodes which, I think, depends
on failure rate; workers with higher quality are picked first if
available. But this does not prevent bad nodes from being used if no
good nodes are available.
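
To make that limitation concrete, here is a rough Python sketch. It is
not the actual coaster code: the Worker class, the quality formula and
pick_worker() are all made up for illustration. The only point is that
a highest-quality-first policy still hands work to a bad node when it is
the only idle one.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record of what the scheduler knows about one worker.
    node: str
    successes: int = 0
    failures: int = 0
    busy: bool = False

    @property
    def quality(self) -> float:
        # Assumed score: smoothed success fraction, so a worker with no
        # history starts at 0.5 rather than at either extreme.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def pick_worker(workers):
    """Return the idle worker with the highest quality, or None if none is idle."""
    idle = [w for w in workers if not w.busy]
    return max(idle, key=lambda w: w.quality, default=None)
```

With one good and one bad worker, the good one is picked while it is
idle, but as soon as it is busy the bad one gets the next job.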

We could do something similar to what the swift scheduler does, which is
to blacklist bad nodes for a certain duration (an exponential back-off
sort of thing).
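
Roughly what I have in mind, as a Python sketch rather than a patch. The
NodeBlacklist name, the 30 second base delay and the one hour cap are
assumptions for illustration, not what the Swift scheduler actually
uses.

```python
import time


class NodeBlacklist:
    """Keep failing nodes out of rotation for an exponentially growing time."""

    def __init__(self, base_delay=30.0, max_delay=3600.0, clock=time.monotonic):
        self.base_delay = base_delay  # delay after the first failure (assumed)
        self.max_delay = max_delay    # cap so a node is eventually retried (assumed)
        self.clock = clock            # injectable so the logic can be tested
        self._strikes = {}            # node -> consecutive failure count
        self._until = {}              # node -> time when it becomes usable again

    def record_failure(self, node):
        """Blacklist the node and return the delay that was applied."""
        strikes = self._strikes.get(node, 0) + 1
        self._strikes[node] = strikes
        # Double the delay with every consecutive failure, up to the cap.
        delay = min(self.base_delay * 2 ** (strikes - 1), self.max_delay)
        self._until[node] = self.clock() + delay
        return delay

    def record_success(self, node):
        """A successful job clears the node's failure history."""
        self._strikes.pop(node, None)
        self._until.pop(node, None)

    def is_usable(self, node):
        """True if the node is not currently blacklisted."""
        return self.clock() >= self._until.get(node, float("-inf"))
```

The scheduler would call is_usable() before handing a job to a node,
record_failure() when something like job directory creation fails, and
record_success() when a job completes. Because of the cap, a node that
gets fixed is eventually tried again.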

As for the _swiftwrap messages, please feel free to experiment with the
info() sub. In trunk, the job-to-node mapping information should be in
the log and the log tools do use it as far as I remember.

Mihael

On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote:
> Yep, it's with coasters local:slurm
> 
> On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Is this with coasters?
> >
> > Mihael
> >
> > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote:
> > > When running psims on Midway, we set our scratch directory to
> > > /scratch/local (a local disk mounted on each node). Occasionally
> > > /scratch/local gets full or becomes unmounted. When this happens, jobs
> > > are quickly and repeatedly sent to this bad node and get marked as
> > > failed.
> > >
> > > Here are some ideas about how Swift could handle this better:
> > >
> > > The Swift/swiftwrap error messages don't identify which node the
> > > directory creation failed on, which makes it difficult to report
> > > these errors to cluster admins.
> > >
> > > If swiftwrap fails to create a job directory, the node could be
> > > marked as 'bad' so that no further jobs are run there.
> > >
> > > An alternative would be to have a rule that says: if using more than
> > > one node, never re-run a failed task on the same node. It would still
> > > be possible for a task to hit multiple bad nodes, but much less likely.
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> >
> >




