[Swift-devel] Handling failures with job directory creation

Tue Sep 23 20:44:36 CDT 2014

Thanks, I'll file this as a ticket/future improvement item.

On Tue, Sep 23, 2014 at 8:37 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Right. It's a known problem.
>
> There is currently a quality measure for nodes which, I think, depends on
> failure rate, and workers with higher quality are picked first if
> available. But this does not prevent bad nodes from being used if no
> good nodes are available.
>
> We could do something similar to what the swift scheduler does, which is
> to blacklist bad nodes for a certain duration (an exponential back-off
> sort of thing).
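
For illustration, the blacklisting idea above (exclude a bad node for a duration that grows with each repeated failure) could be sketched roughly as follows. This is only a sketch under stated assumptions, not the Swift scheduler's or the coaster service's actual code: the class name NodeBlacklist, its methods, and the delay values are all hypothetical.

```python
import time


class NodeBlacklist:
    """Hypothetical tracker that excludes failing nodes for a growing duration."""

    def __init__(self, base_delay=30.0, max_delay=3600.0, clock=time.monotonic):
        self.base_delay = base_delay   # first exclusion period, in seconds (assumed value)
        self.max_delay = max_delay     # cap on the exclusion period (assumed value)
        self.clock = clock             # injectable so the logic can be tested
        self.failures = {}             # node -> consecutive failure count
        self.blocked_until = {}        # node -> time at which the node is usable again

    def record_failure(self, node):
        """Double the exclusion period on each consecutive failure, up to the cap."""
        count = self.failures.get(node, 0) + 1
        self.failures[node] = count
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self.blocked_until[node] = self.clock() + delay
        return delay

    def record_success(self, node):
        """A successful job clears the node's failure history."""
        self.failures.pop(node, None)
        self.blocked_until.pop(node, None)

    def is_usable(self, node):
        """True if the node was never blacklisted or its exclusion has expired."""
        return self.clock() >= self.blocked_until.get(node, float("-inf"))
```

A scheduler would call record_failure() when a job fails for a node-related reason (such as job directory creation) and skip any node for which is_usable() returns False.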
>
> As for the _swiftwrap messages, please feel free to experiment with the
> info() sub. In trunk, the job-to-node mapping information should be in
> the log and the log tools do use it as far as I remember.
>
> Mihael
>
> On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote:
> > Yep, it's with coasters local:slurm
> >
> > On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Is this with coasters?
> > >
> > > Mihael
> > >
> > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote:
> > > > When running psims on Midway, we set our scratch directory to
> > > > /scratch/local (a local disk mounted on each node). Occasionally
> > > > /scratch/local gets full or becomes unmounted. When this happens,
> > > > jobs are quickly and repeatedly sent to this bad node and get
> > > > marked as failed.
> > > >
> > > > Here are some ideas about how Swift could handle this better:
> > > >
> > > > The Swift/swiftwrap error messages don't identify which node the
> > > > directory creation failed on, which makes it difficult to report
> > > > these errors to cluster admins.
> > > >
> > > > If swiftwrap fails to create a job directory, the node could get
> > > > marked as 'bad', preventing further jobs from running there.
> > > >
> > > > An alternative would be to have a rule that says: if using more than
> > > > one node, never re-run a failed task on the same node. It could still
> > > > be possible for a task to hit multiple bad nodes, but much less likely.
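
The "never re-run a failed task on the same node" rule proposed above could be sketched along these lines. This is a hedged illustration only: pick_node and its arguments are hypothetical names, not part of Swift or coasters, and the fallback to an already-tried node when nothing else is available is an assumption made so that a single-node run does not stall.

```python
def pick_node(available, failed_on):
    """Pick a node for a (re)run, avoiding nodes where this task already failed.

    available: list of node names that currently have free slots
    failed_on: set of node names on which this task has already failed
    """
    if not available:
        raise ValueError("no nodes available")
    # Prefer nodes this task has not failed on yet.
    fresh = [n for n in available if n not in failed_on]
    # Assumption: with a single node, or when every node has already failed
    # this task, fall back to the full list rather than blocking the task.
    candidates = fresh if fresh else available
    return candidates[0]
```

For example, pick_node(["a", "b"], {"a"}) selects "b", while pick_node(["a"], {"a"}) still returns "a" because no other node exists.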