[Swift-devel] Handling failures with job directory creation

David Kelly davidkelly at uchicago.edu
Tue Sep 23 19:52:15 CDT 2014


Yep, it's with coasters local:slurm

On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Is this with coasters?
>
> Mihael
>
> On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote:
> > When running psims on Midway, we set our scratch directory to
> > /scratch/local (a local disk mounted on each node). Occasionally
> > /scratch/local gets full or becomes unmounted. When this happens, jobs are
> > quickly and repeatedly sent to this bad node and get marked as failed.
> >
> > Here are some ideas about how Swift could handle this better:
> >
> > The Swift/swiftwrap error messages don't identify which node the directory
> > creation failed on, which makes it difficult to report these errors to
> > cluster admins.
> >
> > If swiftwrap fails to create a job directory, the node could be marked as
> > 'bad' so that no further jobs are sent there.
> >
> > An alternative would be to have a rule that says: if using more than one
> > node, never re-run a failed task on the same node. A task could still hit
> > multiple bad nodes, but that would be much less likely.
>
>
>
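For illustration only, the two scheduling ideas in the quoted message (marking a node 'bad' after a failed job directory creation, and not re-running a failed task on a node where it already failed) could be sketched roughly as follows. This is not Swift or coasters code: NodePicker, record_failure, and candidates are hypothetical names, and Python is used only as a convenient way to show the selection rule.

```python
from collections import defaultdict


class NodePicker:
    """Hypothetical retry policy; not part of Swift or coasters."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.bad_nodes = set()             # nodes where job directory creation failed
        self.failed_on = defaultdict(set)  # task id -> nodes the task has failed on

    def record_failure(self, task_id, node, dir_creation_failed=False):
        """Remember a failure; a directory creation failure marks the node bad."""
        self.failed_on[task_id].add(node)
        if dir_creation_failed:
            self.bad_nodes.add(node)

    def candidates(self, task_id):
        """Return the nodes this task may be sent to, in preference order."""
        # With a single node there is nowhere else to go, so the rule is skipped.
        if len(self.nodes) <= 1:
            return list(self.nodes)
        healthy = [n for n in self.nodes if n not in self.bad_nodes]
        already_failed = self.failed_on.get(task_id, set())
        fresh = [n for n in healthy if n not in already_failed]
        # Prefer nodes the task has not failed on, then any healthy node,
        # and only as a last resort fall back to every node.
        return fresh or healthy or list(self.nodes)


picker = NodePicker(["node1", "node2", "node3"])
picker.record_failure("task-a", "node1", dir_creation_failed=True)
print(picker.candidates("task-a"))
```

The fallback order in candidates() is a design assumption: a task that has failed on every healthy node is still allowed to run somewhere rather than being dropped outright.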