[Swift-devel] Handling failures with job directory creation
David Kelly
davidkelly at uchicago.edu
Mon Sep 22 13:50:10 CDT 2014
When running psims on Midway, we set our scratch directory to
/scratch/local (a local disk mounted on each node). Occasionally
/scratch/local gets full or becomes unmounted. When this happens, jobs are
quickly and repeatedly sent to this bad node and get marked as failed.
Here are some ideas about how Swift could handle this better:
The Swift/swiftwrap error messages don't identify which node the directory
creation failed on, which makes it difficult to report these errors to
cluster admins.
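
As a rough illustration of the kind of message I have in mind, here is a minimal Python sketch. It is not swiftwrap code (swiftwrap is a shell script); the function name create_job_dir and the message wording are made up for this example. The only point is that the hostname is attached to the error at the moment directory creation fails.

```python
import os
import socket


def create_job_dir(path):
    """Create a job directory; on failure, raise an error that names this host.

    Hypothetical helper for illustration only, not part of Swift or swiftwrap.
    """
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        # Include the node's hostname so the failure can be reported to admins.
        host = socket.gethostname()
        raise RuntimeError(
            f"Failed to create job directory {path} on node {host}: {exc}"
        ) from exc
    return path
```

With something like this, a failure under /scratch/local would carry the name of the node it happened on, which is what the cluster admins need.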
If swiftwrap fails to create a job directory, the node could be marked as
'bad' so that no further jobs are sent there.
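
A minimal sketch of that bookkeeping, assuming the scheduler can see which node a directory-creation failure came from. The class and method names (NodeHealth, report_dir_failure, eligible) are hypothetical and do not correspond to anything in Swift today.

```python
class NodeHealth:
    """Tracks nodes that failed to create a job directory (illustrative only)."""

    def __init__(self):
        self.bad_nodes = set()

    def report_dir_failure(self, node):
        # Called when job directory creation fails on `node`.
        self.bad_nodes.add(node)

    def eligible(self, nodes):
        # Return the nodes that have not been marked bad, preserving order.
        return [n for n in nodes if n not in self.bad_nodes]
```

A real implementation would probably also want some way to clear the 'bad' mark, since a full or unmounted /scratch/local may be fixed while the run is still going.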
An alternative would be to have a rule that says: if using more than one node,
never re-run a failed task on the same node. It could still be possible for
a task to hit multiple bad nodes, but much less likely.
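
The retry rule could look roughly like the sketch below. This is only an assumption about how it might be expressed; pick_retry_node and its arguments are invented for the example, and a real scheduler would choose among the candidates by load rather than taking the first one.

```python
def pick_retry_node(nodes, failed_on):
    """Choose a node for re-running a task (illustrative only).

    nodes:     all nodes available to the run
    failed_on: nodes where this task has already failed
    """
    if not nodes:
        raise ValueError("no nodes available")
    # Prefer any node the task has not failed on yet.
    candidates = [n for n in nodes if n not in failed_on]
    if not candidates:
        # Single-node run, or the task has failed everywhere: fall back to
        # the full list rather than refusing to retry.
        candidates = list(nodes)
    return candidates[0]
```

For example, with nodes ["n1", "n2", "n3"] and a task that has failed on "n1", the retry goes to "n2". With only one node, the task is retried on that node, which matches the "if using more than one node" condition.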