[Swift-devel] Handling failures with job directory creation

Mihael Hategan hategan at mcs.anl.gov
Tue Sep 23 18:09:00 CDT 2014


Is this with coasters?

Mihael

On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote:
> When running psims on Midway, we set our scratch directory to
> /scratch/local (a local disk mounted on each node). Occasionally
> /scratch/local gets full or becomes unmounted. When this happens, jobs are
> quickly and repeatedly sent to this bad node and get marked as failed.
> 
> Here are some ideas about how Swift could handle this better:
> 
> The Swift/swiftwrap error messages don't identify which node the directory
> creation failed on, which makes it difficult to report these errors to
> cluster admins.
> 
> If swiftwrap fails to create a job directory, the node could get marked as
> 'bad' and prevent jobs from running there.
> 
> An alternative would be to have a rule that says: if using more than one node,
> never re-run a failed task on the same node. It could still be possible for
> a task to hit multiple bad nodes, but much less likely.
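
For concreteness, the three ideas above could be sketched roughly as follows. This is only an illustration in Python, not how Swift, swiftwrap, or coasters are implemented; the names NodePicker, describe_dir_failure, and max_dir_failures are hypothetical.

```python
import socket


def describe_dir_failure(path, err):
    # Idea 1: put the host name in the error so it can be reported to admins.
    return f"{socket.gethostname()}: could not create job directory {path}: {err}"


class NodePicker:
    """Hypothetical scheduler helper; not part of Swift or coasters."""

    def __init__(self, nodes, max_dir_failures=1):
        self.nodes = list(nodes)
        self.max_dir_failures = max_dir_failures  # assumed threshold
        self.dir_failures = {}  # node -> count of job-directory creation failures
        self.failed_on = {}     # task id -> set of nodes the task already failed on

    def is_bad(self, node):
        # Idea 2: a node is 'bad' once directory creation has failed often enough.
        return self.dir_failures.get(node, 0) >= self.max_dir_failures

    def record_failure(self, task, node, dir_creation_failed=False):
        self.failed_on.setdefault(task, set()).add(node)
        if dir_creation_failed:
            self.dir_failures[node] = self.dir_failures.get(node, 0) + 1

    def pick(self, task):
        healthy = [n for n in self.nodes if not self.is_bad(n)]
        # Idea 3: prefer nodes this task has not failed on yet.
        tried = self.failed_on.get(task, set())
        fresh = [n for n in healthy if n not in tried]
        # With a single node, or once all healthy nodes were tried, reuse one.
        candidates = fresh or healthy
        return candidates[0] if candidates else None
```

A bad node is skipped for every task, while an ordinary task failure only steers that one task elsewhere. A real version would probably also want bad nodes to expire after some time, since a full /scratch/local can be cleaned up.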
