<div dir="ltr">Thanks, I'll file this as a ticket/future improvement item</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 23, 2014 at 8:37 PM, Mihael Hategan <span dir="ltr">&lt;<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Right. It's a known problem.<br>

<br>

There is currently a quality measure for nodes, which I think depends on<br>

failure rate, and workers with higher quality are picked first if<br>

available. But this does not prevent bad nodes from being used if no<br>

good nodes are available.<br>

<br>

We could do something similar to what the swift scheduler does, which is<br>

to blacklist bad nodes for a certain duration (an exponential back-off<br>

sort of thing).<br>

<br>
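
A minimal sketch of that back-off blacklisting idea, written in Python for illustration<br>

rather than taken from the Swift scheduler or the coaster service. The class name<br>

NodeBlacklist, its methods, and the delay values are all hypothetical:<br>

<pre>
```python
import time


class NodeBlacklist:
    """Hypothetical helper: block a failing node for a growing period."""

    def __init__(self, base_delay=30.0, max_delay=3600.0, clock=time.monotonic):
        self.base_delay = base_delay  # seconds blocked after the first failure (assumed value)
        self.max_delay = max_delay    # upper bound on the block time (assumed value)
        self.clock = clock            # injectable so the logic can be tested without sleeping
        self.failures = {}            # node name to consecutive failure count
        self.blocked_until = {}       # node name to time at which it may be used again

    def record_failure(self, node):
        """Double the block time on each consecutive failure, up to max_delay."""
        count = self.failures.get(node, 0) + 1
        self.failures[node] = count
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self.blocked_until[node] = self.clock() + delay
        return delay

    def record_success(self, node):
        """A successful job clears the node's failure history."""
        self.failures.pop(node, None)
        self.blocked_until.pop(node, None)

    def is_usable(self, node):
        return self.clock() >= self.blocked_until.get(node, float("-inf"))

    def pick(self, nodes, exclude=()):
        """Return the first usable node not in exclude, or None if there is none."""
        for node in nodes:
            if node not in exclude and self.is_usable(node):
                return node
        return None
```
</pre>

Passing the node a task last failed on as exclude would also cover the<br>

"never re-run a failed task on the same node" suggestion. Returning None when every<br>

node is blocked leaves it to the caller to wait or fall back to a bad node.<br>

<br>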

As for the _swiftwrap messages, please feel free to experiment with the<br>

info() sub. In trunk, the job-to-node mapping information should be in<br>

the log and the log tools do use it as far as I remember.<br>

<span class="HOEnZb"><font color="#888888"><br>

Mihael<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote:<br>

> Yep, it's with coasters local:slurm<br>

><br>

> On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan &lt;<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>&gt; wrote:<br>

><br>

> > Is this with coasters?<br>

> ><br>

> > Mihael<br>

> ><br>

> > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote:<br>

> > > When running psims on Midway, we set our scratch directory to<br>

> > > /scratch/local (a local disk mounted on each node). Occasionally<br>

> > > /scratch/local gets full or becomes unmounted. When this happens, jobs are<br>

> > > quickly and repeatedly sent to this bad node and get marked as failed.<br>

> > ><br>

> > > Here are some ideas about how Swift could handle this better:<br>

> > ><br>

> > > The Swift/swiftwrap error messages don't identify which node the directory<br>

> > > creation failed on, which makes it difficult to report these errors to<br>

> > > cluster admins.<br>

> > ><br>

> > > If swiftwrap fails to create a job directory, the node could be marked as<br>

> > > 'bad' so that no further jobs are sent there.<br>

> > ><br>

> > > An alternative would be a rule that says: if using more than one node,<br>

> > > never re-run a failed task on the same node. It could still be possible for<br>

> > > a task to hit multiple bad nodes, but it would be much less likely.<br>

> ><br>

> ><br>

> ><br>

<br>

<br>

</div></div></blockquote></div><br></div>