<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<span style="width: 500px;"><font size="-1">Here is an excerpt from an
email on 6/19. <br>
>> It completed 10998 tasks (8402 tasks with an exit code of 0, and 2596 <br>
>> tasks with an exit code of -1 -- aka failed) in 13399 seconds on 200 <br>
>> processors, this was for the 100 molecule run! The failed tasks were <br>
>> all on the same node over several short time intervals (~30 seconds), <br>
>> and were due to a "<span class="hl">Stale</span> <span class="hl">NFS</span> file <span class="hl">handle</span>", probably due to <br>
>> having 200 processes hitting the shared file system at the same time. <br>
>> Note that all these 2596 failed tasks were restarted by Swift and <br>
>> completed successfully on the resubmission. In the end, everything <br>
>> went through, and the run was successful!</font></span><br>
<br>
In later runs we noticed the same node acting up, taking on the order of
100 times longer than expected to complete some tasks. I bet this node
is having some hardware issues, and we should write to help@tg to tell
them.<br>
<br>
The failed tasks were eventually retried and succeeded, and the whole
run was successful. The question is why the 2596 failed tasks (which
were all independent of each other) were not resubmitted faster after
they failed. I would have expected the wait queue to fill up with these
2596 retried tasks.<br>
<br>
Ioan<br>
<br>
Ben Clifford wrote:
<blockquote
cite="mid:Pine.LNX.4.64.0706221711110.1452@dildano.hawaga.org.uk"
type="cite">
<pre wrap="">
On Fri, 22 Jun 2007, Ioan Raicu wrote:
</pre>
<blockquote type="cite">
<pre wrap="">I believe it could have sent out more. For example, there were 2500+ tasks
that failed in the middle of those 6800 tasks (which were all independent),
why were 2500 tasks not resubmitted all at once... they were each about 200
seconds long, so most of them should have certainly shown up in the wait
queue.
</pre>
</blockquote>
<pre wrap=""><!---->
what kind of failure?
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>
============================================
</pre>
</body>
</html>