<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi,<br>
Here is a quick recap of the 244 MolDyn run we made this weekend...<br>
<br>
I have posted the logs and graphs at:
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/</a><br>
<br>
11079 tasks failed with an exit code of -1. <br>
2 tasks failed with an exit code of 127.<br>
<br>
Inspecting the logs revealed the infamous stale NFS
handle error!<br>
<br>
A single machine (192.5.198.37) had all the failed tasks (11081 in
total); the machine was not completely broken, as it did complete 4
tasks successfully, although the completion times were considerably
higher than on the other machines.<br>
<br>
20836 tasks finished with an exit code of 0.<br>
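As a quick sanity check, the counts above add up as follows (a small Python sketch of the arithmetic, not output from the logs):<br>

```python
# Task accounting for this weekend's run, using the counts above.
failed_minus1 = 11079   # tasks that failed with exit code -1
failed_127 = 2          # tasks that failed with exit code 127
succeeded = 20836       # tasks that finished with exit code 0

failed = failed_minus1 + failed_127
total = succeeded + failed
print(failed)  # 11081 -- all on the one bad machine (192.5.198.37)
print(total)   # 31917 task executions overall
```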
<br>
I was expecting 20497 tasks broken down as follows:<br>
<br>
<table x:str="" style="border-collapse: collapse; width: 144pt;"
border="0" cellpadding="0" cellspacing="0" width="192">
<col style="width: 48pt;" span="3" width="64"> <tbody>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt; width: 48pt;" x:num="" align="right"
height="17" width="64">1</td>
<td style="width: 48pt;" x:num="" align="right" width="64">1</td>
<td style="width: 48pt;" x:num="" x:fmla="=A1*B1" align="right"
width="64">1</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A2*B2" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A3*B3" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">68</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A4*B4" align="right">16592</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A5*B5" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">11</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A6*B6" align="right">2684</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A7*B7" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" x:num="" align="right" height="17">1</td>
<td x:num="" align="right">244</td>
<td x:num="" x:fmla="=A8*B8" align="right">244</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" height="17"><br>
</td>
<td><br>
</td>
<td><br>
</td>
</tr>
<tr style="height: 12.75pt;" height="17">
<td style="height: 12.75pt;" height="17"><br>
</td>
<td><br>
</td>
<td x:num="" x:fmla="=SUM(C1:C9)" align="right">20497</td>
</tr>
</tbody>
</table>
<br>
I do not know why there were 339 more tasks than we were expecting.<br>
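For reference, here is how the expected total and the 339-task discrepancy work out, reading the table's first two columns as per-stage job counts and molecules per stage (my interpretation of the breakdown):<br>

```python
# Expected MolDyn task counts: (jobs per stage) x (molecules per stage),
# summed over the eight stages listed in the table above.
rows = [(1, 1), (1, 244), (1, 244), (68, 244),
        (1, 244), (11, 244), (1, 244), (1, 244)]
expected = sum(jobs * mols for jobs, mols in rows)
observed = 20836  # tasks that finished with exit code 0

print(expected)             # 20497
print(observed - expected)  # 339 unexplained extra tasks
```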
<br>
A close look at the summary graph
(<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg</a>)
shows that after the large number of failed tasks, the queue length
(blue line) quickly dropped to 0 and stayed there, as Swift trickled in
only about 100 tasks at a time. For the rest of the experiment, only
about 100 tasks were ever running at once. This is not the first time
we have seen this, and it seems to show up only when a bad machine
fails many tasks: Swift then does not resubmit them quickly, and jobs
merely trickle in thereafter, never keeping all the processors busy. <br>
<br>
In runs with no bad nodes and no large number of failures, this did
not happen, and Swift essentially submitted all independent tasks to
Falkon. I know there is a heuristic within Karajan that is probably
reducing the task submit rate after the large number of failures, but
I think it needs to be tuned to recover from such failures so that, in
time, it again attempts to send more. A good analogy is TCP: its
window size grows larger and larger, then a large number of packets is
lost and TCP collapses its window; but imagine TCP never recovering
from this, keeping a small window for the rest of the connection,
despite the fact that it could grow the window again until the next
round of lost packets... I believe the normal behavior should allow
Swift to recover and again submit many tasks to Falkon. If this
heuristic cannot easily be tweaked or made to recover from the "window
collapse", could we disable it when we are running on Falkon at a
single site?<br>
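To illustrate the recovery behavior I have in mind, here is a minimal, hypothetical AIMD-style sketch (not actual Swift/Karajan code; the class and its parameters are made up for illustration): the submit window collapses on a failure burst, but keeps probing upward afterwards, the way TCP reopens its window after loss.<br>

```python
# Hypothetical AIMD-style submit throttle: collapse on a failure
# burst, but keep additively increasing afterwards so the window
# recovers instead of staying small forever.
class SubmitThrottle:
    def __init__(self, max_outstanding=1000):
        self.limit = 1.0                      # current submit window
        self.max_outstanding = max_outstanding

    def on_success(self):
        # Additive increase: slowly reopen the window on each success.
        self.limit = min(self.limit + 1.0, self.max_outstanding)

    def on_failure_burst(self):
        # Multiplicative decrease, but never stop probing entirely.
        self.limit = max(self.limit / 2.0, 1.0)

t = SubmitThrottle()
for _ in range(200):
    t.on_success()        # window grows to 201
t.on_failure_burst()      # bad node: window collapses to ~100
for _ in range(500):
    t.on_success()        # ...but then recovers past its old size
print(t.limit)            # 600.5
```

The key point is the last loop: after the collapse, successes keep widening the window again, rather than leaving it stuck near 100 for the rest of the run.<br>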
<br>
BTW, here are the graphs from a previous run in which only the last
few jobs didn't finish, due to a bug in the application code. <br>
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/</a><br>
In this run, notice that there were no bad nodes causing many tasks to
fail, and Swift submitted many tasks to Falkon and managed to keep all
processors busy! <br>
<br>
I think we can call the 244-mol MolDyn run a success, both the current
run and the previous run from 7-16-07 that almost finished!<br>
<br>
We need to figure out how to control the job throttling better, and
perhaps how to automatically detect this plaguing "stale NFS handle"
problem and contain the damage to significantly fewer task failures. I
also think increasing the number of retries on Swift's end should be
considered when running over Falkon. Notice that a single worker can
fail as many as 1000 tasks per minute, which is a lot of tasks given
that when the stale NFS handle shows up, it is around for tens of
seconds to minutes at a time. <br>
<br>
BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
and 619.2 wasted) in 8.5 hours. In contrast, the run we made on
7-16-07, which almost finished but behaved much better since there
were no node failures, consumed about 866.4 CPU hours (866.3 used and
0.1 wasted) in 4.18 hours. <br>
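The CPU-hour figures check out, and they also show how costly the bad node was (a quick arithmetic sketch; the wasted time is attributed mostly, though not entirely, to the failed tasks on that node):<br>

```python
# Sanity check of the CPU-hour figures quoted above.
used, wasted = 937.7, 619.2        # this weekend's run (8.5 h wall time)
used2, wasted2 = 866.3, 0.1        # the 7-16-07 run (4.18 h wall time)

print(round(used + wasted, 1))     # 1556.9 CPU hours total
print(round(used2 + wasted2, 1))   # 866.4 CPU hours total

# Fraction of CPU time wasted in the run with the bad node:
print(round(wasted / (used + wasted) * 100, 1))  # 39.8 (percent)
```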
<br>
When Nika comes back from vacation, we can try the real application,
which should consume some 16K CPU hours (service units)! She also has
her own temporary allocation at ANL/UC now, so we can use that!<br>
<br>
Ioan<br>
<br>
Ioan Raicu wrote:
<blockquote cite="mid:46BDB67D.2040207@cs.uchicago.edu" type="cite">
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
I think the workflow finally completed successfully, but there are
still some oddities in the way the logs look (especially the job
throttling, and a few hundred more jobs than I was expecting). At
least we have all the output we needed for every molecule!<br>
<br>
I'll write up a summary of what happened, and draw up some nice graphs,
and send it out later today.<br>
<br>
Ioan<br>
<br>
iraicu@viper:/home/nefedova/alamines> ls fe_* | wc<br>
488 488 6832<br>
<br>
</blockquote>
</body>
</html>