<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<br>

Mihael Hategan wrote:

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">On Sun, 2007-08-12 at 00:22 -0500, Ioan Raicu wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Hi,

Here is a quick recap of the 244 MolDyn run we made this weekend...

I have posted the logs and graphs at:

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/</a>

11079 failed with a -1. 

2 failed with an exit code of 127.

Inspecting the logs revealed the infamous stale NFS handle error!

A single machine (192.5.198.37) had all the failed tasks (11081

tasks); the machine was not completely broken, as it did complete 4

tasks successfully, although the completion times were considerably

higher than the other machines.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

It seems a bit inefficient that 1/3 of the tasks would go to the one

machine (out of a fairly large number) that consistently fails tasks.

  </pre>

</blockquote>

Only one machine had problems with the GPFS mount.  The errors were

happening within the first 10 ms or so, and the communication overhead

was around 20~30 ms, so we are talking about a bad machine that is

failing tasks every 30~40 ms, while other machines that were operating

normally had jobs lasting a few minutes.  Now, the GPFS mount errors

came in bursts of some tens of seconds to maybe a minute or two

(several of these), in which it failed all the tasks in a few batches.<br>
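<br>
As a rough consistency check on those numbers, here is a minimal Python sketch. It assumes the bad worker handles one task at a time, and the helper names are only illustrative:<br>
<pre wrap="">
```python
# Figures from this thread: ~10 ms to hit the error plus 20~30 ms of
# communication overhead, and 11081 failed tasks on the one bad machine.
# Assumption: the worker fails tasks strictly one after another.
FAILED_TASKS = 11081


def failures_per_minute(ms_per_failed_task):
    """How many tasks one worker can fail per minute at this turnaround."""
    return 60000.0 / ms_per_failed_task


def seconds_to_fail(tasks, ms_per_failed_task):
    """Cumulative time the worker needs to fail this many tasks."""
    return tasks * ms_per_failed_task / 1000.0


for ms in (30, 40):
    print(ms, failures_per_minute(ms), seconds_to_fail(FAILED_TASKS, ms))
```
</pre>
At 30~40 ms per failed task, the 11081 failures add up to roughly 5.5 to 7.5 minutes of cumulative failure time, which fits several bursts of some tens of seconds to a minute or two.<br>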

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap=""></pre>

  <blockquote type="cite">

    <pre wrap="">  

20836 tasks finished with an exit code 0.

I was expecting 20497 tasks broken down as follows:

                      1

                      1

                      1

                      1

                    244

                    244

                      1

                    244

                    244

                     68

                    244

                  16592

                      1

                    244

                    244

                     11

                    244

                   2684

                      1

                    244

                    244

                      1

                    244

                    244

                  20497

I do not know why there were 339 more tasks than we were expecting.

Taking a close look at the summary graph

(<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg</a>), we see that after the large number of failed tasks, the queue length (blue line) quickly went to 0, and then stayed there as Swift was trickling in only about 100 tasks at a time.

  For the rest of the experiment, only about 100 tasks at a time were

ever running.  This is not the first time we have seen this, and it

seems that it only shows up when there is a bad machine failing many

tasks, and essentially Swift doesn't try to resubmit them fast, and

the jobs only trickle in thereafter not keeping all the processors

busy.  

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That's the job throttle set to 10000, multiplied by a score of 0.01

(after all those failures).

  </pre>

</blockquote>

OK, so should we set the job throttle higher, ideally to make sure that

even in the worst case (such as the one we found), it still sends

enough jobs to keep the processors busy?  In our case, we should have

set it to 25000 to get about 250 concurrent jobs.<br>
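<br>
A minimal sketch of that arithmetic, assuming the number of jobs in flight is simply the job throttle multiplied by the score, as Mihael describes above. The function names are hypothetical and are not part of Swift or Karajan:<br>
<pre wrap="">
```python
# Hypothetical helpers: Swift/Karajan's real scheduler code is not shown here.
# Assumption: jobs in flight = job throttle * site score.
def effective_concurrency(job_throttle, score):
    """Approximate number of jobs allowed in flight at once."""
    return int(round(job_throttle * score))


def throttle_for(target_jobs, worst_case_score):
    """Throttle needed to keep target_jobs in flight at the given score."""
    return int(round(target_jobs / worst_case_score))


print(effective_concurrency(10000, 0.01))   # the setting used in this run
print(effective_concurrency(25000, 0.01))   # the setting suggested above
print(throttle_for(250, 0.01))              # throttle for ~250 jobs at the worst score
```
</pre>
With a score of 0.01, a throttle of 10000 gives the roughly 100 concurrent tasks seen in the graph, and 25000 would give about 250.<br>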

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">When we had runs with no bad nodes and no large number of failures,

this did not happen, and Swift essentially submitted all independent

tasks to Falkon.  I know there is a heuristic within Karajan that is

probably affecting the submit rate of tasks after the large number of

failures happened, but I think it needs to be tuned to recover from

large number of failures so in time, it again attempts to send more.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

It does. Unfortunately jobs keep failing. </pre>

</blockquote>

I don't think that is the case... they failed in a few bunches over a

relatively small amount of time...  <br>

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">Set the aforementioned

throttle higher until a better algorithm is stuck in the scheduler. That

or stop sending jobs to a machine that keeps failing them.

  </pre>

</blockquote>

This is not hard to do in Falkon: look at the exit codes of the

application and do some housekeeping around that.  But it's not all that

clear that this kind of logic should be in Falkon.  I am not sure how

easy it's going to be to distinguish machine failures from other

errors.  I believe the reaction within Falkon should be different

for a machine failure than for other errors, so it's important to tell

them apart.<br>

<br>

If Falkon is to take some action when a certain machine keeps failing

jobs, what does everyone recommend?<br>

<br>

Should it blacklist the machine and never send jobs to it again, just

suspend job dispatch to that machine for some time, or

actually retry failed jobs on other nodes?<br>
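<br>
To make the "suspend for some time" option concrete, here is a minimal Python sketch. It is not Falkon code: the class, its parameters, and the thresholds are all hypothetical, and it treats every non-zero exit code as a failure, so it does not yet distinguish machine failures from application errors:<br>
<pre wrap="">
```python
class NodeSuspender:
    """Hypothetical policy: pause dispatch to a node after repeated failures."""

    def __init__(self, max_consecutive_failures=5, cooldown_seconds=60.0):
        self.max_failures = max_consecutive_failures
        self.cooldown = cooldown_seconds
        self.failures = {}          # node -> consecutive failure count
        self.suspended_until = {}   # node -> time when dispatch may resume

    def record(self, node, exit_code, now):
        """Update the node's state from one task's exit code."""
        if exit_code == 0:
            self.failures[node] = 0
            return
        self.failures[node] = self.failures.get(node, 0) + 1
        if self.failures[node] >= self.max_failures:
            # Too many failures in a row: stop dispatching for a while.
            self.suspended_until[node] = now + self.cooldown
            self.failures[node] = 0

    def can_dispatch(self, node, now):
        """True if the node is not currently suspended."""
        return now >= self.suspended_until.get(node, 0.0)


policy = NodeSuspender(max_consecutive_failures=3, cooldown_seconds=60.0)
for t in (0.00, 0.03, 0.06):
    policy.record("192.5.198.37", -1, now=t)
print(policy.can_dispatch("192.5.198.37", now=1.0))    # still suspended
print(policy.can_dispatch("192.5.198.37", now=61.0))   # cooldown has elapsed
```
</pre>
A cooldown rather than a permanent blacklist would let a node come back once a burst of stale handle errors has passed, at the cost of a few more failed tasks if the problem persists.<br>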

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap=""> A good analogy is TCP, think of its window size increasing larger and

larger, but then a large number of packets get lost, and TCP collapses

its window size, but then never recovers from this and remains

with a small window size for the rest of the connection, regardless of

the fact that it could again increase the window size until the next

round of lost packets...  

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Your analogy is incorrect. In this case the score is kept low because

jobs keep on failing, even after the throttling kicks in.

  </pre>

</blockquote>

I would argue against your theory.  The last failed job (#12794) failed at

3954 seconds into the experiment, yet the last job (#31917) ended

at 30600 seconds.  There were no failed jobs in the last 26K+ seconds

with 19K+ jobs.  So my question is again: why would the score not

improve at all over this long a period and this many jobs?  The

throttling seems to be relatively constant throughout the experiment

(after the failed jobs).<br>
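<br>
For illustration, here is a minimal Python sketch of the recovery behavior the TCP analogy suggests (additive increase, multiplicative decrease). This is not Karajan's actual heuristic, which is not shown in this thread; the step, backoff, floor, and ceiling values are assumptions chosen only to make the point:<br>
<pre wrap="">
```python
def update_score(score, job_succeeded, step=0.001, backoff=0.5,
                 floor=0.01, ceiling=1.0):
    """One AIMD step: add a little on success, multiply down on failure.

    All parameter values are hypothetical, not taken from Swift or Karajan.
    """
    if job_succeeded:
        return min(ceiling, score + step)
    return max(floor, score * backoff)


score = 1.0
for _ in range(11081):           # the failed tasks from this run
    score = update_score(score, job_succeeded=False)
collapsed = score

for _ in range(19000):           # the 19K+ jobs that later succeeded
    score = update_score(score, job_succeeded=True)
recovered = score

print(collapsed, recovered)
```
</pre>
Under these assumed parameters the score collapses to its floor during the failures and climbs back to the ceiling well before 19K successful jobs have completed, which is the kind of recovery I would expect to see.<br>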

<br>

Ioan<br>

<blockquote cite="mid:1186938048.24879.8.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">

Mihael

  </pre>

  <blockquote type="cite">

    <pre wrap="">I believe the normal behavior should allow Swift to recover and again

submit many tasks to Falkon.  If this heuristic cannot be easily

tweaked or made to recover from the "window collapse", could we

disable it when we are running on Falkon at a single site?

BTW, here were the graphs from a previous run when only the last few

jobs didn't finish due to a bug in the application code.  

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/</a>

In this run, notice that there were no bad nodes that caused many

tasks to fail, and Swift submitted many tasks to Falkon, and managed

to keep all processors busy!  

I think we can call the 244-mol MolDyn run a success, both the current

run and the previous run from 7-16-07 that almost finished!

We need to figure out how to control the job throttling better, and

perhaps how to automatically detect this plaguing problem with

"Stale NFS handle", and possibly contain the damage to significantly

fewer task failures.  I also think that increasing the # of retries

from Swift's end should be considered when running over Falkon.

Notice that a single worker can fail as many as 1000 tasks per minute,

which is a lot of tasks given that when the NFS stale handle shows up,

it's around for tens of seconds to minutes at a time.  

BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used

and 619.2 wasted) in 8.5 hours.  In contrast, the run we made on

7-16-07 which almost finished, but behaved much better since there

were no node failures, consumed about 866.4 CPU hours (866.3 used and

0.1 wasted) in 4.18 hours.  

When Nika comes back from vacation, we can try the real application,

which should consume some 16K CPU hours (service units)!   She also

has her own temporary allocation at ANL/UC now, so we can use that!

Ioan

Ioan Raicu wrote: 

    </pre>

    <blockquote type="cite">

      <pre wrap="">I think  the workflow finally completed successfully, but there are

still some oddities in the way the logs look (especially job

throttling, a few hundred more jobs than I was expecting, etc).  At

least, we have all the output we needed for every molecule!

I'll write up a summary of what happened, and draw up some nice

graphs, and send it out later today.

Ioan

iraicu@viper:/home/nefedova/alamines> ls fe_* | wc

    488     488    6832

      </pre>

    </blockquote>

  </blockquote>

  <pre wrap=""><!---->

  </pre>

</blockquote>

</body>

</html>