<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<br>

Mihael Hategan wrote:

<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"

 type="cite">

  <blockquote type="cite">

    <blockquote type="cite">

      <pre wrap="">Your analogy is incorrect. In this case the score is kept low because

jobs keep on failing, even after the throttling kicks in.

      </pre>

    </blockquote>

    <pre wrap="">I would argue against your theory.... the last failed job (#12794) failed at

3954 seconds into the experiment, yet the last job (#31917) ended

at 30600 seconds.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Strange. I've seen jobs failing all throughout the run. Did something

make falkon stop sending jobs to that broken node? </pre>

</blockquote>

At time xxx, the bad node was deregistered for failing to answer

notifications.  The graph below shows just the jobs for the bad node. 

So for the first hour or so of the experiment, there were 4 jobs that

were successful (these are the faint black lines below that are

horizontal, showing their long execution time), and the rest all failed

(denoted by the small dots... showing their short execution time). 

Then the node de-registered, and did not come back for the rest of the

experiment.<br>

<br>

<br>

<img src="cid:part1.00030207.00090507@cs.uchicago.edu" alt=""><br>

<br>

<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">Statistically every

job would have a 1/total_workers chance of going to the one place where

it shouldn't. Higher if some "good" workers are busy doing actual stuff.

  </pre>

  <blockquote type="cite">

    <pre wrap="">  There were no failed jobs in the last 26K+ seconds with 19K+ jobs.

Now my question is again, why would the score not improve at all

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Quantify "at all". </pre>

</blockquote>

Look at

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg</a>!<br>

Do you see the # of active (green) workers as a relatively flat line at

around 100 (and this is with the wait queue length being 0, so Swift

was simply not sending enough work to keep Falkon's 200+ workers

busy)?  If the score had improved, I would have expected an

upward trend on the number of active workers!<br>

<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">Point is it may take quite a few jobs to make up for

~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to

make up for a failed one). Should the probability of jobs failing be

less than 1/5 in a certain time window, the score should increase.

  </pre>

</blockquote>

So you are saying that 19K+ successful jobs were not enough to

counteract the 10K+ failed jobs from the early part of the experiment? 

Can this ratio (1:5) be changed?  From this experiment, it would seem

that the heuristic is a slow learner... maybe you have ideas on how to

make it adapt more quickly to changes?<br>
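<br>
To make the arithmetic concrete, here is a minimal sketch of a linear score using the 1:5 ratio quoted above. This is only a toy model, not Swift's actual scoring code; the names net_score and successes_to_recover are hypothetical, and I am assuming each success counts +1 and each failure counts -5:<br>
<pre wrap="">
```python
# Toy model only: the real Swift scheduler score may be computed differently.
# The 5:1 ratio is the one quoted in this thread; everything else is assumed.
SUCCESSES_PER_FAILURE = 5


def net_score(successes, failures, successes_per_failure=SUCCESSES_PER_FAILURE):
    """Each success adds 1; each failure subtracts `successes_per_failure`."""
    return successes - failures * successes_per_failure


def successes_to_recover(failures, successes_per_failure=SUCCESSES_PER_FAILURE):
    """Number of successful jobs needed to cancel out `failures` failed jobs."""
    return failures * successes_per_failure


failed = 10000      # roughly the early failures in this run
succeeded = 19000   # roughly the successes in the last 26K+ seconds

print(successes_to_recover(failed))   # 50000
print(net_score(succeeded, failed))   # -31000
```
</pre>
Under this toy model, ~10K failures would need ~50K successes to cancel out, so 19K successes would still leave the score well below where it started.<br>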

<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">

In the context in which jobs are sent to non-busy workers, the system

would tend to produce lots of failed jobs if it takes little time

(compared to the normal run-time of a job) for a bad worker to fail a

job. This *IS* why the swift scheduler throttles in the beginning: to

avoid sending a large number of jobs to a resource that is broken.

  </pre>

</blockquote>

But the whole resource is not broken... that is the whole point here...

anyway, I think this is a valid case, and we need to discuss how to

handle it to make the entire Swift+Falkon system more robust!<br>

<br>

BTW, here is another experiment with MolDyn that shows the throttling

and this heuristic behaving as I would have expected!<br>

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg</a><br>

<br>

Notice the queue length (blue line) at around 11K seconds dropped

sharply, but then grew back up.  That sudden drop was many jobs failing

fast on a bad node, and the sudden growth back up was Swift

re-submitting almost the same # of jobs that failed back to Falkon. 

The same thing happened again at around 16K seconds.  Now my question

is, why did it work so nicely in this experiment, and not in our

latest?  Could it be that there were many successful jobs done (10K+)

at the time of the first failure?  And the failure was short enough

that it only produced maybe 1K failed jobs?  If this is the reason,

then one way to level the playing field and handle both cases

is to use a sliding window when training the heuristic, instead of the

entire history.  You can then adjust the window size to make the

heuristic more responsive or more consistent!<br>
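<br>
Here is a rough sketch of what I mean by a sliding window. It is not based on the Swift scheduler code; the class name SlidingWindowScore and the window_size parameter are hypothetical, and the scoring reuses the same assumed toy model (+1 per success, -5 per failure), only over the most recent outcomes instead of the entire history:<br>
<pre wrap="">
```python
from collections import deque


class SlidingWindowScore:
    """Toy score computed over only the most recent `window_size` job outcomes."""

    def __init__(self, window_size, successes_per_failure=5):
        # A deque with maxlen drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.successes_per_failure = successes_per_failure

    def record(self, succeeded):
        """Record one finished job: True for success, False for failure."""
        self.outcomes.append(bool(succeeded))

    def score(self):
        """Assumed toy score: +1 per success, -ratio per failure, window only."""
        successes = sum(self.outcomes)
        failures = len(self.outcomes) - successes
        return successes - failures * self.successes_per_failure


tracker = SlidingWindowScore(window_size=1000)

# A burst of 10000 fast failures from a bad node...
for _ in range(10000):
    tracker.record(False)
score_after_failures = tracker.score()   # -5000: only the last 1000 are kept

# ...followed by 1000 successes once the bad node is gone.
for _ in range(1000):
    tracker.record(True)
score_after_recovery = tracker.score()   # 1000: the failures have aged out
```
</pre>
A smaller window_size forgets old failures faster (more responsive), and a larger one smooths over short bursts (more consistent).<br>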

<br>

We should certainly talk about this issue: what needs to be done, and

who will do it!<br>

<br>

Ioan<br>

<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"

 type="cite">

  <pre wrap="">

Mihael

  </pre>

  <blockquote type="cite">

    <pre wrap=""> over this large period of time and jobs, as the throttling seems to be

relatively constant throughout the experiment (after the failed jobs).

Ioan

    </pre>

    <blockquote type="cite">

      <pre wrap="">Mihael

      </pre>

      <blockquote type="cite">

        <pre wrap="">I believe the normal behavior should allow Swift to recover and again

submit many tasks to Falkon.  If this heuristic cannot be easily

tweaked or made to recover from the "window collapse", could we

disable it when we are running on Falkon at a single site?

BTW, here were the graphs from a previous run when only the last few

jobs didn't finish due to a bug in the application code.  

<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/</a>

In this run, notice that there were no bad nodes that caused many

tasks to fail, and Swift submitted many tasks to Falkon, and managed

to keep all processors busy!  

I think we can call the 244-mol MolDyn run a success, both the current

run and the previous run from 7-16-07 that almost finished!

We need to figure out how to control the job throttling better, and

perhaps how to automatically detect this plaguing problem with

"Stale NFS handle", and possibly contain the damage to significantly

fewer task failures.  I also think that increasing the # of retries

from Swift's end should be considered when running over Falkon.

Notice that a single worker can fail as many as 1000 tasks per minute,

which is a lot of tasks given that when the NFS stale handle shows up,

it's around for tens of seconds to minutes at a time.  

BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used

and 619.2 wasted) in 8.5 hours.  In contrast, the run we made on

7-16-07 which almost finished, but behaved much better since there

were no node failures, consumed about 866.4 CPU hours (866.3 used and

0.1 wasted) in 4.18 hours.  

When Nika comes back from vacation, we can try the real application,

which should consume some 16K CPU hours (service units)!   She also

has her own temporary allocation at ANL/UC now, so we can use that!

Ioan

Ioan Raicu wrote: 

        </pre>

        <blockquote type="cite">

          <pre wrap="">I think  the workflow finally completed successfully, but there are

still some oddities in the way the logs look (especially job

throttling, a few hundred more jobs than I was expecting, etc).  At

least, we have all the output we needed for every molecule!

I'll write up a summary of what happened, and draw up some nice

graphs, and send it out later today.

Ioan

iraicu@viper:/home/nefedova/alamines> ls fe_* | wc

    488     488    6832

          </pre>

        </blockquote>

      </blockquote>

      <pre wrap="">

      </pre>

    </blockquote>

  </blockquote>

  <pre wrap=""><!---->

  </pre>

</blockquote>

</body>

</html>