<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<br>
Mihael Hategan wrote:
<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"
type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">Your analogy is incorrect. In this case the score is kept low because
jobs keep on failing, even after the throttling kicks in.
</pre>
</blockquote>
<pre wrap="">I would argue against your theory.... the last job (#12794) failed at
3954 seconds into the experiment, yet the last job (#31917) was ended
at 30600 seconds.
</pre>
</blockquote>
<pre wrap=""><!---->
Strange. I've seen jobs failing all throughout the run. Did something
make falkon stop sending jobs to that broken node? </pre>
</blockquote>
At time xxx, the bad node was deregistered for failing to answer
notifications. The graph below shows just the jobs for the bad node.
So for the first hour or so of the experiment, 4 jobs were successful
(these are the faint black horizontal lines below, showing their long
execution times), and the rest all failed (the small dots, showing their
short execution times). Then the node deregistered and did not come back
for the rest of the experiment.<br>
<br>
<br>
<img src="cid:part1.00030207.00090507@cs.uchicago.edu" alt=""><br>
<br>
<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">Statistically every
job would have a 1/total_workers chance of going to the one place where
it shouldn't. Higher if some "good" workers are busy doing actual stuff.
</pre>
<blockquote type="cite">
<pre wrap=""> There were no failed jobs in the last 26K+ seconds with 19K+ jobs.
Now my question is again, why would the score not improve at all
</pre>
</blockquote>
<pre wrap=""><!---->
Quantify "at all". </pre>
</blockquote>
Look at
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg</a>!<br>
Do you see how the number of active (green) workers is a relatively flat
line at around 100 (and this is with the wait queue length at 0, so Swift
was simply not sending enough work to keep Falkon's 200+ workers busy)?
If the score had improved, I would have expected an upward trend in the
number of active workers!<br>
<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">Point is it may take quite a few jobs to make up for
~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to
make up for a failed one). Should the probability of jobs failing be
less than 1/5 in a certain time window, the score should increase.
</pre>
</blockquote>
So you are saying that 19K+ successful jobs were not enough to
counteract the 10K+ failed jobs from the early part of the experiment?
Can this ratio (1:5) be changed? From this experiment, it would seem
that the heuristic is a slow learner... maybe you have ideas on how to
make it quicker to adapt to changes?<br>
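<br>
Just so we are looking at the same mechanics, here is a minimal sketch of
the kind of score bookkeeping I understand you to be describing; this is
NOT the actual Swift scheduler code, the class name, constants, and the
score-to-throttle mapping are just my assumptions, written down so we can
argue about concrete numbers:<br>
<pre>
// Hypothetical per-site score heuristic (illustrative only, not Swift's code).
class SiteScore {
    // It takes 5 successes to undo 1 failure, i.e. the 1:5 ratio.
    static final double SUCCESS_INCREMENT = 0.2; // 1/5
    static final double FAILURE_DECREMENT = 1.0;

    private double score = 0.0; // higher score means a wider submission window

    void jobSucceeded() { score += SUCCESS_INCREMENT; }
    void jobFailed()    { score -= FAILURE_DECREMENT; }

    // Map the score to how many jobs the scheduler keeps in flight at this
    // site: a bad history collapses the window, a good history slowly
    // opens it back up.
    int allowedConcurrentJobs() {
        int base = 2;                        // slow start
        int extra = (int) Math.max(0.0, score);
        return Math.min(base + extra, 256);  // cap at the available workers
    }
}
</pre>
Plugging in our numbers, 10K failures and 19K successes would leave such a
score around -10000 + 19000/5 = -6200, which would explain why the
submission window never opened back up.<br>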
<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">
In the context in which jobs are sent to non-busy workers, the system
would tend to produce lots of failed jobs if it takes little time
(compared to the normal run-time of a job) for a bad worker to fail a
job. This *IS* why the swift scheduler throttles in the beginning: to
avoid sending a large number of jobs to a resource that is broken.
</pre>
</blockquote>
But not the whole resource is broken... that is the whole point here...
Anyway, I think this is a case we need to discuss how to handle, to make
the entire Swift+Falkon system more robust!<br>
<br>
BTW, here is another experiment with MolDyn that shows the throttling
and this heuristic behaving as I would have expected!<br>
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg</a><br>
<br>
Notice that the queue length (blue line) dropped sharply at around 11K
seconds, but then grew back up. That sudden drop was many jobs failing
fast on a bad node, and the sudden growth back up was Swift re-submitting
almost the same # of jobs that had failed back to Falkon.
The same thing happened again at around 16K seconds. Now my question is,
why did it work so nicely in this experiment, and not in our latest one?
Could it be that many jobs (10K+) had already completed successfully by
the time of the first failure, and that the failure was short enough that
it only produced maybe 1K failed jobs? At the 1:5 ratio, ~1K failures
need only ~5K successes to offset, and the 10K+ already-completed jobs
easily cover that. If this is the reason, then one way to level the
playing field and handle both cases is to use a sliding window when
training the heuristic, instead of the entire history. The window size
could then be tuned to make the heuristic more responsive or more stable
(see the sketch below).<br>
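<br>
To be concrete about the sliding-window idea, here is a small sketch of
what I mean (again just an illustration with made-up names and
parameters, not Swift code): keep only the last N outcomes per site, so a
burst of failures ages out of the score once enough new jobs have
completed.<br>
<pre>
// Hypothetical sliding-window variant of the score heuristic; the window
// size and weights are illustrative, not Swift's actual parameters.
class WindowedSiteScore {
    private static final double SUCCESS_INCREMENT = 0.2; // same 1:5 ratio
    private static final double FAILURE_DECREMENT = 1.0;

    private final double[] window; // circular buffer of recent score deltas
    private int next = 0;          // slot to overwrite next
    private int filled = 0;        // how many slots hold real outcomes
    private double score = 0.0;

    WindowedSiteScore(int windowSize) {
        this.window = new double[windowSize];
    }

    void record(boolean success) {
        double delta = success ? SUCCESS_INCREMENT : -FAILURE_DECREMENT;
        if (filled == window.length) {
            // The oldest outcome ages out, so a burst of failures stops
            // dragging the score down once enough new jobs have run.
            score -= window[next];
        } else {
            filled++;
        }
        window[next] = delta;
        score += delta;
        next = (next + 1) % window.length;
    }

    double score() { return score; }
}
</pre>
A small window (say 1K jobs) would let the heuristic recover quickly once
a bad node deregisters, at the cost of reacting more nervously to short
bursts of failures; a large window behaves more like the current
full-history score.<br>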
<br>
We should certainly talk about this issue, what needs to be done, and
who will do it!<br>
<br>
Ioan<br>
<blockquote cite="mid:1186978892.21992.12.camel@blabla.mcs.anl.gov"
type="cite">
<pre wrap="">
Mihael
</pre>
<blockquote type="cite">
<pre wrap=""> over this large period of time and jobs, as the throtling seems to be
relatively constant throughout the experiment (after the failed jobs).
Ioan
</pre>
<blockquote type="cite">
<pre wrap="">Mihael
</pre>
<blockquote type="cite">
<pre wrap="">I believe the normal behavior should allow Swift to recover and again
submit many tasks to Falkon. If this heuristic cannot be easily
tweaked or made to recover from the "window collapse", could we
disable it when we are running on Falkon at a single site?
BTW, here were the graphs from a previous run when only the last few
jobs didn't finish due to a bug in the application code.
<a class="moz-txt-link-freetext" href="http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/">http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/</a>
In this run, notice that there were no bad nodes that caused many
tasks to fail, and Swift submitted many tasks to Falkon, and managed
to keep all processors busy!
I think we can call the 244-mol MolDyn run a success, both the current
run and the previous run from 7-16-07 that almost finished!
We need to figure out how to control the job throttling better, and
perhaps how to automatically detect this plaguing problem with
"Stale NFS handle", and possibly contain the damage to significantly
fewer task failures. I also think that increasing the # of retries
from Swift's end should be considered when running over Falkon.
Notice that a single worker can fail as many as 1000 tasks per minute,
which is a lot of tasks given that when the NFS stale handle shows up,
it is around for tens of seconds to minutes at a time.
BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
and 619.2 wasted) in 8.5 hours. In contrast, the run we made on
7-16-07, which almost finished and behaved much better since there
were no node failures, consumed about 866.4 CPU hours (866.3 used and
0.1 wasted) in 4.18 hours.
When Nika comes back from vacation, we can try the real application,
which should consume some 16K CPU hours (service units)! She also
has her own temporary allocation at ANL/UC now, so we can use that!
Ioan
Ioan Raicu wrote:
</pre>
<blockquote type="cite">
<pre wrap="">I think the workflow finally completed successfully, but there are
still some oddities in the way the logs look (especially job
throttling, a few hundred more jobs than I was expecting, etc). At
least, we have all the output we needed for every molecule!
I'll write up a summary of what happened, and draw up some nice
graphs, and send it out later today.
Ioan
iraicu@viper:/home/nefedova/alamines> ls fe_* | wc
488 488 6832
</pre>
</blockquote>
</blockquote>
<pre wrap="">
</pre>
</blockquote>
</blockquote>
<pre wrap=""><!---->
</pre>
</blockquote>
</body>
</html>