[Swift-devel] Re: 244 MolDyn run was successful!
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Aug 13 15:17:06 CDT 2007
Mihael Hategan wrote:
>>> Your analogy is incorrect. In this case the score is kept low because
>>> jobs keep on failing, even after the throttling kicks in.
>>>
>>>
>> I would argue against your theory... the last failed job (#12794) failed
>> at 3954 seconds into the experiment, yet the last job overall (#31917)
>> ended at 30600 seconds.
>>
>
> Strange. I've seen jobs failing all throughout the run. Did something
> make falkon stop sending jobs to that broken node?
At time xxx, the bad node was deregistered for failing to answer
notifications. The graph below shows just the jobs for the bad node.
For the first hour or so of the experiment, 4 jobs were successful
(the faint horizontal black lines, showing their long execution
times), and the rest all failed (the small dots, showing their short
execution times). Then the node de-registered and did not come back
for the rest of the experiment.
> Statistically every
> job would have a 1/total_workers chance of going to the one place where
> it shouldn't. Higher if some "good" workers are busy doing actual stuff.
>
>
>> There were no failed jobs in the last 26K+ seconds, during which 19K+
>> jobs ran. Now my question is again: why would the score not improve at all
>>
>
> Quantify "at all".
Look at
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg!
Do you see how the # of active (green) workers is a relatively flat line
at around 100 (and this is with the wait queue length being 0, so Swift
was simply not sending enough work to keep Falkon's 200+ workers busy)?
If the score had improved, I would have expected an upward trend in the
number of active workers!
> Point is it may take quite a few jobs to make up for
> ~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to
> make up for a failed one). Should the probability of jobs failing be
> less than 1/5 in a certain time window, the score should increase.
>
So you are saying that 19K+ successful jobs were not enough to counteract
the 10K+ failed jobs from the early part of the experiment? Can this
ratio (1:5) be changed? From this experiment, it would seem that the
heuristic is a slow learner... maybe you have ideas on how to make it
quicker to adapt to changes? To check my understanding of the math, see
the sketch below.
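Just so we are talking about the same thing, here is a toy model in Java.
None of the names or constants come from the real scheduler code; the only
things taken from this thread are the 1:5 success/failure ratio and the
idea that the allowed parallelism is derived from the score. With those
numbers, 10K failures need roughly 50K successes to cancel out, which
would explain a flat worker count:

    // Toy model of the score-based throttling being discussed; everything
    // here is a guess for illustration, NOT the actual Swift scheduler code.
    public class SiteScoreModel {
        // one failure cancels out five successes (the 1:5 ratio)
        private static final double SUCCESS_DELTA = 1.0;
        private static final double FAILURE_DELTA = -5.0;

        private double score = 0.0;

        public void jobSucceeded() { score += SUCCESS_DELTA; }
        public void jobFailed()    { score += FAILURE_DELTA; }

        // Hypothetical mapping from score to allowed concurrency; the real
        // scheduler almost certainly uses a different function.
        public int allowedParallelism(int maxWorkers) {
            double fraction = 1.0 / (1.0 + Math.exp(-score / 1000.0));
            return Math.max(1, (int) (fraction * maxWorkers));
        }

        public static void main(String[] args) {
            SiteScoreModel m = new SiteScoreModel();
            for (int i = 0; i < 10000; i++) m.jobFailed();     // early bad-node failures
            for (int i = 0; i < 19000; i++) m.jobSucceeded();  // the rest of the run
            // 10000 * 5 = 50000 points lost; 19000 successes still leave the
            // score deeply negative, so the allowed parallelism stays near 1.
            System.out.println("score = " + m.score
                    + ", allowed workers = " + m.allowedParallelism(200));
        }
    }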
> In the context in which jobs are sent to non-busy workers, the system
> would tend to produce lots of failed jobs if it takes little time
> (compared to the normal run-time of a job) for a bad worker to fail a
> job. This *IS* why the swift scheduler throttles in the beginning: to
> avoid sending a large number of jobs to a resource that is broken.
>
But the whole resource is not broken... that is the whole point here.
Anyway, I think this is a valid case that we need to discuss how to
handle, to make the entire Swift+Falkon system more robust! One option,
sketched below, would be for Falkon to suspend a worker after a few
consecutive failures.
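Something along these lines (the class and method names are invented for
illustration; this is not existing Falkon code) would keep one bad node
from failing thousands of tasks per minute:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical Falkon-side mitigation: pull a worker out of the
    // dispatch pool after a few consecutive failures, so a node with a
    // stale NFS handle cannot keep eating and failing tasks.
    public class WorkerFailureTracker {
        private static final int MAX_CONSECUTIVE_FAILURES = 3;
        private final Map<String, Integer> consecutiveFailures =
                new HashMap<String, Integer>();

        /** Call when a task fails; returns true if the worker should be suspended. */
        public synchronized boolean recordFailure(String workerId) {
            Integer n = consecutiveFailures.get(workerId);
            int count = (n == null) ? 1 : n + 1;
            consecutiveFailures.put(workerId, count);
            return count >= MAX_CONSECUTIVE_FAILURES;
        }

        /** Call when a task succeeds; a success resets the failure streak. */
        public synchronized void recordSuccess(String workerId) {
            consecutiveFailures.remove(workerId);
        }
    }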
BTW, here is another experiment with MolDyn that shows the throttling
and this heuristic behaving as I would have expected!
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
Notice that the queue length (blue line) at around 11K seconds dropped
sharply, but then grew back up. That sudden drop was many jobs failing
fast on a bad node, and the sudden growth back up was Swift re-submitting
almost the same # of jobs that had failed back to Falkon. The same thing
happened again at around 16K seconds. Now my question is, why did it work
so nicely in this experiment, and not in our latest?
Could it be that there were many successful jobs done (10K+) at the time
of the first failure? And that the failure was short enough that it only
produced maybe 1K failed jobs? If this is the reason, then one way to
level the playing field and handle both cases is to use a sliding window
when training the heuristic, instead of the entire history. You could
then adjust the window size to make the heuristic more responsive or more
consistent; a rough sketch of what I mean is below.
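Again, the names here are made up and I realize the real heuristic is more
involved; the point is just that with a window of, say, 1000 jobs, the 10K
early failures would age out after a few thousand successes instead of
dragging the score down for the rest of the run:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of the sliding-window idea: keep only the outcomes of the
    // last N jobs, and drive the throttle off the recent failure rate
    // rather than the entire history.
    public class SlidingWindowScore {
        private final int windowSize;
        private final Deque<Boolean> outcomes = new ArrayDeque<Boolean>();
        private int failuresInWindow = 0;

        public SlidingWindowScore(int windowSize) {
            this.windowSize = windowSize;
        }

        public synchronized void record(boolean success) {
            outcomes.addLast(success);
            if (!success) failuresInWindow++;
            if (outcomes.size() > windowSize) {
                Boolean oldest = outcomes.removeFirst();
                if (!oldest) failuresInWindow--;   // an old failure ages out
            }
        }

        /** Fraction of the last windowSize jobs that failed. */
        public synchronized double recentFailureRate() {
            return outcomes.isEmpty() ? 0.0
                    : (double) failuresInWindow / outcomes.size();
        }
    }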
We should certainly talk about this issue, what needs to be done, and who
will do it!
Ioan
> Mihael
>
>
>> over this large period of time and jobs, as the throttling seems to be
>> relatively constant throughout the experiment (after the failed jobs).
>>
>> Ioan
>>
>>> Mihael
>>>
>>>
>>>
>>>> I believe the normal behavior should allow Swift to recover and again
>>>> submit many tasks to Falkon. If this heuristic cannot be easily
>>>> tweaked or made to recover from the "window collapse", could we
>>>> disable it when we are running on Falkon at a single site?
>>>>
>>>> BTW, here were the graphs from a previous run when only the last few
>>>> jobs didn't finish due to a bug in the application code.
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>>> In this run, notice that there were no bad nodes that caused many
>>>> tasks to fail, and Swift submitted many tasks to Falkon, and managed
>>>> to keep all processors busy!
>>>>
>>>> I think we can call the 244-mol MolDyn run a success, both the current
>>>> run and the previous run from 7-16-07 that almost finished!
>>>>
>>>> We need to figure out how to control the job throttling better, and
>>>> perhaps how to automatically detect this plaguing problem with
>>>> "Stale NFS handle", and possibly contain the damage to significantly
>>>> fewer task failures. I also think that increasing the # of retries
>>>> from Swift's end should be considered when running over Falkon.
>>>> Notice that a single worker can fail as many as 1000 tasks per minute,
>>>> which is a lot of tasks given that when the NFS stale handle shows up,
>>>> it's around for tens of seconds to minutes at a time.
>>>>
>>>> BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
>>>> and 619.2 wasted) in 8.5 hours. In contrast, the run we made on
>>>> 7-16-07, which almost finished and behaved much better since there
>>>> were no node failures, consumed about 866.4 CPU hours (866.3 used and
>>>> 0.1 wasted) in 4.18 hours.
>>>>
>>>> When Nika comes back from vacation, we can try the real application,
>>>> which should consume some 16K CPU hours (service units)! She also
>>>> has her own temporary allocation at ANL/UC now, so we can use that!
>>>>
>>>> Ioan
>>>>
>>>> Ioan Raicu wrote:
>>>>
>>>>
>>>>> I think the workflow finally completed successfully, but there are
>>>>> still some oddities in the way the logs look (especially job
>>>>> throttling, a few hundred more jobs than I was expecting, etc). At
>>>>> least we have all the output we needed for every molecule!
>>>>>
>>>>> I'll write up a summary of what happened, and draw up some nice
>>>>> graphs, and send it out later today.
>>>>>
>>>>> Ioan
>>>>>
>>>>> iraicu at viper:/home/nefedova/alamines> ls fe_* | wc
>>>>> 488 488 6832
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: moz-screenshot-1.jpg
Type: image/jpeg
Size: 24336 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20070813/18953a9a/attachment.jpg>