[Swift-devel] Re: 244 MolDyn run was successful!

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 13 15:17:06 CDT 2007



Mihael Hategan wrote:
>>> Your analogy is incorrect. In this case the score is kept low because
>>> jobs keep on failing, even after the throttling kicks in.
>>>   
>>>       
>> I would argue against your theory... the last failed job (#12794) failed at
>> 3954 seconds into the experiment, yet the last job overall (#31917) ended
>> at 30600 seconds.
>>     
>
> Strange. I've seen jobs failing all throughout the run. Did something
> make falkon stop sending jobs to that broken node? 
At time xxx, the bad node was deregistered for failing to answer 
notifications.  The graph below shows just the jobs for the bad node.  
For the first hour or so of the experiment, 4 jobs on that node were 
successful (the faint horizontal black lines below, showing their long 
execution times), and the rest all failed (denoted by the small dots, 
showing their short execution times).  Then the node deregistered, and 
did not come back for the rest of the experiment.

[Inline graph of the bad node's jobs was scrubbed by the list archive; likely the moz-screenshot-1.jpg attachment: http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20070813/18953a9a/attachment.jpg]
> Statistically every
> job would have a 1/total_workers chance of going to the one place where
> it shouldn't. Higher if some "good" workers are busy doing actual stuff.
>
>   
>>   There were no failed jobs in the last 26K+ seconds with 19K+ jobs.
>> Now my question is again, why would the score not improve at all
>>     
>
> Quantify "at all". 
Look at 
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg
Do you see how the # of active (green) workers is a relatively flat line at 
around 100 (and this is with the wait queue length at 0, so Swift was 
simply not sending enough work to keep Falkon's 200+ workers busy)?  If 
the score had improved, I would have expected an upward trend in the 
number of active workers!
> Point is it may take quite a few jobs to make up for
> ~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to
> make up for a failed one). Should the probability of jobs failing be
> less than 1/5 in a certain time window, the score should increase.
>   
So you are saying that 19K+ successful jobs were not enough to counteract 
the 10K+ failed jobs from the early part of the experiment?  Can this 
ratio (1:5) be changed?  From this experiment, it would seem that the 
heuristic is a slow learner... maybe you have ideas on how to make it 
quicker to adapt to changes?
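
Just to sanity-check the arithmetic, here is a toy sketch (my assumption of 
how a 1:5 cumulative score would behave, not the actual Swift scheduler 
code) of why 19K successes would not offset 10K failures:

// Toy model of a cumulative score with a 1:5 success/failure weighting.
// This is an illustrative assumption, not the real Swift scheduler code.
public class ScoreSketch {
    public static void main(String[] args) {
        final double successDelta = 1.0;  // credit per successful job
        final double failureDelta = 5.0;  // penalty per failed job (1:5 ratio)

        double score = 0.0;
        score -= 10000 * failureDelta;    // ~10K early failures on the bad node
        score += 19000 * successDelta;    // 19K+ later successes

        // Prints -31000.0: still deeply negative, so the throttle would not
        // open up even after 19K good jobs; it would take ~50K successes.
        System.out.println(score);
    }
}

If that is roughly what the heuristic does, then the flat line of ~100 active 
workers makes sense.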
> In the context in which jobs are sent to non-busy workers, the system
> would tend to produce lots of failed jobs if it takes little time
> (compared to the normal run-time of a job) for a bad worker to fail a
> job. This *IS* why the swift scheduler throttles in the beginning: to
> avoid sending a large number of jobs to a resource that is broken.
>   
But the whole resource is not broken... that is the whole point here.  
Anyway, I think this is a valid case that we need to discuss how to 
handle, to make the entire Swift+Falkon system more robust!

BTW, here is another experiment with MolDyn that shows the throttling 
and this heuristic behaving as I would have expected:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg

Notice that the queue length (blue line) dropped sharply at around 11K 
seconds, but then grew back up.  The sudden drop was many jobs failing 
fast on a bad node, and the sudden growth back up was Swift 
re-submitting almost the same # of jobs that failed back to Falkon.  The 
same thing happened again at around 16K seconds.  Now my question is, 
why did it work so nicely in this experiment, and not in our latest one?  
Could it be that many successful jobs (10K+) had already completed at the 
time of the first failure, and that the failure was short enough that it 
only produced maybe 1K failed jobs?  If this is the reason, then one way 
to level the playing field and handle both cases is to use a sliding 
window when training the heuristic, instead of the entire history.  You 
could then adjust the window size to make the heuristic more responsive 
or more consistent!
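
To make the sliding-window idea concrete, here is a rough sketch (the window 
size and the 1:5 weighting are assumptions on my part, not the current Swift 
implementation):

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window score: only the last N job outcomes count.
public class SlidingWindowScore {
    private final int windowSize;
    private final Deque<Boolean> outcomes = new ArrayDeque<Boolean>();

    public SlidingWindowScore(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record a job outcome, dropping the oldest once the window is full.
    public void record(boolean success) {
        if (outcomes.size() == windowSize) {
            outcomes.removeFirst();
        }
        outcomes.addLast(success);
    }

    // Score over the window only: +1 per success, -5 per failure.
    // Old failures age out, so the score recovers once a bad node is gone.
    public double score() {
        double s = 0.0;
        for (boolean ok : outcomes) {
            s += ok ? 1.0 : -5.0;
        }
        return s;
    }
}

A small window would make the throttle react quickly once a bad node 
deregisters; a large window would make it less jumpy when a short burst of 
jobs fails.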

We should certainly talk about this issue, what needs to be done, and 
who will do it!

Ioan
> Mihael
>
>   
>>  over this large period of time and jobs, as the throttling seems to be
>> relatively constant throughout the experiment (after the failed jobs).
>>
>> Ioan
>>     
>>> Mihael
>>>
>>>   
>>>       
>>>> I believe the normal behavior should allow Swift to recover and again
>>>> submit many tasks to Falkon.  If this heuristic cannot be easily
>>>> tweaked or made to recover from the "window collapse", could we
>>>> disable it when we are running on Falkon at a single site?
>>>>
>>>> BTW, here were the graphs from a previous run when only the last few
>>>> jobs didn't finish due to a bug in the application code.  
>>>> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
>>>> In this run, notice that there were no bad nodes causing many tasks to
>>>> fail; Swift submitted many tasks to Falkon and managed to keep all
>>>> processors busy!
>>>>
>>>> I think we can call the 244-mol MolDyn run a success, both the current
>>>> run and the previous run from 7-16-07 that almost finished!
>>>>
>>>> We need to figure out how to control the job throttling better, and
>>>> perhaps how to automatically detect this plaguing problem with
>>>> "Stale NFS handle", and possibly contain the damage to significantly
>>>> fewer task failures.  I also think that increasing the # of retries
>>>> from Swift's end should be considered when running over Falkon.
>>>> Notice that a single worker can fail as many as 1000 tasks per minute,
>>>> which is a lot of tasks given that when the NFS stale handle shows up,
>>>> it's around for tens of seconds to minutes at a time.  
>>>>
>>>> BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
>>>> and 619.2 wasted) in 8.5 hours.  In contrast, the run we made on
>>>> 7-16-07, which almost finished and behaved much better since there
>>>> were no node failures, consumed about 866.4 CPU hours (866.3 used and
>>>> 0.1 wasted) in 4.18 hours.  
>>>>
>>>> When Nika comes back from vacation, we can try the real application,
>>>> which should consume some 16K CPU hours (service units)!   She also
>>>> has her own temporary allocation at ANL/UC now, so we can use that!
>>>>
>>>> Ioan
>>>>
>>>> Ioan Raicu wrote: 
>>>>     
>>>>         
>>>>> I think the workflow finally completed successfully, but there are
>>>>> still some oddities in the way the logs look (especially job
>>>>> throttling, a few hundred more jobs than I was expecting, etc).  At
>>>>> least, we have all the output we needed for every molecule!
>>>>>
>>>>> I'll write up a summary of what happened, and draw up some nice
>>>>> graphs, and send it out later today.
>>>>>
>>>>> Ioan
>>>>>
>>>>> iraicu at viper:/home/nefedova/alamines> ls fe_* | wc
>>>>>     488     488    6832
>>>>>
>>>>>       
>>>>>           
>>>   
>>>       
>
>
>   