[Swift-devel] Re: 244 MolDyn run was successful!

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 13 15:47:11 CDT 2007


On Mon, 2007-08-13 at 15:17 -0500, Ioan Raicu wrote:
> 
> 
> Mihael Hategan wrote: 
> > > > Your analogy is incorrect. In this case the score is kept low because
> > > > jobs keep on failing, even after the throttling kicks in.
> > > >   
> > > >       
> > > I would argue against your theory... the last failed job (#12794) failed
> > > at 3954 seconds into the experiment, yet the last job (#31917) ended at
> > > 30600 seconds.
> > >     
> > 
> > Strange. I've seen jobs failing all throughout the run. Did something
> > make falkon stop sending jobs to that broken node? 
> At time xxx, the bad node was deregistered for failing to answer
> notifications.  The graph below shows just the jobs for the bad node.
> So for the first hour or so of the experiment, there were 4 jobs that
> were successful (these are the faint black lines below that are
> horizontal, showing their long execution time), and the rest all
> failed (denoted by the small dots... showing their short execution
> time).  Then the node de-registered, and did not come back for the
> rest of the experiment.
> 

Ok.

> 
> > Statistically every
> > job would have a 1/total_workers chance of going to the one place where
> > it shouldn't. Higher if some "good" workers are busy doing actual stuff.
> > 
> >   
> > > There were no failed jobs in the last 26K+ seconds with 19K+ jobs.
> > > Now my question is again, why would the score not improve at all
> > >     
> > 
> > Quantify "at all". 
> Look at
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-success-8-10-07/summary_graph-med.jpg!
> Do you see the # of active (green) workers as a relatively flat line
> at around 100 (and this is with the wait queue length being 0, so
> Swift was simply not sending enough work to keep Falkon's 200+ workers
> busy)?  If the score had improved, then I would have expected
> an upward trend in the number of active workers!

small != not at all

> > Point is it may take quite a few jobs to make up for
> > ~10000 failed ones. The ratio is 1/5 (i.e. it takes 5 successful jobs to
> > make up for a failed one). Should the probability of jobs failing be
> > less than 1/5 in a certain time window, the score should increase.
> >   
> So you are saying that 19K+ successful jobs were not enough to
> counteract the 10K+ failed jobs from the early part of the
> experiment? 

Yep. At a 1:5 ratio, 19K successes only make up for 19K/5 = 3.8K failures,
which is less than 10K.

>  Can this ratio (1:5) be changed?

Yes. The scheduler has two relevant properties: successFactor (currently
0.1) and failureFactor (currently -0.5). The term "factor" is used loosely
here, since these values are added to the current score rather than
multiplied.
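
For concreteness, the scoring amounts to roughly the sketch below (a sketch
of the idea only, not the actual Swift scheduler code; the class and method
names are made up, only the two constants come from the discussion above):

    // Sketch of the per-site score bookkeeping described above.
    class SiteScore {
        static final double SUCCESS_FACTOR = 0.1;   // added per successful job
        static final double FAILURE_FACTOR = -0.5;  // added per failed job

        double score = 0.0;

        void jobCompleted(boolean succeeded) {
            score += succeeded ? SUCCESS_FACTOR : FAILURE_FACTOR;
        }
    }

The 1:5 ratio discussed above is just SUCCESS_FACTOR / |FAILURE_FACTOR| =
0.1 / 0.5, so changing either property changes the ratio.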

>   From this experiment, it would seem that the heuristic is a slow
> learner... maybe you have ideas on how to make it quicker to adapt
> to changes?

That could perhaps be done.

> > In the context in which jobs are sent to non-busy workers, the system
> > would tend to produce lots of failed jobs if it takes little time
> > (compared to the normal run-time of a job) for a bad worker to fail a
> > job. This *IS* why the swift scheduler throttles in the beginning: to
> > avoid sending a large number of jobs to a resource that is broken.
> >   
> But the whole resource is not broken... 

No, just slightly more than 1/3 of it. At least that's how it appears
from the outside.

> that is the whole point here... 

That point only holds because you KNOW how things work internally. All Swift
sees is 10K failed jobs out of 29K.

> anyway, I think this is a valid case that we need to discuss how to
> handle, to make the entire Swift+Falkon stack more robust!
> 
> BTW, here is another experiment with MolDyn that shows the throttling
> and this heuristic behaving as I would have expected!
> http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
> 
> Notice the queue length (blue line) at around 11K seconds dropped
> sharply, but then grew back up.  That sudden drop was many jobs
> failing fast on a bad node, and the sudden growth back up was Swift
> re-submitting almost the same # of jobs that failed back to Falkon.

That behavior of failing many jobs quickly is not right, regardless of
whether Swift can deal with it or not. Frankly, I'd rather Swift not be the
part that deals with it, because Swift has to resort to heuristics, whereas
Falkon has direct knowledge of which nodes do what.
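
For example, Falkon could track failures per worker and stop dispatching to
a worker that fails too many jobs in a row. A rough sketch of the idea
(hypothetical code, not part of Falkon; the class name and threshold are
assumptions):

    // Sketch: per-worker consecutive-failure tracking on the dispatcher side.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class WorkerHealth {
        private static final int MAX_CONSECUTIVE_FAILURES = 3; // assumed cutoff
        private final Map<String, Integer> failures = new ConcurrentHashMap<>();

        void recordResult(String workerId, boolean succeeded) {
            if (succeeded) {
                failures.put(workerId, 0);              // reset on success
            } else {
                failures.merge(workerId, 1, Integer::sum);
            }
        }

        // The dispatcher would skip (or deregister) workers for which this is true.
        boolean isSuspect(String workerId) {
            return failures.getOrDefault(workerId, 0) >= MAX_CONSECUTIVE_FAILURES;
        }
    }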

Mihael

>   The same thing happened again at around 16K seconds.  Now my
> question is, why did it work so nicely in this experiment, and not in
> our latest?  Could it be that many jobs (10K+) had already completed
> successfully by the time of the first failure?  And the failure was short
> enough that it only produced maybe 1K failed jobs?  If that is the
> reason, then one way to level the playing field and handle both cases
> is to use a sliding window when training the heuristic, instead of the
> entire history.  You can then adjust the window size to make the
> heuristic more responsive or more consistent!
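
A sliding-window version of the score could look roughly like the sketch
below (illustrative only, not the actual Swift scheduler code; the window
size and names are assumptions):

    // Sketch: only the last WINDOW job outcomes influence the score,
    // so old failures eventually age out.
    import java.util.ArrayDeque;
    import java.util.Deque;

    class WindowedScore {
        private static final int WINDOW = 1000;            // assumed window size
        private static final double SUCCESS_FACTOR = 0.1;
        private static final double FAILURE_FACTOR = -0.5;

        private final Deque<Double> contributions = new ArrayDeque<>();
        private double score = 0.0;

        void jobCompleted(boolean succeeded) {
            double delta = succeeded ? SUCCESS_FACTOR : FAILURE_FACTOR;
            contributions.addLast(delta);
            score += delta;
            if (contributions.size() > WINDOW) {
                score -= contributions.removeFirst();      // oldest outcome ages out
            }
        }

        double score() {
            return score;
        }
    }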
> 
> We should certainly talk about this issue, what needs to be done, and
> who will do it!
> 
> Ioan
> > Mihael
> > 
> >   
> > > over this large period of time and jobs, as the throttling seems to be
> > > relatively constant throughout the experiment (after the failed jobs).
> > > 
> > > Ioan
> > >     
> > > > Mihael
> > > > 
> > > >   
> > > >       
> > > > > I believe the normal behavior should allow Swift to recover and again
> > > > > submit many tasks to Falkon.  If this heuristic cannot be easily
> > > > > tweaked or made to recover from the "window collapse", could we
> > > > > disable it when we are running on Falkon at a single site?
> > > > > 
> > > > > BTW, here were the graphs from a previous run when only the last few
> > > > > jobs didn't finish due to a bug in the application code.  
> > > > > http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed-7-16-07/
> > > > > In this run, notice that there were no bad nodes that caused many
> > > > > tasks to fail, and Swift submitted many tasks to Falkon, and managed
> > > > > to keep all processors busy!  
> > > > > 
> > > > > I think we can call the 244-mol MolDyn run a success, both the current
> > > > > run and the previous run from 7-16-07 that almost finished!
> > > > > 
> > > > > We need to figure out how to control the job throttling better, and
> > > > > perhaps how to automatically detect this plaguing problem with
> > > > > "Stale NFS handle", and possibly contain the damage to significantly
> > > > > fewer task failures.  I also think that increasing the # of retries
> > > > > from Swift's end should be considered when running over Falkon.
> > > > > Notice that a single worker can fail as many as 1000 tasks per minute,
> > > > > which is a lot of tasks given that when the stale NFS handle shows up,
> > > > > it's around for tens of seconds to minutes at a time.  
> > > > > 
> > > > > BTW, the run we just made consumed about 1556.9 CPU hours (937.7 used
> > > > > and 619.2 wasted) in 8.5 hours.  In contrast, the run we made on
> > > > > 7-16-07 which almost finished, but behaved much better since there
> > > > > were no node failures, consumed about 866.4 CPU hours (866.3 used and
> > > > > 0.1 wasted) in 4.18 hours.  
> > > > > 
> > > > > When Nika comes back from vacation, we can try the real application,
> > > > > which should consume some 16K CPU hours (service units)!   She also
> > > > > has her own temporary allocation at ANL/UC now, so we can use that!
> > > > > 
> > > > > Ioan
> > > > > 
> > > > > Ioan Raicu wrote: 
> > > > >     
> > > > >         
> > > > > > I think  the workflow finally completed successfully, but there are
> > > > > > still some oddities in the way the logs look (especially job
> > > > > > throttling, a few hundred more jobs than I was expecting, etc).  At
> > > > > > least, we have all the output we needed for every molecule!
> > > > > > 
> > > > > > I'll write up a summary of what happened, and draw up some nice
> > > > > > graphs, and send it out later today.
> > > > > > 
> > > > > > Ioan
> > > > > > 
> > > > > > iraicu at viper:/home/nefedova/alamines> ls fe_* | wc
> > > > > >     488     488    6832
> > > > > > 
> > > > > >       
> > > > > >           
> > > > 
> > > >       
> > 
> > 
> >   



