[Swift-devel] excessive rate throttling for apparently temporally-restricted failures

Mihael Hategan hategan at mcs.anl.gov
Mon Oct 29 00:22:39 CDT 2007


On Mon, 2007-10-29 at 00:02 -0500, Ioan Raicu wrote:
> One thing that I noticed is that the site score is quick to react to
> failed jobs, but slow to react to successful jobs, to the point that
> things take a long time to recover once some rough waters were
> encountered. Using the car analogy from below, it would be like coming
> to a downward slope where the cruise control adjusts the throttle
> position from say 25% to 5% while on the downward slope, but then when
> it gets back on flat ground, not going back to 25% for a long time due
> to the previous downward slope. Basically, I think the algorithm
> memory might need some tuning, maybe using a window based memory (as
> opposed to the entire history of memory), or perhaps give higher
> weight to more recent events, weight events according to their
> execution times, reward more successive good jobs to allow the system
> to get back a high score faster if jobs keep completing successfully,
> etc...  certainly lots of things to try out!

Yep. It is a very interesting thing.

It is intentional that bad jobs affect the score differently from good
jobs. Basically I don't want a site with 50% reliability to keep a
constant score. That screws the retries. The ratio between the score
gained for a good job and the score lost for a bad one basically defines
the reliability goal. 1/4 yields an 80% target reliability. With 4 restarts
that's about 99.9% reliability, while the 50% case only gives 93% after
4 restarts. But this assumes the number of concurrent jobs on a site
determines the reliability, which I think is only a rough approximation.
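
As a rough check of the arithmetic above, here is a small Python sketch.
It assumes a good job adds a fixed amount to the score and a bad job
subtracts a fixed amount, so the score holds steady at the success rate
where the two balance. It also reads "4 restarts" as four independent
tries in total. The function names and the independence assumption are
illustrative only and are not taken from the Swift scheduler.

```python
def target_reliability(success_gain: float, failure_penalty: float) -> float:
    """Success rate p at which the expected score change is zero.

    Solves p * success_gain == (1 - p) * failure_penalty for p.
    """
    return failure_penalty / (success_gain + failure_penalty)


def reliability_after_tries(p: float, tries: int) -> float:
    """Probability that at least one of `tries` independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** tries


if __name__ == "__main__":
    # Gain/penalty ratio of 1/4 versus a symmetric 1/1 ratio.
    for gain, penalty in ((1.0, 4.0), (1.0, 1.0)):
        p = target_reliability(gain, penalty)
        print(f"ratio {gain}/{penalty}: target {p:.2f}, "
              f"after 4 tries {reliability_after_tries(p, 4):.4f}")
```

Under these assumptions a 1/4 ratio balances at 0.8, and four tries give
roughly 0.998 at 80% per-try reliability versus roughly 0.94 at 50%.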

But yes, I think it should somehow integrate time dependence better. In
this particular case it should actually account for the fact that a
cluster failing should only register as a single job failing for scoring
purposes.
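
One possible way to sketch that last idea in Python: treat failures that
land close together in time as one burst and score only the first of
them. Everything here is hypothetical, including the event format, the
`window` parameter, and the 1 and 4 score weights (borrowed from the 1/4
ratio above). It is not how the current scorer works.

```python
from typing import Iterable, List, Optional, Tuple

Event = Tuple[float, bool]  # (completion time in seconds, job succeeded?)


def collapse_failure_bursts(events: Iterable[Event], window: float) -> List[Event]:
    """Keep every success, but only the first failure of each burst.

    A burst is the set of failures that occur within `window` seconds of
    the first failure in that burst.
    """
    kept: List[Event] = []
    burst_start: Optional[float] = None
    for t, ok in sorted(events):
        if ok:
            kept.append((t, ok))
        elif burst_start is None or t - burst_start > window:
            # This failure starts a new burst, so it counts.
            burst_start = t
            kept.append((t, ok))
        # Otherwise the failure belongs to the current burst and is ignored.
    return kept


def score(events: Iterable[Event], success_gain: float = 1.0,
          failure_penalty: float = 4.0) -> float:
    """Sum of per-job score changes (assumed additive for illustration)."""
    return sum(success_gain if ok else -failure_penalty for _, ok in events)


if __name__ == "__main__":
    # Three failures within two seconds (one cluster outage), then recovery.
    history = [(0.0, True), (10.0, False), (11.0, False), (12.0, False),
               (100.0, True), (200.0, False)]
    print("raw score:      ", score(history))
    print("collapsed score:", score(collapse_failure_bursts(history, window=30.0)))
```

With the sample history, the three failures at 10 to 12 seconds are
charged once instead of three times, so a single outage costs the site
one penalty rather than one per job that happened to be running.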

Anyway. Lots of refinements can be done here. Know any PhD student
interested?

> 
> Mihael Hategan wrote: 
> > For some reason this quote from that article seems particularly
> > relevant (in that it shows how similar the problem is):
> > 
> > A simple way to implement cruise control is to lock the throttle
> > position when the driver engages cruise control. However, on hilly
> > terrain, the vehicle will slow down going uphill and accelerate going
> > downhill. This type of controller is called an open-loop controller
> > because there is no direct connection between the output of the system
> > (the engine torque) and its input (the throttle position).
> > 
> > In a closed-loop control system, a feedback controller monitors the
> > output (the vehicle's speed) and adjusts the control input (the
> > throttle) as necessary to keep the control error to a minimum (to
> > maintain the desired speed). This feedback dynamically compensates for
> > disturbances to the system, such as changes in slope of the ground or
> > wind speed.
> > 
> > 
> > On Sun, 2007-10-28 at 22:48 -0500, Ioan Raicu wrote:
> >   
> > > Right, I now remember reading that... too many emails, and our
> > > discussion got side-tracked :)
> > > Thanks for the control theory link, it looks like a good read!
> > > 
> > > Ioan
> > > 
> > > Mihael Hategan wrote: 
> > >     
> > > > On Sun, 2007-10-28 at 22:41 -0500, Ioan Raicu wrote:
> > > >   
> > > >       
> > > > > If the knobs are all there, then I don't think there is an issue at
> > > > > the moment.  I think this all started by Ben saying that there was
> > > > > excessive throttling due to the site scoring.  Understanding how to
> > > > > fix the site scoring is one thing.  Being able to disable site scoring
> > > > > is another, which seems to be there already.  Ben, can you turn site
> > > > > scoring off, and see if that solves your problem for now?
> > > > >     
> > > > >         
> > > > You can re-read Ben's earlier answer to your same question. I'll post it
> > > > here:
> > > > 
> > > >   
> > > >       
> > > > > I know how to disable it. I don't particularly want it running rate free.
> > > > > 
> > > > > What's happening here is that the feedback loop is feeding back too much / too
> > > > > fast for the situation I experience.
> > > > > 
> > > > > There's plenty of fun to be had experimenting there; and I suspect there 
> > > > > will be no One True Rate Controller.
> > > > > 
> > > > > 
> > > > >     
> > > > >         
> > > > 
> > > >   
> > > >       
> > > -- 
> > > ============================================
> > > Ioan Raicu
> > > Ph.D. Student
> > > ============================================
> > > Distributed Systems Laboratory
> > > Computer Science Department
> > > University of Chicago
> > > 1100 E. 58th Street, Ryerson Hall
> > > Chicago, IL 60637
> > > ============================================
> > > Email: iraicu at cs.uchicago.edu
> > > Web:   http://www.cs.uchicago.edu/~iraicu
> > >        http://dsl.cs.uchicago.edu/
> > > ============================================
> > > ============================================
> > >     
> > 
> > 
> >   
> 



