[Swift-devel] excessive rate throttling for apparently temporally-restricted failures

Mihael Hategan hategan at mcs.anl.gov
Mon Oct 29 01:07:44 CDT 2007


On Mon, 2007-10-29 at 00:22 -0500, Mihael Hategan wrote:
> On Mon, 2007-10-29 at 00:02 -0500, Ioan Raicu wrote:
> > One thing that I noticed is that the site score is quick to react to
> > failed jobs, but slow to react to successful jobs, to the point that
> > things take a long time to recover once some rough waters were
> > encountered. Using the car analogy from below, it would be like coming
> > to a downward slope where the cruise control adjusts the throttle
> > position from say 25% to 5% while on the downward slope, but then when
> > it gets back on flat ground, not going back to 25% for a long time due
> > to the previous downward slope. Basically, I think the algorithm
> > memory might need some tuning, maybe using a window-based memory (as
> > opposed to the entire history of memory), or perhaps give higher
> > weight to more recent events, weight events according to their
> > execution times, reward more successive good jobs to allow the system
> > to get back a high score faster if jobs keep completing successfully,
> > etc...  certainly lots of things to try out!
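
To make the "higher weight to more recent events" idea concrete, here is
a minimal sketch of a score whose memory decays geometrically. It is not
the scoring code Swift actually uses: the class name DecayedScore, the
decay factor 0.9 and the +1/-4 deltas are assumptions picked only for
illustration. It compares how much a burst of 5 failures still weighs on
the score after 30 successful jobs, with and without decay.

```python
class DecayedScore:
    """Hypothetical site score in which older events fade geometrically."""

    def __init__(self, decay, success_delta=1.0, failure_delta=-4.0):
        self.decay = decay                  # 1.0 means "remember everything"
        self.success_delta = success_delta  # assumed reward for a good job
        self.failure_delta = failure_delta  # assumed penalty for a bad job
        self.value = 0.0

    def record(self, ok):
        # Shrink the accumulated history, then add the newest event.
        delta = self.success_delta if ok else self.failure_delta
        self.value = self.decay * self.value + delta
        return self.value


def replay(decay, events):
    """Feed a sequence of job outcomes (True = success) into a fresh score."""
    score = DecayedScore(decay)
    for ok in events:
        score.record(ok)
    return score.value


clean = [True] * 30            # 30 good jobs, no trouble beforehand
rough = [False] * 5 + clean    # the same 30 good jobs after 5 failures

# How far the failure burst still drags the score down after recovery.
gap_plain = replay(1.0, clean) - replay(1.0, rough)
gap_decayed = replay(0.9, clean) - replay(0.9, rough)

print(f"penalty left without decay: {gap_plain:.2f}")
print(f"penalty left with decay:    {gap_decayed:.2f}")
```

With full-history memory the burst costs the same 20 points forever,
while with decay most of it has faded after 30 jobs. The price is that a
decayed score also saturates, so the throttle mapping would need to be
retuned for it.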
> 
> Yep. It is a very interesting thing.
> 
> It is intentional that bad jobs affect the score differently from good
> jobs. Basically I don't want a site with 50% reliability to keep a
> constant score. That screws up the retries. That ratio basically
> defines the reliability goal. 1/4 yields an 80% target reliability.
> With 4 restarts that's about 99.9% reliability, while the 50% case only
> gives 93% after 4 restarts. But this assumes the number of concurrent
> jobs on a site determines the reliability, which I think is only a
> rough approximation.
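
The arithmetic behind those numbers can be sketched as follows. Two
assumptions here are mine and are not stated above: that "1/4" means a
good job adds 1 to the score for every 4 a bad job removes, and that the
quoted percentages count 4 attempts in total (with 4 restarts after a
first attempt the figures would be somewhat higher). The function names
are made up for the example.

```python
def equilibrium_success_rate(success_gain, failure_penalty):
    """Success probability at which the expected score change is zero.

    Solves p * success_gain - (1 - p) * failure_penalty = 0 for p.
    """
    return failure_penalty / (success_gain + failure_penalty)


def reliability_after_attempts(p_success, attempts):
    """Probability that at least one of `attempts` independent tries works."""
    return 1.0 - (1.0 - p_success) ** attempts


# Assumed reading of the 1/4 ratio: +1 per good job, -4 per bad job.
target = equilibrium_success_rate(1.0, 4.0)

print(f"target reliability:       {target:.2f}")
print(f"80% site, 4 attempts:     {reliability_after_attempts(0.8, 4):.4f}")
print(f"50% site, 4 attempts:     {reliability_after_attempts(0.5, 4):.4f}")
```

Under these assumptions the score of a site only holds steady at 80%
success, and four attempts give roughly 99.8% on such a site versus
roughly 94% on a 50% site, which is in line with the figures quoted.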

Also, the value used for throttling isn't linear. It's something like
e^(B*arctan(C*score)), where B and C are empirically determined
constants. Try plotting it with gnuplot to see what it looks like*.
This form is there to satisfy a few requirements:
- Stability (the output is bounded; this function gives something
stronger than BIBO stability in process control, because the output
stays bounded even for unbounded input. It may therefore be too strict,
and one source of the problems)
- Tweakability of the first derivative around 0. This basically dictates
how the output grows around the origin (i.e. when the workflow starts,
how do we allow the throttle to grow)
- Continuity (I think this is there to make sure there are no crazy
oscillations, although it's a bit silly because this is really a
discrete-time system, so they may happen anyway)
- Continuity of the first derivative (just felt elegant, so it probably
has some meaning).
- Lower bound strictly positive (in this case it's actually
1/upper_bound) - i.e. we always leave some small odds that a job will
eventually be sent to a site to allow it to increase its score.

Anyway, this is rather rudimentary, but somewhat effective. I think one
problem is that this function is only indirectly part of the feedback
loop (it only affects the throttling, not the state/score). The proper
way to do this is to specify the whole system as accurately as possible
and actually model the transfer function, but that takes nasty
mathematics that I wasn't in the mood for at the time, and that I don't
recall well enough (if I ever had that knowledge). That, or
Matlab/Simulink.

(*) C = 0.2, B = 2 * ln(T)/PI, where T = 100 (the upper bound and the
inverse of the lower bound).
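
For anyone without gnuplot at hand, here is a small sketch that tabulates
the function with the constants above and checks the properties listed
earlier. The formula and the values of B, C and T are as given in this
message; the function names and the sample scores are just picked for the
example, and the actual code may differ in detail since the formula is
only "something like" the real one.

```python
import math

T = 100.0                         # upper bound; the lower bound is 1/T
C = 0.2
B = 2.0 * math.log(T) / math.pi   # makes B * (pi/2) equal to ln(T)


def throttle(score):
    """Throttle value e^(B*arctan(C*score)) for a given site score."""
    return math.exp(B * math.atan(C * score))


def slope_at_zero():
    """First derivative at score 0; by the chain rule it equals B * C."""
    return B * C


# Sample scores chosen arbitrarily to show the shape of the curve.
for score in (-1000, -20, -5, 0, 5, 20, 1000):
    print(f"score {score:>6}: throttle {throttle(score):9.4f}")

print(f"slope at the origin: {slope_at_zero():.4f}")
```

Since arctan tends to +/-pi/2, the output stays between 1/T and T no
matter how large the score gets, it equals 1 at a score of 0, and C
(through B*C) is the knob that controls how fast the throttle opens up
near the origin.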

> 
> But yes, I think it should somehow integrate time dependence better. In
> this particular case it should actually account for the fact that a
> cluster failing should only register as a single job failing for scoring
> purposes.
> 
> Anyway. Lots of refinements can be done here. Know any PhD student
> interested?
> 
> > 
> > Mihael Hategan wrote: 
> > > For some reason this quote from that article seems particularly
> > > relevant (in that it shows how similar the problem is):
> > > 
> > > A simple way to implement cruise control is to lock the throttle
> > > position when the driver engages cruise control. However, on hilly
> > > terrain, the vehicle will slow down going uphill and accelerate going
> > > downhill. This type of controller is called an open-loop controller
> > > because there is no direct connection between the output of the system
> > > (the engine torque) and its input (the throttle position).
> > > 
> > > In a closed-loop control system, a feedback controller monitors the
> > > output (the vehicle's speed) and adjusts the control input (the
> > > throttle) as necessary to keep the control error to a minimum (to
> > > maintain the desired speed). This feedback dynamically compensates for
> > > disturbances to the system, such as changes in slope of the ground or
> > > wind speed.
> > > 
> > > 
> > > On Sun, 2007-10-28 at 22:48 -0500, Ioan Raicu wrote:
> > >   
> > > > Right, I now remember reading that... too many emails, and our
> > > > discussion got side-tracked :)
> > > > Thanks for the control theory link, it looks like a good read!
> > > > 
> > > > Ioan
> > > > 
> > > > Mihael Hategan wrote: 
> > > >     
> > > > > On Sun, 2007-10-28 at 22:41 -0500, Ioan Raicu wrote:
> > > > >   
> > > > >       
> > > > > > If the knobs are all there, then I don't think there is an issue at
> > > > > > the moment.  I think this all started by Ben saying that there was
> > > > > > excessive throttling due to the site scoring.  Understanding how to
> > > > > > fix the site scoring is one thing.  Being able to disable site scoring
> > > > > > is another, which seems to be there already.  Ben, can you turn site
> > > > > > scoring off, and see if that solves your problem for now?
> > > > > >     
> > > > > >         
> > > > > You can re-read Ben's earlier answer to your same question. I'll post it
> > > > > here:
> > > > > 
> > > > >   
> > > > >       
> > > > > > I know how to disable it. I don't particularly want it running rate free.
> > > > > > 
> > > > > > > What's happening here is that the feedback loop is feeding back too
> > > > > > > much / too fast for the situation I experience.
> > > > > > 
> > > > > > There's plenty of fun to be had experimenting there; and I suspect there 
> > > > > > will be no One True Rate Controller.
> > > > > > 
> > > > > > 
> > > > > >     
> > > > > >         
> > > > > 
> > > > >   
> > > > >       
> > > > -- 
> > > > ============================================
> > > > Ioan Raicu
> > > > Ph.D. Student
> > > > ============================================
> > > > Distributed Systems Laboratory
> > > > Computer Science Department
> > > > University of Chicago
> > > > 1100 E. 58th Street, Ryerson Hall
> > > > Chicago, IL 60637
> > > > ============================================
> > > > Email: iraicu at cs.uchicago.edu
> > > > Web:   http://www.cs.uchicago.edu/~iraicu
> > > >        http://dsl.cs.uchicago.edu/
> > > > ============================================
> > > > ============================================
> > > >     
> > > 
> > > 
> > >   
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 



