[Swift-devel] excessive rate throttling for apparently temporally-restricted failures

Mihael Hategan hategan at mcs.anl.gov
Sun Oct 28 22:00:22 CDT 2007


On Sun, 2007-10-28 at 20:05 -0500, Ioan Raicu wrote:
> This might be so, but when a user comes across behavior that is
> significantly sub-optimal (such as sending few jobs that don't utilize
> all the nodes at a site), they will want knobs to manually tune things
> to be closer to optimal (in their opinion).

This is a hindsight kind of "sub-optimal".

>   That said, you are probably right that the default setting should be
> completely automated, but there should be knobs that can be turned on,
> off, up, down, etc... to allow the user to avoid the bad behavior.

Yes, although it may at times do more harm than good, if done without a
reasonable understanding of the issue. The more complex the problem, the
more likely the users will fill in inappropriate values. The problem
right now is that the current algorithms will not reasonably make users
happy if they insist on optimality. We should try to change that.

>  For example, this means allowing the user to turn off site scoring.  

Hmm. I think this went in the wrong direction. I assumed you knew that
these things can be turned off, since we discussed this before and the
settings can be seen in swift.properties. So I thought this was meant as
an attempt to understand the larger issue (as much on your side as on
mine).

In particular:
throttle.score.job.factor=off

Mihael
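
[Editorial sketch, not part of the original message.] The two
score-independent throttles discussed in the quoted thread below (at most
X outstanding jobs, and at most Y submissions per second) might look
roughly like the following. This is a minimal illustration in Python, not
Swift's actual scheduler code; the class name, method names, and the
injectable clock are all hypothetical.

```python
import time


class SubmitThrottle:
    """Hypothetical throttle combining two limits from the thread:
    X = max_outstanding jobs in flight, Y = max_rate submissions/sec.
    No site score is involved."""

    def __init__(self, max_outstanding, max_rate, clock=time.monotonic):
        self.max_outstanding = max_outstanding
        self.min_interval = 1.0 / max_rate  # seconds between submissions
        self.clock = clock                  # injectable for testing
        self.outstanding = 0
        self.last_submit = None

    def can_submit(self):
        # Limit X: refuse when too many jobs are already outstanding.
        if self.outstanding >= self.max_outstanding:
            return False
        # Limit Y: refuse when the previous submission was too recent.
        if self.last_submit is None:
            return True
        return self.clock() - self.last_submit >= self.min_interval

    def submitted(self):
        # Record a submission: one more job in flight, timestamp updated.
        self.outstanding += 1
        self.last_submit = self.clock()

    def finished(self):
        # A job completed (or failed): one fewer job in flight.
        self.outstanding -= 1
```

Note that both limits here are fixed by the user, so, as pointed out in
the thread, a sketch like this does not react to changing load on the
service or to differences between sites.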

> 
> This is not the first time we are having this discussion, and I only
> brought up these points again since Ben started up the discussion.  I
> think we all have our opinions, and in the end, I am not the one who
> will be implementing these knobs, so feel free to do what you think is
> best!
> 
> Ioan
> 
> Mihael Hategan wrote: 
> > On Sun, 2007-10-28 at 15:05 -0500, Ioan Raicu wrote:
> >   
> > > But my argument was, and still is, if there is only one site to submit
> > > to, changing situations are almost irrelevant,
> > >     
> > 
> > Missed that. It is not irrelevant. The speed/capacity of a service is
> > determined by: the jobs you submit, the jobs others submit, the specific
> > type of hardware, and the load on the service node (and other things
> > like network latency). The jobs others submit and the load on the service
> > node vary with time. The bad thing about them is that it's hard to
> > predict how they affect things.
> > 
> > Furthermore, user-specified rates suffer fundamentally from the problem
> > of the user having to understand how the whole thing works and picking
> > good values. What I've observed is that this doesn't work very well.
> > 
> >   
> > > as there are no options anyhow.  Give me one example, where you have
> > > only 1 site, set X and Y properly, yet you need site scores as an
> > > additional throttling mechanism!
> > > 
> > > Mihael Hategan wrote: 
> > >     
> > > > On Sun, 2007-10-28 at 11:23 -0500, Ioan Raicu wrote:
> > > >   
> > > >       
> > > > > I mentioned 2 throttling mechanisms, one is to have X outstanding jobs
> > > > > at any given time (limits jobs in the queue), and Y jobs/sec
> > > > > submit rate (limits the rate of submission).  I believe both of these
> > > > > throttling mechanisms could exist without computing site scores,
> > > > > assuming the user knows what to set X and Y to.
> > > > >     
> > > > >         
> > > > They do exist, but they don't deal with asymmetries between sites. Nor
> > > > do they deal with changing situations.
> > > > 
> > > >   
> > > >       
> > > > > Ioan
> > > > > 
> > > > > Mihael Hategan wrote: 
> > > > >     
> > > > >         
> > > > > > On Sun, 2007-10-28 at 10:25 -0500, Ioan Raicu wrote:
> > > > > >   
> > > > > >       
> > > > > >           
> > > > > > > Assuming you have a single site to submit to, then I don't see why you
> > > > > > > don't want to disable the site scoring altogether?
> > > > > > >     
> > > > > > >         
> > > > > > >             
> > > > > > Because having too many jobs on that one site may still cause problems.
> > > > > > 
> > > > > > That said, the algorithm currently there needs some work.
> > > > > > 
> > > > > >   
> > > > > >       
> > > > > >           
> > > > > > > Of course you still want throttling, but that is more on the level
> > > > > > > of X outstanding jobs at any given time (and possibly Y jobs/sec
> > > > > > > submit rate), so you don't overrun the LRM, but you would not want to
> > > > > > > lower X to some low value just because some jobs are failing.  Again,
> > > > > > > once you go to multi-site runs, you need the site scoring to decide
> > > > > > > among the different sites, but with a single site, I see no drawbacks
> > > > > > > to disabling the site scoring mechanism.  
> > > > > > > 
> > > > > > > Ioan
> > > > > > > 
> > > > > > > Ben Clifford wrote: 
> > > > > > >     
> > > > > > >         
> > > > > > >             
> > > > > > > > On Sun, 28 Oct 2007, Ioan Raicu wrote:
> > > > > > > > 
> > > > > > > >   
> > > > > > > >       
> > > > > > > >           
> > > > > > > >               
> > > > > > > > > they were due to the stale NFS handle error.  I think Mihael outlined in an
> > > > > > > > > email a while back how to disable the task submission throttling due to a bad
> > > > > > > > > score, assuming that you have a single site to submit to anyways. 
> > > > > > > > >     
> > > > > > > > >         
> > > > > > > > >             
> > > > > > > > >                 
> > > > > > > > I know how to disable it. I don't particularly want it running rate free.
> > > > > > > > 
> > > > > > > > > What's happening here is that the feedback loop is feeding back too much / too 
> > > > > > > > > fast for the situation I'm experiencing.
> > > > > > > > 
> > > > > > > > There's plenty of fun to be had experimenting there; and I suspect there 
> > > > > > > > will be no One True Rate Controller.
> > > > > > > > 
> > > > > > > >   
> > > > > > > >       
> > > > > > > >           
> > > > > > > >               
> > > > > > > -- 
> > > > > > > ============================================
> > > > > > > Ioan Raicu
> > > > > > > Ph.D. Student
> > > > > > > ============================================
> > > > > > > Distributed Systems Laboratory
> > > > > > > Computer Science Department
> > > > > > > University of Chicago
> > > > > > > 1100 E. 58th Street, Ryerson Hall
> > > > > > > Chicago, IL 60637
> > > > > > > ============================================
> > > > > > > Email: iraicu at cs.uchicago.edu
> > > > > > > Web:   http://www.cs.uchicago.edu/~iraicu
> > > > > > >        http://dsl.cs.uchicago.edu/
> > > > > > > ============================================
> > > > > > > ============================================
> > > > > > > _______________________________________________
> > > > > > > Swift-devel mailing list
> > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > > >     
> > > > > > >         
> > > > > > >             
> > > > > -- 
> > > > > ============================================
> > > > > Ioan Raicu
> > > > > Ph.D. Student
> > > > > ============================================
> > > > > Distributed Systems Laboratory
> > > > > Computer Science Department
> > > > > University of Chicago
> > > > > 1100 E. 58th Street, Ryerson Hall
> > > > > Chicago, IL 60637
> > > > > ============================================
> > > > > Email: iraicu at cs.uchicago.edu
> > > > > Web:   http://www.cs.uchicago.edu/~iraicu
> > > > >        http://dsl.cs.uchicago.edu/
> > > > > ============================================
> > > > > ============================================
> > > > >     
> > > > >         
> > > > 
> > > >       
> > 
> > 
> >   
> 



