[Swift-devel] Re: Request for control over throttle algorithm
Mihael Hategan
hategan at mcs.anl.gov
Mon Aug 27 15:34:41 CDT 2007
On Mon, 2007-08-27 at 15:07 -0500, Michael Wilde wrote:
> [changing subject line to start a new thread]
>
> Mihael, all,
>
> I'm observing again that Karajan job throttling algorithms need more
> discussion, design and testing, and that in the meantime - and perhaps
> always - we need simple ways to override the algorithms and manually
> control the throttle.
Here's what happens:
1. somebody says "I don't like throttling because it decreases the
performance" (that's what throttles do, in order to make things not
fail)
2. we collectively conclude that we should disable throttling
3. there are options to change these limits in swift.properties (and one
in scheduler.xml, which I will also add to swift.properties), and they are
increased to "virtually off" values (I need to add an explicit "off" to
make things easier)
4. the workflows still don't work very well because there are lots of
failures now, and quality drops
5. throttles are set back to reasonable values
6. maybe some things are changed (e.g. GRAM -> Falkon), but
fundamentally the problems are the same (at different scales, though)
7. GOTO 1
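To make the feedback loop above concrete, here is a toy model of a score-based throttle, roughly in the spirit of what Karajan does: successes raise a site's score, failures lower it, and the score caps how many jobs run in parallel. The factor values and the score-to-throttle mapping are invented for illustration; they are not the actual scheduler constants.

```python
# Toy model of a score-based job throttle (illustrative numbers only;
# not the real Karajan/Swift scheduler constants).
def update_score(score, succeeded, success_factor=0.1, failure_factor=0.5):
    """Raise the site score on success, lower it on failure."""
    return score + success_factor if succeeded else score - failure_factor

def allowed_jobs(score, job_throttle=2.0):
    """Map the score to a cap on concurrently submitted jobs."""
    return max(1, int(job_throttle * (2 ** score)))

score = 0.0
for outcome in [True] * 10 + [False] * 4:  # a burst of successes, then failures
    score = update_score(score, outcome)
print(allowed_jobs(score))  # failures pull the cap back down
```

Disabling the throttle amounts to making `allowed_jobs` return a huge constant regardless of score, which is exactly why step 4 above follows: nothing pushes back when failures start.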
>
> This is true for throttling both successful and failing jobs.
>
> Right now MolDyn progress is being impeded by a situation where a single
> bad cluster node (with stale FS file handles) has an unduly negative
> impact on overall workflow performance.
Yes. And this is how things work. There are problems. It's a statement
of fact.
>
> I feel that before we discuss and work on the nuances of throttling
> algorithms (which will take some time to perfect) we should provide a
> simple and reliable way for the user to override the default heuristics
> and achieve good performance in situations that are currently occurring.
Groovy. Would the above (all throttling parameters in swift.properties
and the "off" option for each) work?
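For reference, the throttle settings in question would look roughly like this in swift.properties (the property names are from my reading of the Swift user guide of that era and should be double-checked there; the values are illustrative, not recommendations):

```properties
# swift.properties -- throttle-related settings (names per the Swift user
# guide; values here are illustrative, not recommendations)
throttle.submit=4            # max parallel job submissions overall
throttle.host.submit=2      # max parallel submissions per host
throttle.score.job.factor=4 # how strongly the site score caps parallel jobs
throttle.transfers=4        # max parallel file transfers
throttle.file.operations=8  # max parallel file operations
# the proposed explicit "off" value would disable a given throttle, e.g.:
# throttle.submit=off
```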
>
> How much work would it take to provide a config parameter that causes
> failed jobs to be retried immediately, with no delay or scheduling
> penalty? I.e., let the user set the "failure penalty" ratio to reduce or
> eliminate the penalty for failures.
I'd suggest simply not throttling on such things.
There could also be an option for tweaking the factors, but I have at
least a small aversion to having too many things in
swift.properties.
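A "failure penalty" knob of the kind Mike asks for could be sketched like this: a retry policy whose backoff is scaled by a user-settable factor, where a factor of 0 means failed jobs are retried immediately with no delay. The parameter name and structure here are hypothetical, not an existing Swift option.

```python
# Sketch: a retry policy with a tunable failure penalty (hypothetical
# parameter, not an existing Swift option). penalty_factor=0 means failed
# jobs are retried immediately with no scheduling delay.
import time

def retry_delay(attempt, penalty_factor=1.0, base_delay=5.0):
    """Exponential backoff in seconds, scaled by the penalty factor."""
    return penalty_factor * base_delay * (2 ** attempt)

def run_with_retries(task, retries=3, penalty_factor=1.0):
    """Run task(), retrying on exception up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(retry_delay(attempt, penalty_factor))
```

With `penalty_factor=0` every retry fires immediately, which is the behavior Mike wants when a known-bad node, not the app, caused the failure; the site-score penalty discussed above is a separate mechanism and would need its own switch.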
Mihael
>
> It's possible that once we have this control, we'd need a few other
> parameters to make reasonable things happen in the case of running on
> one or more Falkon sites.
>
> In tandem with this, Falkon will provide parameters to control what
> happens to a node after a failure:
> - a failure analyzer will attempt to recognize node failures as opposed
> to app failures (some of this may need to go into the Swift launcher,
> wrapper.sh)
> - on known node failures Falkon will log the failure to bring to
> sysadmin attention, and will also leave the node held
> - In the future Falkon will add new nodes to compensate for nodes that
> it has disabled.
>
> I'd like to ask that we focus discussion on what is needed to design and
> implement these basic changes, and whether they would solve the current
> problems and be useful in general.
>
> - Mike
>
>
>
>
>
> Mihael Hategan wrote:
> > On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
> >> The question I am interested in, can you modify the heuristic to take
> >> into account the execution time of tasks when updating the site score?
> >
> > I thought I mentioned I can.
> >
> >> I think it is important you use only the execution time (and not
> >> Falkon queue time + execution time + result delivery time); in this
> >> case, how does Falkon pass this information back to Swift?
> >
> > I thought I mentioned why that's not a good idea. Here's a short
> > version:
> > If Falkon is slow for some reason, that needs to be taken into account.
> > Excluding it from measurements under the assumption that it will always
> > be fast is not a particularly good idea. And if it is always fast then
> > it doesn't matter much since it won't add much overhead.
> >
> >> Ioan
> >>
> >> Mihael Hategan wrote:
> >>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
> >>>
> >>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
> >>>>
> >>>>
> >>>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
> >>>>> into account the task execution time of the failed or successful task, and not
> >>>>> just the number of tasks. This would ensure that Swift is not throttling task
> >>>>> submission to Falkon when there are 1000s of successful tasks that take on the
> >>>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
> >>>>> that are only 10 ms long. This is exactly the case with MolDyn, when we get a
> >>>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
> >>>>> active and running tasks to about 100, regardless of the number of processors
> >>>>> Falkon has.
> >>>>>
> >>>> Is that different from when submitting to PBS or GRAM where there are
> >>>> 1000s of successful tasks taking 100s of seconds to complete but with
> >>>> 1000s of failed tasks that are only 10ms long?
> >>>>
> >>> In your scenario, assuming that GRAM and PBS do work (since some jobs
> >>> succeed), then you can't really submit that fast. So the same thing
> >>> would happen, but more slowly. Unfortunately, in the PBS case, there's not
> >>> much that can be done but to throttle so that no more jobs than good
> >>> nodes are running at any one time.
> >>>
> >>> Now, there is the probing part, which makes the system start with a
> >>> lower throttle which increases until problems appear. If this is
> >>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
> >>> will be submitted causing a large number of failures.
> >>>
> >>> So this whole thing is close to a linear system with negative feedback.
> >>> If the initial state is very far away from stability, there will be
> >>> large transients. You're more than welcome to study how to make it
> >>> converge faster, or how to guess the initial state better (knowing the
> >>> number of nodes a cluster has would be a step).
> >>>
> >>>
> >>>
> >>>
> >>>
> >> --
> >> ============================================
> >> Ioan Raicu
> >> Ph.D. Student
> >> ============================================
> >> Distributed Systems Laboratory
> >> Computer Science Department
> >> University of Chicago
> >> 1100 E. 58th Street, Ryerson Hall
> >> Chicago, IL 60637
> >> ============================================
> >> Email: iraicu at cs.uchicago.edu
> >> Web: http://www.cs.uchicago.edu/~iraicu
> >> http://dsl.cs.uchicago.edu/
> >> ============================================
> >> ============================================
> >
> >
>