[Swift-devel] Re: Request for control over throttle algorithm

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 27 16:55:33 CDT 2007


On Mon, 2007-08-27 at 16:15 -0500, Michael Wilde wrote:
> Mihael Hategan wrote:
> > On Mon, 2007-08-27 at 15:07 -0500, Michael Wilde wrote:
> >> [changing subject line to start a new thread]
> >>
> >> Mihael, all,
> >>
> >> I'm observing again that Karajan job throttling algorithms need more 
> >> discussion, design and testing, and that in the meantime - and perhaps 
> >> always - we need simple ways to override the algorithms and manually 
> >> control the throttle.
> > 
> > Here's what happens:
> > 1. somebody says "I don't like throttling because it decreases the
> > performance" (that's what throttles do, in order to make things not
> > fail)
> 
> No. What was said was: We are trying to get a workflow running for a 
> real science user - on whose success we depend.  And in the process 
> of doing that, the current obstacle to good performance is a 
> failure-retry behavior that is not working well.

I think the problem is a misunderstanding about what "good performance"
means given the current assumptions. Sub-systems begin to break as more
and more performance is requested from them, causing chained failures.
Throttling tries to achieve that balance between what's too much and
what's too little. Yong and I played with some of the numbers. And some
of those approximate that balance. But not everybody is convinced, it
seems. Which is fine. The reaction, however, has always been "let's
disable throttling". Which is also fine. But only once.

> 
> > 2. we collectively conclude that we should disable throttling
> 
> Several of us believe that in *this* case it will enable the workflow to 
> *finally* succeed and will also yield better performance.

I think that's wrong, and I think the long discussions on the MolDyn
run topic explain why. In short, lack of throttling will cause large
numbers of failures. But beyond that, please, I'm not trying to stop
anybody from disabling these. I've mentioned how to do it, and if there
are further questions, I'm happy to answer them.

>   Note that the 
> default settings do not even let the workflow complete successfully.

That's only correlation. I highly doubt the throttles are the reason the
workflow didn't complete.

> 
> > 3. there are options to change those in swift.properties (and one in
> > scheduler.xml which I will also add to swift.properties), and they are
> > increased to "virtually off" numbers (I need to add an explicit "off" to
> > make things easier)
> 
> This is great - just what we need.  But I think Ioan can't find the prior 
> email in which you describe them, and I couldn't either.  Could you 
> re-state what to set, please?

Set all of the throttle.* properties to, say, 100000021, and in
libexec/scheduler.xml set <property name="jobThrottle" value="100000021"/>.
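
Concretely, that would look something like the following (the property
names below are the ones from the default swift.properties; please
double-check them against your copy):

    # swift.properties: set every throttle to a "virtually off" value
    throttle.submit=100000021
    throttle.host.submit=100000021
    throttle.score.job.factor=100000021
    throttle.transfers=100000021
    throttle.file.operations=100000021

    <!-- libexec/scheduler.xml: raise the per-site job throttle as well -->
    <property name="jobThrottle" value="100000021"/>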

> 
> > 4. the workflows still don't work very well because there are lots of
> > failures now, and quality drops
> 
> That would be a different scenario.  In this case, Ioan will try to take 
> the offending node(s) out of service as seen by Falkon.

Right. Which is a particular case of throttling done because there's
better info available (i.e. set throttle to 0 on bad nodes).
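
If it helps to see the idea, here is a rough sketch of that kind of
per-node throttling (the threshold and the API are made up for
illustration; this is not Falkon's or Karajan's actual code):

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Toy model of "set throttle to 0 on bad nodes": count failures per
     * node and stop dispatching to nodes that keep failing.
     * Illustrative only.
     */
    public class NodeThrottle {
        private static final int MAX_FAILURES = 3;   // assumed threshold
        private final Map<String, Integer> failures = new HashMap<String, Integer>();

        public void recordFailure(String node) {
            Integer n = failures.get(node);
            failures.put(node, n == null ? 1 : n + 1);
        }

        public void recordSuccess(String node) {
            failures.remove(node);                    // node looks healthy again
        }

        // a "bad" node effectively has a throttle of 0: no new jobs go there
        public boolean canDispatchTo(String node) {
            Integer n = failures.get(node);
            return n == null || n < MAX_FAILURES;
        }
    }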

> 
> > 5. throttles are set back to reasonable values
> 
> Yes, that's the goal.  I believe that automated failure handling is 
> difficult and takes a while - lots of design, measurement, testing and 
> improvement - before it works well.  Certainly the internet and TCP/IP 
> teach us that.  Critical, necessary, but a long road.
> 
> > 6. maybe some things are changed (i.e. gram -> falkon), but
> > fundamentally the problems are the same (different scales though)
> > 7. GOTO 1
> 
> Yes, as often as needed. It's iteration, but not endless, if done 
> thoughtfully.

Only if there's any learning. But the conflict between what we think is
achievable and what we can actually achieve seems to remain. That's pretty
much the problem: instead of trying to reconcile the two, we keep saying
that the other side is wrong, and then either the other side fails to
provide proof to the contrary or we refuse to listen to it (or don't care)
because we *know* we are right.

Pretty much like the feedback system described a while ago, but this one
has a very low damping factor.
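
To put the analogy in concrete terms, here is a toy sketch of a
score-based throttle with negative feedback (the constants, the gain, and
the score-to-concurrency mapping are invented for illustration; this is
not the Karajan scheduler's code). Too high a gain, i.e. too little
damping, and the allowed concurrency keeps overshooting and oscillating
instead of settling near what the site can sustain:

    /**
     * Toy negative-feedback throttle: successes raise a score, failures
     * lower it, and the allowed concurrency follows the score.
     * Illustrative only.
     */
    public class FeedbackThrottle {
        private double score = 1.0;   // assumed starting score
        private final double gain;    // how strongly each event moves the score

        public FeedbackThrottle(double gain) {
            this.gain = gain;
        }

        public void jobSucceeded() { adjust(0.1); }
        public void jobFailed()    { adjust(-1.0); }  // failures weigh more

        private void adjust(double delta) {
            score = Math.max(0.1, score + gain * delta);
        }

        public int allowedParallelJobs() {
            return (int) Math.ceil(score * 10);  // hypothetical mapping
        }
    }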

Mihael

> 
> > 
> >> This is true for throttling both successful and failing jobs.
> 
> I agree.
> 
> >>
> >> Right now MolDyn progress is being impeded by a situation where a single 
> >> bad cluster node (with stale FS file handles) has an unduly negative 
> >> impact on overall workflow performance.
> > 
> > Yes. And this is how things work. There are problems. It's a statement
> > of fact.
> > 
> >> I feel that before we discuss and work on the nuances of throttling 
> >> algorithms (which will take some time to perfect) we should provide a 
> >> simple and reliable way for the user to override the default heuristics 
> >> and achieve good performance in situations that are currently occurring.
> > 
> > Groovy. Would the above (all throttling parameters in swift.properties
> > and the "off" option for each) work?
> 
> Yes, I think so - again, please (re)reiterate what they are. :)
> 
> > 
> >> How much work would it take to provide a config parameter that causes 
> >> failed jobs to get retried immediately with no delay or scheduling 
> >> penalty? I.e., let the user set the "failure penalty" ratio to reduce or 
> >> eliminate the penalty for failures.
> > 
> > I'd suggest simply not throttling on such things. 
> 
> Agreed.  Cool.
> 
> > 
> > There can also be an option for tweaking the factors, but I have at
> > least a small aversion to having too many things in
> > swift.properties.
> 
> Sounds reasonable. Let's start with the basics.
> 
> Now, having said all this - perhaps Ioan can catch and retry the failure 
> all in Falkon.  Is wrapper.sh capable of getting re-run on a different 
> node of the same cluster?  (If not, I think we can enhance it to be.)
> 
> Thanks,
> 
> Mike
> 
> > 
> > Mihael
> > 
> >> It's possible that once we have this control, we'd need a few other 
> >> parameters to make reasonable things happen in the case of running on 
> >> one or more Falkon sites.
> >>
> >> In tandem with this, Falkon will provide parameters to control what 
> >> happens to a node after a failure:
> >> - a failure analyzer will attempt to recognize node failures as opposed 
> >> to app failures (some of this may need to go into the Swift launcher, 
> >> wrapper.sh)
> >> - on known node failures Falkon will log the failure to bring to 
> >> sysadmin attention, and will also leave the node held
> >> - In the future Falkon will add new nodes to compensate for nodes that 
> >> it has disabled.
> >>
> >> I'd like to ask that we focus discussion on what is needed to design and 
> >> implement these basic changes, and whether they would solve the current 
> >> problems and be useful in general.
> >>
> >> - Mike
> >>
> >>
> >>
> >>
> >>
> >> Mihael Hategan wrote:
> >>> On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
> >>>> The question I am interested in, can you modify the heuristic to take
> >>>> into account the execution time of tasks when updating the site score?
> >>> I thought I mentioned I can.
> >>>
> >>>>   I think it is important you use only the execution time (and not
> >>>> Falkon queue time + execution time + result delivery time); in this
> >>>> case, how does Falkon pass this information back to Swift?
> >>> I thought I mentioned why that's not a good idea. Here's a short
> >>> version:
> >>> If Falkon is slow for some reason, that needs to be taken into account.
> >>> Excluding it from measurements under the assumption that it will always
> >>> be fast is not a particularly good idea. And if it is always fast then
> >>> it doesn't matter much since it won't add much overhead.
> >>>
> >>>> Ioan
> >>>>
> >>>> Mihael Hategan wrote: 
> >>>>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
> >>>>>   
> >>>>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
> >>>>>>
> >>>>>>     
> >>>>>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
> >>>>>>> into account the task execution time of the failed or successful task, and not
> >>>>>>> just the number of tasks.  This would ensure that Swift is not throttling task
> >>>>>>> submission to Falkon when there are 1000s of successful tasks that take on the
> >>>>>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
> >>>>>>> that are only 10 ms long.  This is exactly the case with MolDyn, when we get a
> >>>>>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
> >>>>>>> active and running tasks to about 100, regardless of the number of processors
> >>>>>>> Falkon has. 
> >>>>>>>       
> >>>>>> Is that different from when submitting to PBS or GRAM where there are 
> >>>>>> 1000s of successful tasks taking 100s of seconds to complete but with 
> >>>>>> 1000s of failed tasks that are only 10ms long?
> >>>>>>     
> >>>>> In your scenario, assuming that GRAM and PBS do work (since some jobs
> >>>>> succeed), you can't really submit that fast. So the same thing
> >>>>> would happen, but slower. Unfortunately, in the PBS case, there's not
> >>>>> much that can be done but to throttle until no more jobs than good nodes
> >>>>> are being run at one time.
> >>>>>
> >>>>> Now, there is the probing part, which makes the system start with a
> >>>>> lower throttle which increases until problems appear. If this is
> >>>>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
> >>>>> will be submitted causing a large number of failures.
> >>>>>
> >>>>> So this whole thing is close to a linear system with negative feedback.
> >>>>> If the initial state is very far away from stability, there will be
> >>>>> large transients. You're more than welcome to study how to make it
> >>>>> converge faster, or how to guess the initial state better (knowing the
> >>>>> number of nodes a cluster has would be a step).
> >>>>>
> >>>>>   
> >>>>>
> >>>>>
> >>>>>   
> >>>> -- 
> >>>> ============================================
> >>>> Ioan Raicu
> >>>> Ph.D. Student
> >>>> ============================================
> >>>> Distributed Systems Laboratory
> >>>> Computer Science Department
> >>>> University of Chicago
> >>>> 1100 E. 58th Street, Ryerson Hall
> >>>> Chicago, IL 60637
> >>>> ============================================
> >>>> Email: iraicu at cs.uchicago.edu
> >>>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>>        http://dsl.cs.uchicago.edu/
> >>>> ============================================
> >>>> ============================================
> >>>
> > 
> > 
> 



