[Swift-devel] Re: Request for control over throttle algorithm
Michael Wilde
wilde at mcs.anl.gov
Mon Aug 27 16:15:20 CDT 2007
Mihael Hategan wrote:
> On Mon, 2007-08-27 at 15:07 -0500, Michael Wilde wrote:
>> [changing subject line to start a new thread]
>>
>> Mihael, all,
>>
>> I'm observing again that Karajan job throttling algorithms need more
>> discussion, design and testing, and that in the meantime - and perhaps
>> always - we need simple ways to override the algorithms and manually
>> control the throttle.
>
> Here's what happens:
> 1. somebody says "I don't like throttling because it decreases the
> performance" (that's what throttles do, in order to make things not
> fail)
No. What was said was: we are trying to get a workflow running for a
real science user - on whose success we depend. In the process of
doing that, the current obstacle to good performance is a
failure-retry behavior that is not working well.
> 2. we collectively conclude that we should disable throttling
Several of us believe that in *this* case it will enable the workflow to
*finally* succeed and will also yield better performance. Note that the
default settings do not even let the workflow complete successfully.
> 3. there are options to change those in swift.properties (and one in
> scheduler.xml which I will also add to swift.properties), and they are
> increased to "virtually off" numbers (I need to add an explicit "off" to
> make things easier)
This is great - just what we need. But I think Ioan can't find the prior
email in which you describe them, and I couldn't either. Could you
re-state what to set, please?
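For reference, here is my best guess at the relevant swift.properties
entries, based on what is in the user guide - please correct the names and
values if I have them wrong, and note that the explicit "off" value is the
one you said still needs to be added:

   # score-based job throttle - the one that penalizes failing sites
   throttle.score.job.factor=off
   # caps on simultaneously submitted jobs, global and per host
   throttle.submit=off
   throttle.host.submit=off
   # transfer and file-operation throttles (I believe the defaults are
   # 4 and 8 - we probably don't need to touch these)
   throttle.transfers=4
   throttle.file.operations=8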
> 4. the workflows still don't work very well because there are lots of
> failures now, and quality drops
That would be a different scenario. In this case, Ioan will try to take
the offending node(s) out of service as seen by Falkon.
> 5. throttles are set back to reasonable values
Yes, that's the goal. I believe that automated failure handling is
difficult and takes a while - lots of design, measurement, testing, and
improvement - before it works well. Certainly the Internet and TCP/IP
teach us that. Critical, necessary, but a long road.
> 6. maybe some things are changed (i.e. gram -> falkon), but
> fundamentally the problems are the same (different scales though)
> 7. GOTO 1
Yes, as often as needed. It's iteration, but not endless, if done
thoughtfully.
>
>> This is true for throttling both successful and failing jobs.
I agree.
>>
>> Right now MolDyn progress is being impeded by a situation where a single
>> bad cluster node (with stale FS file handles) has an unduly negative
>> impact on overall workflow performance.
>
> Yes. And this is how things work. There are problems. It's a statement
> of fact.
>
>> I feel that before we discuss and work on the nuances of throttling
>> algorithms (which will take some time to perfect) we should provide a
>> simple and reliable way for the user to override the default heuristics
>> and achieve good performance in situations that are currently occurring.
>
> Groovy. Would the above (all throttling parameters in swift.properties
> and the "off" option for each) work?
Yes, I think so - again, please (re)reiterate what they are. :)
>
>> How much work would it take to provide a config parameter that causes
>> failed jobs to get retried immediately with no delay or scheduling
>> penalty? I.e., let the user set the "failure penalty" ratio to reduce or
>> eliminate the penalty for failures.
>
> I'd suggest simply not throttling on such things.
Agreed. Cool.
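To make sure we mean the same thing by "failure penalty ratio", here is a
toy sketch of the kind of score bookkeeping I am picturing - it is not the
real Karajan scheduler code, and all the names are made up:

   // Toy sketch only - not the actual Karajan scheduler code.
   public class SiteScore {
       private double score = 1.0;          // higher score => more concurrent jobs allowed
       private final double successFactor;  // reward added per successful job
       private final double failureFactor;  // penalty per failed job; 0 = failures cost nothing

       public SiteScore(double successFactor, double failureFactor) {
           this.successFactor = successFactor;
           this.failureFactor = failureFactor;
       }

       public void jobCompleted(boolean succeeded) {
           score += succeeded ? successFactor : -failureFactor;
           score = Math.max(score, 0.01);   // never throttle a site to a dead stop
       }

       public int allowedConcurrentJobs(int jobThrottleFactor) {
           // rough mapping from score to a per-site submission limit
           return (int) Math.ceil(score * jobThrottleFactor);
       }
   }

With failureFactor set to zero, failed jobs would get resubmitted with no
scheduling penalty, which is all I am asking for as a manual override.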
>
> There can also be an option for tweaking the factors, but I have at
> least one small aversion towards having too many things in
> swift.properties.
Sounds reasonable. Let's start with the basics.
Now, having said all this - perhaps Ioan can catch and retry the failure
entirely in Falkon. Is wrapper.sh capable of being re-run on a different
node of the same cluster? (If not, I think we can enhance it to be.)
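To sketch what I am imagining on the Falkon side - every type and method
name below is invented just to show the control flow, none of it is real
Falkon or Swift API:

   import java.util.HashSet;
   import java.util.Set;

   public class RetryOnOtherNode {

       interface NodePool {
           String pickNodeExcluding(Set<String> excluded); // pick a free node not in the excluded set
           Result execute(String node, String command);    // run wrapper.sh (or anything) on that node
           void holdNode(String node);                     // take the node out of service, leave it held
       }

       static class Result {
           boolean succeeded;
           boolean looksLikeNodeFailure; // e.g. stale NFS handle rather than an application error
       }

       static Result runWithRetry(NodePool pool, String command, int maxAttempts) {
           Set<String> badNodes = new HashSet<String>();
           Result r = null;
           for (int attempt = 0; attempt < maxAttempts; attempt++) {
               String node = pool.pickNodeExcluding(badNodes);
               r = pool.execute(node, command);
               if (r.succeeded) {
                   return r;                 // done - Swift never sees the bad node
               }
               if (r.looksLikeNodeFailure) {
                   badNodes.add(node);       // don't try this node again
                   pool.holdNode(node);      // flag it for sysadmin attention
               } else {
                   return r;                 // genuine application failure - pass it back to Swift
               }
           }
           return r;                         // ran out of healthy nodes; report the last failure upstream
       }
   }

That would keep the bad-node handling entirely inside Falkon, so Swift only
hears about genuine application failures.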
Thanks,
Mike
>
> Mihael
>
>> Its possible that once we have this control, we'd need a few other
>> parameters to make reasonable things happen in the case of running on
>> one or more Falkon sites.
>>
>> In tandem with this, Falkon will provide parameters to control what
>> happens to a node after a failure:
>> - a failure analyzer will attempt to recognize node failures as opposed
>> to app failures (some of this may need to go into the Swift launcher,
>> wrapper.sh)
>> - on known node failures Falkon will log the failure to bring to
>> sysadmin attention, and will also leave the node held
>> - In the future Falkon will add new nodes to compensate for nodes that
>> it has disabled.
>>
>> I'd like to ask that we focus discussion on what is needed to design and
>> implement these basic changes, and whether they would solve the current
>> problems and be useful in general.
>>
>> - Mike
>>
>>
>>
>>
>>
>> Mihael Hategan wrote:
>>> On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
>>>> The question I am interested in is: can you modify the heuristic to take
>>>> into account the execution time of tasks when updating the site score?
>>> I thought I mentioned I can.
>>>
>>>> I think it is important you use only the execution time (and not
>>>> Falkon queue time + execution time + result delivery time); in this
>>>> case, how does Falkon pass this information back to Swift?
>>> I thought I mentioned why that's not a good idea. Here's a short
>>> version:
>>> If Falkon is slow for some reason, that needs to be taken into account.
>>> Excluding it from measurements under the assumption that it will always
>>> be fast is not a particularly good idea. And if it is always fast then
>>> it doesn't matter much since it won't add much overhead.
>>>
>>>> Ioan
>>>>
>>>> Mihael Hategan wrote:
>>>>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>>>>>
>>>>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>>>>>>
>>>>>>
>>>>>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
>>>>>>> into account the task execution time of the failed or successful task, and not
>>>>>>> just the number of tasks. This would ensure that Swift is not throttling task
>>>>>>> submission to Falkon when there are 1000s of successful tasks that take on the
>>>>>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
>>>>>>> that are only 10 ms long. This is exactly the case with MolDyn, when we get a
>>>>>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
>>>>>>> active and running tasks to about 100, regardless of the number of processors
>>>>>>> Falkon has.
>>>>>>>
>>>>>> Is that different from when submitting to PBS or GRAM where there are
>>>>>> 1000s of successful tasks taking 100s of seconds to complete but with
>>>>>> 1000s of failed tasks that are only 10ms long?
>>>>>>
>>>>> In your scenario, assuming that GRAM and PBS do work (since some jobs
>>>>> succeed), then you can't really submit that fast. So the same thing
>>>>> would happen, but slower. Unfortunately, in the PBS case, there's not
>>>>> much that can be done but to throttle until no more jobs than good nodes
>>>>> are being run at one time.
>>>>>
>>>>> Now, there is the probing part, which makes the system start with a
>>>>> lower throttle which increases until problems appear. If this is
>>>>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
>>>>> will be submitted causing a large number of failures.
>>>>>
>>>>> So this whole thing is close to a linear system with negative feedback.
>>>>> If the initial state is very far away from stability, there will be
>>>>> large transients. You're more than welcome to study how to make it
>>>>> converge faster, or how to guess the initial state better (knowing the
>>>>> number of nodes a cluster has would be a step).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>