[Swift-devel] Request for control over throttle algorithm

Michael Wilde wilde at mcs.anl.gov
Mon Aug 27 15:07:38 CDT 2007


[changing subject line to start a new thread]

Mihael, all,

I'm observing again that the Karajan job-throttling algorithms need more 
discussion, design, and testing, and that in the meantime - and perhaps 
always - we need simple ways to override the algorithms and control the 
throttle manually.

This is true for throttling both successful and failing jobs.

Right now MolDyn progress is being impeded by a situation where a single 
bad cluster node (with stale FS file handles) has an unduly negative 
impact on overall workflow performance.

I feel that before we discuss and work on the nuances of throttling 
algorithms (which will take some time to perfect), we should provide a 
simple and reliable way for the user to override the default heuristics 
and get good performance in situations like the one we are hitting now.

How much work would it take to provide a config parameter that causes 
failed jobs to be retried immediately, with no delay or scheduling 
penalty? I.e., let the user set the "failure penalty" ratio to reduce or 
eliminate the penalty for failures.
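
To make this concrete, here is a rough sketch of what such a knob could 
look like inside a site scorer. The class name, the constants, and the 
"failure penalty ratio" parameter are assumptions for illustration only, 
not existing Swift/Karajan code:

// Rough sketch: a user-settable failure penalty ratio in a site scorer.
// Names and constants are illustrative, not actual Swift/Karajan code.
public class SiteScore {
    private static final double SUCCESS_REWARD = 0.1;
    private static final double FAILURE_PENALTY = 0.5;
    private static final double MIN_SCORE = 0.01;

    private double score = 1.0;
    // 1.0 = full penalty; 0.0 = failures carry no scheduling penalty at all
    private final double failurePenaltyRatio;

    public SiteScore(double failurePenaltyRatio) {
        this.failurePenaltyRatio = failurePenaltyRatio;
    }

    public void jobSucceeded() {
        score += SUCCESS_REWARD;
    }

    public void jobFailed() {
        // Scale the penalty by the user-supplied ratio; at 0 the failed job
        // can be retried immediately without dragging the site's score down.
        score -= FAILURE_PENALTY * failurePenaltyRatio;
        if (score < MIN_SCORE) {
            score = MIN_SCORE;
        }
    }

    public double getScore() {
        return score;
    }
}

If something like this were exposed as a config parameter, a user hitting 
the stale-file-handle case could set the ratio to 0 and keep the workflow 
moving while the real heuristic work proceeds.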

It's possible that once we have this control, we'd need a few other 
parameters to make reasonable things happen when running on one or more 
Falkon sites.

In tandem with this, Falkon will provide parameters to control what 
happens to a node after a failure (a rough sketch follows this list):
- a failure analyzer will attempt to recognize node failures as opposed 
to app failures (some of this may need to go into the Swift launcher, 
wrapper.sh)
- on known node failures, Falkon will log the failure to bring it to the 
sysadmin's attention, and will also leave the node held
- in the future, Falkon will add new nodes to compensate for nodes that 
it has disabled.
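
Here is a rough sketch of how the first two items might fit together, as 
a starting point for discussion rather than a spec. The NodeManager 
interface, its methods, and the matched error strings are placeholders, 
not actual Falkon or wrapper.sh interfaces:

// Rough sketch of node-vs-application failure handling in Falkon.
// NodeManager, its methods, and the error strings are placeholders.
public class FailureAnalyzer {

    public enum FailureKind { NODE, APPLICATION }

    /** Minimal stand-in for however Falkon manages its provisioned nodes. */
    public interface NodeManager {
        void logForSysadmin(String nodeId);  // record the failure for operators
        void hold(String nodeId);            // leave the bad node held
        void requestReplacement();           // future: add a node to compensate
    }

    /** Classify a failure from whatever diagnostics the wrapper reports. */
    public FailureKind classify(String stderr) {
        // Stale file handles point at a sick node, not at the application.
        if (stderr.contains("Stale NFS file handle")
                || stderr.contains("Stale file handle")) {
            return FailureKind.NODE;
        }
        return FailureKind.APPLICATION;
    }

    public void handle(String nodeId, FailureKind kind, NodeManager nodes) {
        if (kind == FailureKind.NODE) {
            nodes.logForSysadmin(nodeId);
            nodes.hold(nodeId);
            nodes.requestReplacement();
        }
        // Application failures fall through to the normal retry path.
    }
}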

I'd like to ask that we focus discussion on what is needed to design and 
implement these basic changes, and whether they would solve the current 
problems and be useful in general.

- Mike





Mihael Hategan wrote:
> On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
>> The question I am interested in, can you modify the heuristic to take
>> into account the execution time of tasks when updating the site score?
> 
> I thought I mentioned I can.
> 
>>   I think it is important you use only the execution time (and not
>> Falkon queue time + execution time + result delivery time); in this
>> case, how does Falkon pass this information back to Swift?
> 
> I thought I mentioned why that's not a good idea. Here's a short
> version:
> If Falkon is slow for some reason, that needs to be taken into account.
> Excluding it from measurements under the assumption that it will always
> be fast is not a particularly good idea. And if it is always fast then
> it doesn't matter much since it won't add much overhead.
> 
>> Ioan
>>
>> Mihael Hategan wrote: 
>>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>>>   
>>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>>>>
>>>>     
>>>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
>>>>> into account the task execution time of the failed or successful task, and not
>>>>> just the number of tasks.  This would ensure that Swift is not throttling task
>>>>> submission to Falkon when there are 1000s of successful tasks that take on the
>>>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
>>>>> that are only 10 ms long.  This is exactly the case with MolDyn, when we get a
>>>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
>>>>> active and running tasks to about 100, regardless of the number of processors
>>>>> Falkon has. 
>>>>>       
>>>> Is that different from when submitting to PBS or GRAM where there are 
>>>> 1000s of successful tasks taking 100s of seconds to complete but with 
>>>> 1000s of failed tasks that are only 10ms long?
>>>>     
>>> In your scenario, assuming that GRAM and PBS do work (since some jobs
>>> succeed), then you can't really submit that fast. So the same thing
>>> would happen, but slower. Unfortunately, in the PBS case, there's not
>>> much that can be done but to throttle until no more jobs than good nodes
>>> are being run at one time.
>>>
>>> Now, there is the probing part, which makes the system start with a
>>> lower throttle which increases until problems appear. If this is
>>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
>>> will be submitted causing a large number of failures.
>>>
>>> So this whole thing is close to a linear system with negative feedback.
>>> If the initial state is very far away from stability, there will be
>>> large transients. You're more than welcome to study how to make it
>>> converge faster, or how to guess the initial state better (knowing the
>>> number of nodes a cluster has would be a step).
>>>
>>>   
>>>
>>>
>>>   
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================
> 
> 


