[Swift-devel] Request for control over throttle algorithm

Ian Foster foster at mcs.anl.gov
Mon Aug 27 16:21:12 CDT 2007


Yes, well put.

I appreciate how important throttling is in many circumstances, and the 
care and thought that has gone into its design.

It's just that "running with a single Falkon-controlled site" is not one 
of those circumstances where throttling is useful. It's a special case, 
certainly, but an important one at present.

Ian.

Michael Wilde wrote:
> [changing subject line to start a new thread]
>
> Mihael, all,
>
> I'm observing again that Karajan job throttling algorithms need more 
> discussion, design and testing, and that in the meantime - and perhaps 
> always - we need simple ways to override the algorithms and manually 
> control the throttle.
>
> This is true for throttling both successful and failing jobs.
>
> Right now MolDyn progress is being impeded by a situation where a 
> single bad cluster node (with stale FS file handles) has an unduly 
> negative impact on overall workflow performance.
>
> I feel that before we discuss and work on the nuances of throttling 
> algorithms (which will take some time to perfect) we should provide a 
> simple and reliable way for the user to override the default 
> heuristics and achieve good performance in situations that are 
> currently occurring.
>
> How much work would it take to provide a config parameter that causes
> failed jobs to be retried immediately, with no delay or scheduling
> penalty? I.e., let the user set the "failure penalty" ratio to reduce
> or eliminate the penalty for failures.
>
> It's possible that once we have this control, we'd need a few other
> parameters to make reasonable things happen when running on one or
> more Falkon sites.
>
> In tandem with this, Falkon will provide parameters to control what
> happens to a node after a failure:
> - a failure analyzer will attempt to recognize node failures as
> opposed to app failures (some of this may need to go into the Swift
> launcher, wrapper.sh)
> - on known node failures, Falkon will log the failure to bring it to
> the sysadmin's attention, and will also leave the node held
> - in the future, Falkon will add new nodes to compensate for nodes
> that it has disabled.
>
> I'd like to ask that we focus discussion on what is needed to design 
> and implement these basic changes, and whether they would solve the 
> current problems and be useful in general.
>
> - Mike
>
>
>
>
>
> Mihael Hategan wrote:
>> On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
>>> The question I am interested in is: can you modify the heuristic to take
>>> into account the execution time of tasks when updating the site score?
>>
>> I thought I mentioned I can.
>>
>>>   I think it is important that you use only the execution time (and not
>>> Falkon queue time + execution time + result delivery time); in this
>>> case, how does Falkon pass this information back to Swift?
>>
>> I thought I mentioned why that's not a good idea. Here's a short
>> version:
>> If Falkon is slow for some reason, that needs to be taken into account.
>> Excluding it from measurements under the assumption that it will always
>> be fast is not a particularly good idea. And if it is always fast then
>> it doesn't matter much since it won't add much overhead.
>>
>>> Ioan
>>>
>>> Mihael Hategan wrote:
>>>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>>>>  
>>>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>>>>>
>>>>>    
>>>>>> On a similar note, IMO, the heuristic in Karajan should be modified
>>>>>> to take into account the task execution time of the failed or
>>>>>> successful task, and not just the number of tasks.  This would ensure
>>>>>> that Swift is not throttling task submission to Falkon when there are
>>>>>> 1000s of successful tasks that take on the order of 100s of seconds
>>>>>> to complete, yet there are also 1000s of failed tasks that are only
>>>>>> 10 ms long.  This is exactly the case with MolDyn, when we get a bad
>>>>>> node in a bunch of 100s of nodes, which ends up throttling the number
>>>>>> of active and running tasks to about 100, regardless of the number of
>>>>>> processors Falkon has.
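>>>>>>
>>>>>> As a rough illustration of that proposal (this is not the actual
>>>>>> Karajan scheduler code), a duration-weighted score update would keep
>>>>>> thousands of near-instant failures from outweighing long-running
>>>>>> successes:
>>>>>>
>>>>>>    # Sketch only -- hypothetical, not Karajan's real heuristic.
>>>>>>    # Weight each update by the task's duration so that many 10 ms
>>>>>>    # failures cannot outweigh a few 100 s successes.
>>>>>>    def update_score(score, succeeded, duration_s, base_s=10.0):
>>>>>>        weight = duration_s / base_s
>>>>>>        return score + weight if succeeded else score - weight
>>>>>>
>>>>>>    score = 0.0
>>>>>>    for _ in range(1000):            # 1000 successful 100 s tasks
>>>>>>        score = update_score(score, True, 100.0)
>>>>>>    for _ in range(1000):            # 1000 failed 10 ms tasks
>>>>>>        score = update_score(score, False, 0.01)
>>>>>>    print(score)                     # still strongly positive
>>>>>>
>>>>>> With a count-only update, the same run would leave the score near
>>>>>> zero and trigger throttling.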
>>>>> Is that different from when submitting to PBS or GRAM where there 
>>>>> are 1000s of successful tasks taking 100s of seconds to complete 
>>>>> but with 1000s of failed tasks that are only 10ms long?
>>>>>     
>>>> In your scenario, assuming that GRAM and PBS do work (since some jobs
>>>> succeed), you can't really submit that fast. So the same thing would
>>>> happen, but slower. Unfortunately, in the PBS case, there's not much
>>>> that can be done except to throttle until no more jobs than there are
>>>> good nodes are running at any one time.
>>>>
>>>> Now, there is the probing part, which makes the system start with a
>>>> lower throttle that increases until problems appear. If this is
>>>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
>>>> will be submitted, causing a large number of failures.
>>>>
>>>> So this whole thing is close to a linear system with negative 
>>>> feedback.
>>>> If the initial state is very far away from stability, there will be
>>>> large transients. You're more than welcome to study how to make it
>>>> converge faster, or how to guess the initial state better (knowing the
>>>> number of nodes a cluster has would be a step).
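>>>>
>>>> As an analogy only (not the actual scheduler implementation), the
>>>> probing behaviour described above is similar to an additive-increase /
>>>> multiplicative-decrease loop: start with a small limit, raise it while
>>>> jobs succeed, and cut it back when failures appear:
>>>>
>>>>    # Hypothetical sketch of a probing throttle, not Karajan's code.
>>>>    def adjust_limit(limit, failures_seen, max_limit=256):
>>>>        if failures_seen:
>>>>            return max(1, limit // 2)        # back off hard on trouble
>>>>        return min(max_limit, limit + 4)     # otherwise probe upward
>>>>
>>>>    limit = 8                                # start well below capacity
>>>>    for failed in [False, False, False, True, False, False]:
>>>>        limit = adjust_limit(limit, failed)
>>>>        print("concurrent job limit:", limit)
>>>>
>>>> Starting far below (or far above) the cluster's real capacity is what
>>>> produces the large transients mentioned above.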
>>>>
>>>>  
>>>>
>>>>   
>>> -- 
>>> ============================================
>>> Ioan Raicu
>>> Ph.D. Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>        http://dsl.cs.uchicago.edu/
>>> ============================================
>>> ============================================
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.



