[Swift-devel] Re: 244 MolDyn run was successful!

Mon Aug 27 14:40:41 CDT 2007

It's still not clear to me why Karajan is throttling at all when working 
with Falkon. I've asked this question before, and I don't recall 
receiving a satisfactory answer. So far at least, this behavior has just 
created problems for us. Can we turn it off?

Ian.
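
For reference, the scenario in the quoted thread below (1000s of successful tasks of roughly 100 s each, alongside 1000s of failed tasks of roughly 10 ms each) can be sketched numerically. This is only an illustration under assumptions: the function names and the +1/-1 scoring rule are hypothetical and are not taken from Karajan, Swift, or Falkon. It also assumes the per-task execution time is available to the scorer, which is exactly the open question in the thread.

```python
# Hypothetical sketch: none of these names come from Karajan, Swift, or Falkon.

def count_based_score(tasks):
    """Score a site by task count: +1 per success, -1 per failure (assumed rule)."""
    return sum(1.0 if ok else -1.0 for ok, _ in tasks)


def time_weighted_score(tasks):
    """Score a site by execution time: each task counts for its run time in seconds."""
    return sum(secs if ok else -secs for ok, secs in tasks)


# MolDyn-like scenario from the thread: (succeeded, execution_seconds) pairs.
tasks = [(True, 100.0)] * 1000 + [(False, 0.01)] * 1000

print("count-based score:  ", count_based_score(tasks))
print("time-weighted score:", time_weighted_score(tasks))
```

Under these assumptions the count-based score nets out to zero, because each fast failure cancels one long success. The time-weighted score stays strongly positive, because all 1000 failures together account for only about 10 seconds of execution time.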

Ioan Raicu wrote:
> The question I am interested in is this: can you modify the heuristic
> to take into account the execution time of tasks when updating the site
> score?  I think it is important that you use only the execution time
> (and not Falkon queue time + execution time + result delivery time); in
> that case, how does Falkon pass this information back to Swift?
>
> Ioan
>
> Mihael Hategan wrote:
>> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>>   
>>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>>>
>>>     
>>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
>>>> into account the task execution time of the failed or successful task, and not
>>>> just the number of tasks.  This would ensure that Swift is not throttling task
>>>> submission to Falkon when there are 1000s of successful tasks that take on the
>>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
>>>> that are only 10 ms long.  This is exactly the case with MolDyn, when we get a
>>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
>>>> active and running tasks to about 100, regardless of the number of processors
>>>> Falkon has. 
>>>>       
>>> Is that different from when submitting to PBS or GRAM where there are 
>>> 1000s of successful tasks taking 100s of seconds to complete but with 
>>> 1000s of failed tasks that are only 10ms long?
>>>     
>>
>> In your scenario, assuming that GRAM and PBS do work (since some jobs
>> succeed), then you can't really submit that fast. So the same thing
>> would happen, but slower. Unfortunately, in the PBS case, there's not
>> much that can be done but to throttle until no more jobs than good nodes
>> are being run at one time.
>>
>> Now, there is the probing part, which makes the system start with a
>> lower throttle which increases until problems appear. If this is
>> disabled (as it was in the MolDyn run), large numbers of parallel jobs
>> will be submitted, causing a large number of failures.
>>
>> So this whole thing is close to a linear system with negative feedback.
>> If the initial state is very far away from stability, there will be
>> large transients. You're more than welcome to study how to make it
>> converge faster, or how to guess the initial state better (knowing the
>> number of nodes a cluster has would be a step).
>>
>
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================

-- 

   Ian Foster, Director, Computation Institute
Argonne National Laboratory & University of Chicago
Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
      Globus Alliance: www.globus.org.
