[Swift-devel] Re: 244 MolDyn run was successful!

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 27 13:25:30 CDT 2007


The question I am interested in is whether you can modify the heuristic
to take into account the execution time of tasks when updating the site
score. I think it is important that you use only the execution time (and
not Falkon queue time + execution time + result delivery time); in that
case, how does Falkon pass this information back to Swift?
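
To make the idea concrete, here is a rough sketch of a score update that
is weighted by execution time. It is only an illustration: SiteScore, its
weights, and count_based_score are hypothetical names, not Karajan's
actual scheduler code, and the +1/-1 baseline is an assumption about how a
purely count-based heuristic might behave.

```python
from dataclasses import dataclass


@dataclass
class SiteScore:
    """Hypothetical per-site score; not Karajan's actual scheduler class."""
    score: float = 0.0
    success_weight: float = 1.0  # assumed reward per second of successful work
    failure_weight: float = 1.0  # assumed penalty per second of failed work

    def update(self, succeeded: bool, exec_seconds: float) -> float:
        """Adjust the score in proportion to execution time only.

        exec_seconds is meant to exclude queue time and result delivery time.
        """
        if succeeded:
            self.score += self.success_weight * exec_seconds
        else:
            self.score -= self.failure_weight * exec_seconds
        return self.score


def count_based_score(outcomes):
    """Assumed baseline for comparison: +1 per success, -1 per failure."""
    return sum(1 if ok else -1 for ok, _ in outcomes)


# The MolDyn-like case: 1000 successes of about 100 s each
# and 1000 failures of about 10 ms each.
outcomes = [(True, 100.0)] * 1000 + [(False, 0.010)] * 1000

site = SiteScore()
for ok, seconds in outcomes:
    site.update(ok, seconds)

print(count_based_score(outcomes))  # successes and failures cancel out
print(round(site.score, 3))         # dominated by the long successful tasks
```

With counts alone the fast failures offset the long successes one for one;
with time weighting they barely move the score.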

Ioan

Mihael Hategan wrote:
> On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
>   
>> On Mon, 27 Aug 2007, Ioan Raicu wrote:
>>
>>     
>>> On a similar note, IMO, the heuristic in Karajan should be modified to take
>>> into account the task execution time of the failed or successful task, and not
>>> just the number of tasks.  This would ensure that Swift is not throttling task
>>> submission to Falkon when there are 1000s of successful tasks that take on the
>>> order of 100s of seconds to complete, yet there are also 1000s of failed tasks
>>> that are only 10 ms long.  This is exactly the case with MolDyn, when we get a
>>> bad node in a bunch of 100s of nodes, which ends up throttling the number of
>>> active and running tasks to about 100, regardless of the number of processors
>>> Falkon has. 
>>>       
>> Is that different from when submitting to PBS or GRAM where there are 
>> 1000s of successful tasks taking 100s of seconds to complete but with 
>> 1000s of failed tasks that are only 10ms long?
>>     
>
> In your scenario, assuming that GRAM and PBS do work (since some jobs
> succeed), then you can't really submit that fast. So the same thing
> would happen, but slower. Unfortunately, in the PBS case, there's not
> much that can be done but to throttle until no more jobs than good nodes
> are being run at one time.
>
> Now, there is the probing part, which makes the system start with a
> lower throttle which increases until problems appear. If this is
> disabled (as it was in the MolDyn run), large numbers of parallel jobs
> will be submitted, causing a large number of failures.
>
> So this whole thing is close to a linear system with negative feedback.
> If the initial state is very far away from stability, there will be
> large transients. You're more than welcome to study how to make it
> converge faster, or how to guess the initial state better (knowing the
> number of nodes a cluster has would be a step).
>   
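
The negative-feedback behaviour described above can be illustrated with a
toy simulation. This is a sketch under strong assumptions, not Karajan's
scheduler: simulate_throttle, the gain, and the probing step of one job
per round are all made up for illustration, and the model simply assumes
that every job submitted beyond the number of good nodes fails.

```python
def simulate_throttle(initial_throttle, good_nodes, gain=0.5, rounds=150):
    """Toy throttle loop (hypothetical model, not Karajan's implementation).

    Each round submits int(throttle) jobs. Jobs beyond good_nodes are
    assumed to fail. Failures reduce the throttle in proportion to their
    number (negative feedback); a clean round raises it by one (probing).
    Returns a list of (submitted, failures) pairs, one per round.
    """
    throttle = float(initial_throttle)
    history = []
    for _ in range(rounds):
        submitted = max(1, int(throttle))
        failures = max(0, submitted - good_nodes)
        history.append((submitted, failures))
        if failures:
            throttle -= gain * failures  # back off in proportion to failures
        else:
            throttle += 1.0              # probe upward slowly
    return history


def total_failures(history):
    """Sum the failures over all rounds of a simulation."""
    return sum(failures for _, failures in history)


# Starting far above the stable point (probing disabled) versus
# starting low and probing upward, for a site with 100 good nodes.
far_start = simulate_throttle(initial_throttle=1000, good_nodes=100)
probing_start = simulate_throttle(initial_throttle=1, good_nodes=100)

print(far_start[:5])                  # large transient of failures early on
print(total_failures(far_start))
print(total_failures(probing_start))
```

Both runs settle near the number of good nodes, but the run that starts
far from the stable point pays for it with many more failed jobs, which is
why a better guess of the initial state (for example, the node count)
would help.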

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
