[Swift-devel] Re: 244 MolDyn run was successful!

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 27 13:47:46 CDT 2007


On Mon, 2007-08-27 at 13:25 -0500, Ioan Raicu wrote:
> The question I am interested in is: can you modify the heuristic to take
> into account the execution time of tasks when updating the site score?

I thought I mentioned I can.
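
To be concrete, a time-weighted update could look roughly like the
sketch below. The names are made up and this is not the actual Karajan
heuristic, just an illustration of the idea:

// Hypothetical sketch of a site score weighted by task duration.
// Not the actual Karajan heuristic.
public class SiteScore {
    private double score = 0.0;

    // Weight each outcome by how long the task took, so thousands of
    // 10 ms failures cannot cancel out thousands of 100 s successes.
    public synchronized void update(boolean success, double seconds) {
        score += success ? seconds : -seconds;
    }

    public synchronized double get() {
        return score;
    }
}

With a count-based score, 1000 successes and 1000 failures cancel out
and the throttle collapses; weighted by duration, 1000 x 100 s of
successes (+100000) is barely dented by 1000 x 0.01 s of failures
(-10), so one bad node no longer drags the whole site down.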

>   I think it is important that you use only the execution time (and not
> Falkon queue time + execution time + result delivery time); in that
> case, how does Falkon pass this information back to Swift?

I thought I mentioned why that's not a good idea. Here's the short
version: if Falkon is slow for some reason, that needs to be taken into
account. Excluding it from the measurements, under the assumption that
it will always be fast, is not a particularly good idea. And if it is
always fast, then it doesn't matter much, since it won't add much
overhead.
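
Put differently, nothing needs to be passed back at all: the submit
side can time the whole round trip itself. A rough sketch, with
hypothetical FalkonClient/TaskResult interfaces standing in for the
real submission API and reusing the SiteScore sketch above:

// Hypothetical stand-ins for the real submission API.
interface TaskResult { boolean isSuccessful(); }
interface FalkonClient { TaskResult submitAndWait(String task) throws Exception; }

public class TurnaroundTimer {
    // Time the task end to end at the submitter: queue time, execution
    // and result delivery are all included, so a slow Falkon lowers
    // the score just like a slow application would.
    public static double submitAndScore(FalkonClient client, String task,
                                        SiteScore score) throws Exception {
        long start = System.currentTimeMillis();
        TaskResult result = client.submitAndWait(task); // blocks until the result arrives
        double seconds = (System.currentTimeMillis() - start) / 1000.0;
        score.update(result.isSuccessful(), seconds);
        return seconds;
    }
}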

> 
> Ioan
> 
> Mihael Hategan wrote: 
> > On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
> >   
> > > On Mon, 27 Aug 2007, Ioan Raicu wrote:
> > > 
> > >     
> > > > On a similar note, IMO, the heuristic in Karajan should be modified to take
> > > > into account the task execution time of the failed or successful task, and not
> > > > just the number of tasks.  This would ensure that Swift does not throttle task
> > > > submission to Falkon when there are 1000s of successful tasks that take on the
> > > > order of 100s of seconds to complete, yet there are also 1000s of failed tasks
> > > > that are only 10 ms long.  This is exactly the case with MolDyn when we get a
> > > > bad node among 100s of nodes, which ends up throttling the number of active
> > > > and running tasks to about 100, regardless of the number of processors Falkon
> > > > has. 
> > > >       
> > > Is that different from when submitting to PBS or GRAM where there are 
> > > 1000s of successful tasks taking 100s of seconds to complete but with 
> > > 1000s of failed tasks that are only 10ms long?
> > >     
> > 
> > In your scenario, assuming that GRAM and PBS do work (since some jobs
> > succeed), you can't really submit that fast. So the same thing would
> > happen, only slower. Unfortunately, in the PBS case, there's not much
> > that can be done except to throttle until no more jobs are running at
> > one time than there are good nodes.
> > 
> > Now, there is the probing part, which makes the system start with a
> > lower throttle that increases until problems appear. If this is
> > disabled (as it was in the MolDyn run), large numbers of parallel jobs
> > will be submitted, causing a large number of failures.
> > 
> > So this whole thing is close to a linear system with negative feedback.
> > If the initial state is very far away from stability, there will be
> > large transients. You're more than welcome to study how to make it
> > converge faster, or how to guess the initial state better (knowing the
> > number of nodes a cluster has would be a step).
> > 
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
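
P.S. To illustrate the probing/negative-feedback behavior described
above: start with a low limit on parallel jobs, raise it while things
succeed, and back off when failures appear. Again, hypothetical names,
and not necessarily the rules the actual Karajan scheduler uses:

// Hypothetical sketch of a probing throttle with negative feedback.
public class ProbingThrottle {
    private double limit;       // currently allowed parallel jobs
    private final double max;   // hard upper bound

    public ProbingThrottle(double initialLimit, double max) {
        this.limit = initialLimit; // start low and probe upward
        this.max = max;
    }

    public synchronized void report(boolean success) {
        if (success) {
            limit = Math.min(max, limit + 1.0);  // additive increase
        } else {
            limit = Math.max(1.0, limit / 2.0);  // multiplicative decrease
        }
    }

    public synchronized int allowedParallelJobs() {
        return (int) limit;
    }
}

The closer the initial limit is to the site's actual capacity (e.g.
its node count), the smaller the transient before the limit settles.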



