[Swift-devel] Re: 244 MolDyn run was successful!

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 27 15:04:00 CDT 2007


On Mon, 2007-08-27 at 14:40 -0500, Ian Foster wrote:
> It's still not clear to me why Karajan is throttling at all when
> working with Falkon. I've asked this question before, and I don't
> recall receiving a satisfactory answer. So far at least, this behavior
> has just created problems for us.

The suggestion that throttling has created "just" problems for us is,
I'd say, misleading and unnecessary. We are discussing precisely the
point that better (not necessarily more) throttling is needed to keep
the workflow from failing badly.

>  Can we turn it off?

Sure. I mentioned how to do that. Perhaps there should be an "off"
option for each throttling configuration. I'll see to that.

Mihael

> 
> Ian.
> 
> Ioan Raicu wrote: 
> > The question I am interested in, can you modify the heuristic to
> > take into account the execution time of tasks when updating the site
> > score?  I think it is important you use only the execution time (and
> > not Falkon queue time + execution time + result delivery time); in
> > this case, how does Falkon pass this information back to Swift?
> > 
> > Ioan
> > 
> > Mihael Hategan wrote: 
> > > On Mon, 2007-08-27 at 17:37 +0000, Ben Clifford wrote:
> > >   
> > > > On Mon, 27 Aug 2007, Ioan Raicu wrote:
> > > > 
> > > >     
> > > > > On a similar note, IMO, the heuristic in Karajan should be modified to take
> > > > > into account the task execution time of the failed or successful task, and not
> > > > > just the number of tasks.  This would ensure that Swift does not throttle task
> > > > > submission to Falkon when there are 1000s of successful tasks that take on the
> > > > > order of 100s of seconds to complete, while there are also 1000s of failed tasks
> > > > > that are only 10 ms long.  This is exactly what happens with MolDyn when we get
> > > > > one bad node among 100s of nodes, which ends up throttling the number of
> > > > > active and running tasks to about 100, regardless of the number of processors
> > > > > Falkon has. 
> > > > >       
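[Editor's note: the duration-weighted scoring Ioan proposes can be sketched roughly as follows. This is hypothetical Python with made-up names (`SiteScore`, `update`); it is not Karajan's actual heuristic, only an illustration of why weighting by execution time changes the outcome for the MolDyn case.]

```python
# Hypothetical sketch: each task's contribution to the site score is
# scaled by its execution time, so 1000s of 10 ms failures cannot
# outweigh 1000s of 100 s successes, even though the counts are equal.

class SiteScore:
    def __init__(self):
        self.score = 0.0  # positive = healthy site, negative = failing site

    def update(self, succeeded: bool, exec_time_s: float) -> None:
        # Weight by execution time: a 100 s success counts ~10,000x
        # more than a 10 ms failure.
        self.score += exec_time_s if succeeded else -exec_time_s


site = SiteScore()
for _ in range(1000):        # 1000 successes, 100 s each
    site.update(True, 100.0)
for _ in range(1000):        # 1000 failures from one bad node, 10 ms each
    site.update(False, 0.01)

# The score remains strongly positive, so a score-based throttle would
# not clamp submission despite the equal *count* of failed tasks.
```

With count-based scoring the same run would show 1000 successes against 1000 failures and the score would collapse; with duration weighting, the 10 ms failures barely dent it.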
> > > > Is that different from when submitting to PBS or GRAM where there are 
> > > > 1000s of successful tasks taking 100s of seconds to complete but with 
> > > > 1000s of failed tasks that are only 10ms long?
> > > >     
> > > 
> > > In your scenario, assuming that GRAM and PBS do work (since some jobs
> > > succeed), you can't really submit that fast. So the same thing would
> > > happen, but more slowly. Unfortunately, in the PBS case, there's not
> > > much that can be done except throttle until no more jobs than good
> > > nodes are running at any one time.
> > > 
> > > Now, there is the probing part, which makes the system start with a
> > > lower throttle that increases until problems appear. If this is
> > > disabled (as it was in the MolDyn run), large numbers of parallel jobs
> > > will be submitted, causing a large number of failures.
> > > 
> > > So this whole thing is close to a linear system with negative feedback.
> > > If the initial state is very far away from stability, there will be
> > > large transients. You're more than welcome to study how to make it
> > > converge faster, or how to guess the initial state better (knowing the
> > > number of nodes a cluster has would be a step).
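[Editor's note: the negative-feedback behavior Mihael describes can be sketched as a toy simulation. This is hypothetical Python (the function `simulate` and its constants are made up, not Karajan code); it only illustrates how a low starting throttle converges toward the number of good nodes, while a high one produces the "large transients".]

```python
# Toy model of a probing throttle with negative feedback:
# grow the parallelism additively while jobs succeed, and back off
# multiplicatively when submissions exceed capacity and jobs fail.

def simulate(good_nodes: int, steps: int, start: float = 4.0) -> float:
    throttle = start
    for _ in range(steps):
        submitted = int(throttle)
        # Jobs landing beyond the good-node capacity fail.
        failures = max(0, submitted - good_nodes)
        if failures:
            throttle *= 0.5   # negative feedback: back off
        else:
            throttle += 2.0   # probe upward while everything succeeds
    return throttle
```

Starting at a low throttle (4) against 100 good nodes, the system oscillates in a narrow band around capacity. Starting very far from stability (say 10000) instead produces one huge initial burst of failures before the backoff brings it into the same band, which is the large transient seen in the MolDyn run with probing disabled.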
> > > 
> > 
> > -- 
> > ============================================
> > Ioan Raicu
> > Ph.D. Student
> > ============================================
> > Distributed Systems Laboratory
> > Computer Science Department
> > University of Chicago
> > 1100 E. 58th Street, Ryerson Hall
> > Chicago, IL 60637
> > ============================================
> > Email: iraicu at cs.uchicago.edu
> > Web:   http://www.cs.uchicago.edu/~iraicu
> >        http://dsl.cs.uchicago.edu/
> > ============================================
> > ============================================
> 
> -- 
> 
>    Ian Foster, Director, Computation Institute
> Argonne National Laboratory & University of Chicago
> Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
> Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
> Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu.
>       Globus Alliance: www.globus.org.

More information about the Swift-devel mailing list