[Swift-devel] Re: [Swft] Q about throttling

Mihael Hategan hategan at mcs.anl.gov
Sat Jun 23 15:47:01 CDT 2007


On Fri, 2007-06-22 at 15:27 -0500, Mike Wilde wrote:
> [forgot to hit send on this - my apologies if it's no longer relevant]
> 
> OK, thanks, Yong.
> 
> Regarding the retry delay, I phrased the question poorly. I meant:
> 
> Is it possible that the 2500 failing jobs are being retried too slowly? I.e., that
> Karajan delays each re-run after a failure, and thus can't keep Falkon fed with
> retried jobs at a high rate?

It does not explicitly delay anything. But the cumulative cost of
2500 * [many things to do] becomes visible.
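
To make that arithmetic concrete, here is a rough back-of-envelope sketch. It
assumes the retry path handles jobs one after another with a fixed per-job
overhead; that serialization, the overhead values, and the function names are
all hypothetical and not measured from Swift or Karajan. Only the 2500 failed
jobs and the ~200 second task length come from this thread.

```python
def resubmission_time(num_jobs, per_job_overhead_s):
    """Wall time to push num_jobs through a serialized retry path.

    Assumes (hypothetically) that jobs are handled one at a time and
    each costs the same fixed overhead before it reaches the queue.
    """
    return num_jobs * per_job_overhead_s


def sustainable_executors(per_job_overhead_s, task_duration_s):
    """Number of executors a serialized submitter can keep busy.

    A submitter emitting one task every per_job_overhead_s seconds can
    feed at most task_duration_s / per_job_overhead_s executors that
    each hold a task for task_duration_s seconds.
    """
    return task_duration_s / per_job_overhead_s


if __name__ == "__main__":
    failed_jobs = 2500        # from the thread
    task_duration_s = 200.0   # from the thread ("about 200 seconds long")

    # Hypothetical per-retry overheads, for illustration only.
    for overhead_s in (0.01, 0.1, 0.5, 1.0):
        total = resubmission_time(failed_jobs, overhead_s)
        busy = sustainable_executors(overhead_s, task_duration_s)
        print(f"overhead {overhead_s:>5.2f}s/job: "
              f"{total:8.1f}s to resubmit all, "
              f"can feed about {busy:8.0f} executors")
```

The point is only that no explicit delay is needed for retries to trickle out:
any per-job cost, multiplied by 2500, sets a ceiling on the resubmission rate.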

> 
> - Mike
> 
> 
> Yong Zhao wrote, On 6/22/2007 9:45 AM:
> > The retry mechanism is currently in some karajan script, and we can easily
> > add some delay there.
> > 
> > There is no configuration option to disable pipelining. I did that
> > manually (modified some code segment) to get a perf chart.
> > 
> > Yong.
> > 
> > On Fri, 22 Jun 2007, Mike Wilde wrote:
> > 
> >> Is there a configurable retry delay after failure?
> >>
> >> I think you need to examine the overall workflow dependency structure.
> >>
> >> Also, I recall from older perf charts that there's an option to enable/disable
> >> pipelining.  With pipelining disabled, it seems that Swift will wait for an
> >> entire dataset/foreach or procedure to finish before starting any tasks that
> >> depend on the foreach or procedure.
> >>
> >> Mihael, can you look at some of these issues when you are back online and rested?
> >>
> >> - Mike
> >>
> >> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
> >>> No, I didn't keep track of this info, unless Swift does this through
> >>> some of its logs.
> >>>
> >>> Over the last week, my observations have been the following: Swift is
> >>> more than capable and willing to send out many tasks as long as they are
> >>> independent (as can be seen in this graph where probably 6800 tasks got
> >>> submitted), but thereafter, it had no other burst of task submission,
> >>> although I believe it could have sent out more.  For example, there were
> >>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
> >>> all independent), why were 2500 tasks not resubmitted all at once...
> >>> they were each about 200 seconds long, so most of them should
> >>> certainly have shown up in the wait queue.
> >>>
> >>> Ioan
> >>>
> >>> Ben Clifford wrote:
> >>>>> kept busy, and the Falkon queue length was relatively at 0... so this means
> >>>>> that Swift was not submitting fast enough to keep all the executors busy.
> >>>>>
> >>>> interesting. though around t=1000 there is a rapid burst of submission
> >>>> getting the queue length up to about 6000 in a few minutes.
> >>>>
> >>>> Do you know what the cpu time usage of the swift submitting JVM was over
> >>>> that time period?
> >>>>
> >>>>
> >>> --
> >>> ============================================
> >>> Ioan Raicu
> >>> Ph.D. Student
> >>> ============================================
> >>> Distributed Systems Laboratory
> >>> Computer Science Department
> >>> University of Chicago
> >>> 1100 E. 58th Street, Ryerson Hall
> >>> Chicago, IL 60637
> >>> ============================================
> >>> Email: iraicu at cs.uchicago.edu
> >>> Web:   http://www.cs.uchicago.edu/~iraicu
> >>>        http://dsl.cs.uchicago.edu/
> >>> ============================================
> >>>
> >> --
> >> Mike Wilde
> >> Computation Institute, University of Chicago
> >> Math & Computer Science Division
> >> Argonne National Laboratory
> >> Argonne, IL   60439    USA
> >> tel 630-252-7497 fax 630-252-1997
> >>
> > 
> > 
> 
