[Swift-devel] Re: [Swift] Q about throttling

Mike Wilde wilde at mcs.anl.gov
Fri Jun 22 15:27:57 CDT 2007


[Forgot to hit send on this; my apologies if it's no longer relevant.]

OK, thanks, Yong.

Regarding the retry delay, I phrased the question poorly. I meant:

Is it possible that the 2500 failing jobs are being retried too slowly? I.e., that
Karajan delays each re-run after a failure, and thus can't keep Falkon fed with
retried jobs at a high rate?
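
To make the question concrete, here is a back-of-the-envelope sketch in Python. It assumes retries are released one at a time with a fixed gap between them; the gap values are purely hypothetical, not measured Karajan behavior, and the function names are made up for illustration. The 2500 failed jobs and the roughly 200-second task length come from Ioan's report.

```python
def resubmit_time(n_jobs, delay_s):
    """Seconds needed to release n_jobs retries, one every delay_s seconds."""
    return n_jobs * delay_s


def max_busy_executors(n_jobs, delay_s, task_s):
    """Upper bound on concurrently running retried tasks.

    Little's law: arrival rate (1 / delay_s) times task length (task_s),
    capped by the total number of jobs available.
    """
    return min(n_jobs, task_s / delay_s)


N_FAILED = 2500  # failed jobs, from Ioan's report
TASK_S = 200     # approximate task length in seconds, from Ioan's report

# Hypothetical per-retry delays in seconds (assumptions, not Karajan settings).
for delay in (0.5, 1.0, 5.0):
    total = resubmit_time(N_FAILED, delay)
    busy = max_busy_executors(N_FAILED, delay, TASK_S)
    print(f"delay={delay}s: all retries out after {total:.0f}s, "
          f"at most {busy:.0f} executors busy")
```

Under these assumptions, any pool with more executors than that bound would drain the Falkon queue to zero, which would match what Ioan observed.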

- Mike


Yong Zhao wrote, On 6/22/2007 9:45 AM:
> The retry mechanism is currently in a Karajan script, and we can easily
> add some delay there.
> 
> There is no configuration option to disable pipelining. I did that
> manually (by modifying a code segment) to get a perf chart.
> 
> Yong.
> 
> On Fri, 22 Jun 2007, Mike Wilde wrote:
> 
>> Is there a configurable retry delay after failure?
>>
>> I think you need to examine the overall workflow dependency structure.
>>
>> Also, I recall from older perf charts that there's an option to enable/disable
>> pipelining.  With pipelining disabled, it seems that Swift will wait for an
>> entire dataset/foreach or procedure to finish before starting any tasks that
>> depend on the foreach or procedure.
>>
>> Mihael, can you look at some of these issues when you are back online and rested?
>>
>> - Mike
>>
>> Ioan Raicu wrote, On 6/22/2007 9:06 AM:
>>> No, I didn't keep track of this info, unless Swift does this through
>>> some of its logs.
>>>
>>> Over the last week, my observations have been the following: Swift is
>>> more than capable and willing to send out many tasks as long as they are
>>> independent (as can be seen in this graph where probably 6800 tasks got
>>> submitted), but thereafter, it had no other burst of task submission,
>>> although I believe it could have sent out more.  For example, there were
>>> 2500+ tasks that failed in the middle of those 6800 tasks (which were
>>> all independent); why were those 2500 tasks not resubmitted all at once?
>>> They were each about 200 seconds long, so most of them should
>>> certainly have shown up in the wait queue.
>>>
>>> Ioan
>>>
>>> Ben Clifford wrote:
>>>>> kept busy, and the Falkon queue length was roughly 0... so this means
>>>>> that Swift was not submitting fast enough to keep all the executors busy.
>>>>>
>>>> interesting. though around t=1000 there is a rapid burst of submission
>>>> getting the queue length up to about 6000 in a few minutes.
>>>>
>>>> Do you know what the cpu time usage of the swift submitting JVM was over
>>>> that time period?
>>>>
>>>>
>>> --
>>> ============================================
>>> Ioan Raicu
>>> Ph.D. Student
>>> ============================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ============================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>        http://dsl.cs.uchicago.edu/
>>> ============================================
>>> ============================================
>>>
>> --
>> Mike Wilde
>> Computation Institute, University of Chicago
>> Math & Computer Science Division
>> Argonne National Laboratory
>> Argonne, IL   60439    USA
>> tel 630-252-7497 fax 630-252-1997
>>
> 
> 

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997


