[Swift-devel] Swift throttling

Michael Wilde wilde at mcs.anl.gov
Mon Feb 4 08:13:36 CST 2008


Mihael, Ben - bear with me - I'd like to revisit where we are on throttling.

The following may already be in place, but I think we need to review and 
clarify it, maybe re-assess the numbers:

Seems like for both pre-WS and WS-GRAM we need to stay within two 
roughly-known limits:

- number of jobs submitted per second
- total # of jobs that can be queued at once

It seems that we need to set limits on these two parameters, *around* 
the slow-start algorithm that tries to sense a sustainable maximum rate 
of job submission.
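
To make sure we're all picturing the same thing, here is a rough sketch
of what I mean by hard, user-set limits wrapped around the slow-start
sensing (illustrative Python only, not the actual Swift code; the names
and numbers below are made up):

  import time

  # Sketch only -- NOT the real scheduler. Hypothetical names/values.
  MAX_QUEUED = 100   # hard cap: jobs allowed in the remote queue at once
  MAX_RATE = 0.5     # hard cap: job submissions per second

  class SlowStartThrottle:
      """Ramps the allowed submission rate up while jobs go well, backs
      off when the site shows strain, but never exceeds the hard caps."""

      def __init__(self):
          self.rate = 0.1        # start well below the cap
          self.queued = 0
          self.last_submit = 0.0

      def can_submit(self):
          interval = 1.0 / min(self.rate, MAX_RATE)
          return (self.queued < MAX_QUEUED
                  and time.time() - self.last_submit >= interval)

      def submitted(self):
          self.queued += 1
          self.last_submit = time.time()

      def job_succeeded(self):
          self.queued -= 1
          self.rate = min(self.rate * 1.1, MAX_RATE)   # ramp up slowly

      def job_failed_or_slow(self):
          self.queued -= 1
          self.rate = max(self.rate / 2, 0.05)         # back off hard

  # usage: ask the throttle before every GRAM submission
  throttle = SlowStartThrottle()
  if throttle.can_submit():
      # submit_job_to_gram() would go here
      throttle.submitted()

The point is that the adaptive part can ramp up or back off however it
likes, but it can never push past the two user-set caps.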

To what extent is that in the code already, and does it need improvement?

I thought that for pre-WS GRAM the parameters are approximately

- .5 jobs/sec
- < 100 jobs in queue

I realize that these can only be limited on a per-workflow basis, but 
when two workflows interact, hopefully the slow-start sensing algorithm 
will detect that the resource is already under strain and stay at a low 
submission rate.

So what I'm suggesting here is:

- we agree on some arbitrary, conservative numbers for the moment 
(until we can do more measurement)

- we modify the code so that explicit limits on the algorithm can be 
set by the user, e.g. (example settings below):
  throttle.host.submitlimit - max # of jobs that can be queued to a host
  throttle.host.submitrate  - max # of jobs/sec that can be submitted to 
                              a host (float)
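
Concretely, with the conservative numbers above, a user's 
swift.properties might then contain something like this (the first two 
names are just the ones proposed here, not properties that exist today; 
throttle.score.job.factor is the existing knob Mihael pointed Mike at):

  # proposed, conservative starting points (until we can measure more)
  throttle.host.submitlimit=100
  throttle.host.submitrate=0.5

  # existing property, set to 1 per Mihael's suggestion
  throttle.score.job.factor=1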

Does Ti's report of 80 jobs indicate that maybe even 100 jobs in the 
queue is too many (for pre-WS)?

Does this seem reasonable? If not, what is the mechanism by which we can 
reliably avoid over-running a site?

- Mike



On 2/4/08 7:16 AM, Ti Leggett wrote:
> Around 80.
> 
> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
> 
>>
>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
>>> Sorry for killing the server. I'm pushing to get
>>> results to guide the selection of compounds for
>>> wet-lab testing.
>>>
>>> I had set the throttle.score.job.factor to 1 in the
>>> swift.properties file.
>>
>> Hmm. Ti, at the time of the massacre, how many did you kill?
>>
>> Mihael
>>
>>>
>>> I certainly appreciate everyone's efforts and
>>> responsiveness.
>>>
>>> Let me know what to try next, before I kill again.
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>>
>>>
>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>
>>>> So I was trying some stuff on Friday night. I guess I've found the
>>>> strategy on when to run the tests: when nobody else has jobs there
>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
>>>> running, and the occasional Inca tests).
>>>>
>>>> In any event, the machine jumps to about 100% utilization at around
>>>> 130 jobs with pre-ws gram. So Mike, please set
>>>> throttle.score.job.factor to 1 in swift.properties.
>>>>
>>>> There's still more work I need to do test-wise.
>>>>
>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to
>>>>> get some swift settings that don't kill our server?
>>>>>
>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>
>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>
>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> Actually, these numbers are now escalating...
>>>>>>>
>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>
>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>    479
>>>>>>>
>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>> GRAM Authentication test successful
>>>>>>> real    0m26.134s
>>>>>>> user    0m0.090s
>>>>>>> sys     0m0.010s
>>>>>>>
>>>>>>>
>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>
>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>
>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>> GRAM Authentication test successful
>>>>>>>> real    0m16.096s
>>>>>>>> user    0m0.060s
>>>>>>>> sys     0m0.020s
>>>>>>>>
>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>
>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>
>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>   380
>>>>>>>>
>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> joe.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ===================================================
>>>>>>>> joseph a. insley                        insley at mcs.anl.gov
>>>>>>>> mathematics & computer science division    (630) 252-5649
>>>>>>>> argonne national laboratory                 (630) 252-5986 (fax)
>>>>>>>
>>>>>>> ===================================================
>>>>>>> joseph a. insley                        insley at mcs.anl.gov
>>>>>>> mathematics & computer science division    (630) 252-5649
>>>>>>> argonne national laboratory                 (630) 252-5986 (fax)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>
>>
> 


