[Swift-devel] Re: Swift jobs on UC/ANL TG

Ti Leggett leggett at mcs.anl.gov
Thu Feb 7 15:34:46 CST 2008


This sounds like a good place to start.

On Feb 7, 2008, at 3:24 PM, Mihael Hategan wrote:

> Ok, so I'll change the scheduler feedback loop to aim towards a 20 s
> max submission time. This should apply nicely to all providers.
>
> Any objections?
>
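(For illustration: the kind of feedback loop being proposed could look like
the sketch below. This is hypothetical Python, not Swift's actual scheduler
code; only the 20 s target comes from the message above. It shrinks the
allowed submission concurrency whenever an observed submission time exceeds
the target, and grows it back slowly while submissions stay fast.)

    # Hypothetical AIMD-style throttle keyed on submission time.
    # The 20 s target is from the thread; all names, caps, and step
    # sizes are illustrative.
    TARGET_SECONDS = 20.0

    class SubmitThrottle:
        def __init__(self, cap=4, min_cap=1, max_cap=256):
            self.cap = cap              # max concurrent submissions allowed
            self.min_cap = min_cap
            self.max_cap = max_cap

        def record(self, submit_seconds):
            """Feed back one observed submission time."""
            if submit_seconds > TARGET_SECONDS:
                # Slow gatekeeper: back off multiplicatively.
                self.cap = max(self.min_cap, self.cap // 2)
            else:
                # Fast response: ramp up additively, one slot at a time.
                self.cap = min(self.max_cap, self.cap + 1)

(Additive increase with multiplicative decrease ramps up gently but reacts
quickly once the gatekeeper starts to lag, which matches the back-off
behavior discussed in the quoted thread below.)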
> On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote:
>> Load average is only an indication of what may be a problem. I've seen
>> a load of 10000 on a machine that was still very responsive, because
>> the processes weren't CPU hungry. So load is only a small piece of any
>> metric for determining acceptability. In this case the metric should be
>> the responsiveness of the gatekeeper. For instance, the inca jobs were
>> timing out waiting for a response from the gatekeeper after 5 minutes.
>> That is unacceptable. I would say that as soon as it takes more than a
>> minute for the GK to respond, back off.
>>
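(As a concrete illustration of that rule of thumb: a periodic probe could
time the same authentication-only test used later in this thread, time
globusrun -a -r, and flag when clients should back off. Hypothetical Python
sketch; the host name and the one-minute threshold come from the thread,
the wrapper itself is illustrative.)

    import subprocess
    import time

    GK_HOST = "tg-grid.uc.teragrid.org"
    BACKOFF_AFTER = 60.0  # seconds: the "more than a minute" rule above

    def gatekeeper_needs_backoff():
        """Time an authentication-only gatekeeper test (globusrun -a -r)
        and report whether submitters should back off."""
        start = time.time()
        subprocess.run(["globusrun", "-a", "-r", GK_HOST],
                       capture_output=True, timeout=300)
        return (time.time() - start) > BACKOFF_AFTER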
>> On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote:
>>
>>>
>>> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote:
>>>> Then I'd say we have very different levels of acceptable.
>>>
>>> Yes, that's why we're having this discussion.
>>>
>>>> A simple job submission test should never take longer than 5 minutes
>>>> to complete, and a load of 27 is not acceptable when the responsiveness
>>>> of the machine is impacted. And since we're having this conversation,
>>>> there is a perceived problem on our end, so an adjustment to our
>>>> definition of acceptable is needed.
>>>
>>> And we need to adjust our definition of not-acceptable, so we need to
>>> meet in the middle.
>>>
>>> So, is a sustained average load of 25 reasonably acceptable? That
>>> amounts to about 13 hungry processes per CPU. Even with a 100 Hz time
>>> slice, each process would still get about 8 slices per second on
>>> average.
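(Spelling out that arithmetic, given the machine's 2 CPUs and a 100 Hz
scheduler tick:)

    25 runnable processes / 2 CPUs          ~= 12.5 hungry processes per CPU
    100 time slices per second per CPU / 12.5 = 8 slices per second per process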
>>>
>>>>
>>>> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
>>>>
>>>>>
>>>>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
>>>>>> That inca tests were timing out after 5 minutes and the load on  
>>>>>> the
>>>>>> machine was ~27. How are you concluding when things aren't
>>>>>> acceptable?
>>>>>
>>>>> It's got 2 CPUs, so to me an average load of under 100, with the SSH
>>>>> session still responsive, looks fine.
>>>>>
>>>>> The fact that inca tests are timing out may be because inca has too
>>>>> low a tolerance for things.
>>>>>
>>>>>>
>>>>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>> That's odd. Clearly, if that's not acceptable from your perspective,
>>>>>>> yet I thought 130 jobs were fine, there's a disconnect between what
>>>>>>> you think is acceptable and what I think is acceptable.
>>>>>>>
>>>>>>> What was it that prompted you to conclude things were bad?
>>>>>>>
>>>>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
>>>>>>>> Around 80.
>>>>>>>>
>>>>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
>>>>>>>>>> Sorry for killing the server. I'm pushing to get
>>>>>>>>>> results to guide the selection of compounds for
>>>>>>>>>> wet-lab testing.
>>>>>>>>>>
>>>>>>>>>> I had set the throttle.score.job.factor to 1 in the
>>>>>>>>>> swift.properties file.
>>>>>>>>>
>>>>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
>>>>>>>>>
>>>>>>>>> Mihael
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I certainly appreciate everyone's efforts and
>>>>>>>>>> responsiveness.
>>>>>>>>>>
>>>>>>>>>> Let me know what to try next, before I kill again.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Mike
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> So I was trying some stuff on Friday night. I guess I've
>>>>>>>>>>> found the strategy for when to run the tests: when nobody
>>>>>>>>>>> else has jobs there (besides Buzz doing gridftp tests, Ioan
>>>>>>>>>>> having some Falkon workers running, and the occasional Inca
>>>>>>>>>>> tests).
>>>>>>>>>>>
>>>>>>>>>>> In any event, the machine jumps to about 100% utilization at
>>>>>>>>>>> around 130 jobs with pre-WS GRAM. So Mike, please set
>>>>>>>>>>> throttle.score.job.factor to 1 in swift.properties.
>>>>>>>>>>>
>>>>>>>>>>> There's still more work I need to do test-wise.
>>>>>>>>>>>
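(For reference, the requested change is a one-line edit to swift.properties.
The property name and the value 1 are verbatim from the message above; the
comment is editorial, based on the ~130-job saturation point measured there.)

    # Scale down how many concurrent jobs the site score allows,
    # to keep the pre-WS GRAM gatekeeper well below saturation.
    throttle.score.job.factor=1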
>>>>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>>>>>>>> Mike, you're killing tg-grid1 again. Can someone work with
>>>>>>>>>>>> Mike to get some swift settings that don't kill our server?
>>>>>>>>>>>>
>>>>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>>> 479
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>>>> real    0m26.134s
>>>>>>>>>>>>>> user    0m0.090s
>>>>>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM
>>>>>>>>>>>>>>> host) became unresponsive and had to be rebooted.  I am now
>>>>>>>>>>>>>>> seeing slow response times from the Gatekeeper there again.
>>>>>>>>>>>>>>> Authenticating to the gatekeeper should only take a second
>>>>>>>>>>>>>>> or two, but it is periodically taking up to 16 seconds:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>>>> 380
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.
>>>>>>>>>>>>>>> Is there some throttling of the rate at which jobs are
>>>>>>>>>>>>>>> submitted to the gatekeeper that could be done that would
>>>>>>>>>>>>>>> lighten this load some?  (Or has that already been done
>>>>>>>>>>>>>>> since earlier today?)  The current response times are not
>>>>>>>>>>>>>>> unacceptable, but I'm hoping to avoid having the machine
>>>>>>>>>>>>>>> grind to a halt as it did earlier today.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> joe.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ===================================================
>>>>>>>>>>>>>>> joseph a. insley                  insley at mcs.anl.gov
>>>>>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ===================================================
>>>>>>>>>>>>>> joseph a. insley                  insley at mcs.anl.gov
>>>>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



