[Swift-devel] Re: Swift jobs on UC/ANL TG

Ti Leggett leggett at mcs.anl.gov
Mon Feb 4 10:55:48 CST 2008


Load average is only an indication of what may be a problem. I've seen
a load of 10000 on a machine that was still very responsive, because
the processes weren't CPU hungry. So load is only a small piece of
determining acceptability. In this case the metric should be the
response of the gatekeeper. For instance, the inca jobs were timing
out after waiting 5 minutes for a response from the gatekeeper. That
is unacceptable. I would say as soon as it takes more than a minute
for the GK to respond, back off.
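Ti's back-off rule can be sketched as a small probe: time an authentication test against the gatekeeper and treat anything over a minute as a signal to throttle down. A minimal sketch in Python; the probe command shown in the comment (`globusrun -a -r tg-grid.uc.teragrid.org`) is the one used elsewhere in this thread, while the function names and the 60-second default are illustrative.

```python
import subprocess
import time

def response_time(cmd):
    """Run a probe command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.monotonic() - start

def should_back_off(seconds, limit=60.0):
    """Ti's rule of thumb: more than a minute to respond means back off."""
    return seconds > limit

# The real probe from this thread would be something like:
#   response_time(["globusrun", "-a", "-r", "tg-grid.uc.teragrid.org"])
```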

On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote:

>
> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote:
>> Then I'd say we have very different levels of acceptable.
>
> Yes, that's why we're having this discussion.
>
>> A simple job
>> submission test should never take longer than 5 minutes to complete,
>> and a load of 27 is not acceptable when the responsiveness of the
>> machine is impacted. And since we're having this conversation, there
>> is a perceived problem on our end, so an adjustment to our definition
>> of acceptable is needed.
>
> And we need to adjust our definition of not-acceptable. So we need to
> meet in the middle.
>
> So, is 25 (sustained) a reasonably acceptable average load? That
> amounts to about 13 hungry processes per CPU. Even with a 100Hz time
> slice, each process would get 8 slices per second on average.
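The arithmetic behind those figures, spelled out: a load of 25 on a 2-CPU box is 12.5 runnable processes per CPU ("about 13"), and a 100Hz scheduler hands out 100 time slices per CPU per second, so each process averages 100 / 12.5 = 8 slices per second. A quick check (variable names are illustrative):

```python
load = 25        # sustained load average under discussion
cpus = 2         # tg-grid1 has 2 CPUs
hz = 100         # 100Hz tick: 100 time slices per CPU per second

procs_per_cpu = load / cpus            # 12.5 -- "about 13 hungry processes per cpu"
slices_per_proc = hz / procs_per_cpu   # 8.0 slices per second per process
```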
>
>>
>> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
>>
>>>
>>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
>>>> The inca tests were timing out after 5 minutes and the load on the
>>>> machine was ~27. How are you concluding when things aren't
>>>> acceptable?
>>>
>>> It's got 2 CPUs. So to me an average load of under 100 and the SSH
>>> session being responsive looks fine.
>>>
>>> The fact that the inca tests are timing out may be because inca's
>>> tolerance is too low.
>>>
>>>>
>>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
>>>>
>>>>> That's odd. If that's not acceptable from your perspective, yet I
>>>>> thought 130 jobs were fine, there's clearly a disconnect between
>>>>> what you think is acceptable and what I think is acceptable.
>>>>>
>>>>> What was it that prompted you to conclude things were bad?
>>>>>
>>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
>>>>>> Around 80.
>>>>>>
>>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>>
>>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
>>>>>>>> Sorry for killing the server. I'm pushing to get
>>>>>>>> results to guide the selection of compounds for
>>>>>>>> wet-lab testing.
>>>>>>>>
>>>>>>>> I had set the throttle.score.job.factor to 1 in the
>>>>>>>> swift.properties file.
>>>>>>>
>>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
>>>>>>>
>>>>>>> Mihael
>>>>>>>
>>>>>>>>
>>>>>>>> I certainly appreciate everyone's efforts and
>>>>>>>> responsiveness.
>>>>>>>>
>>>>>>>> Let me know what to try next, before I kill again.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> So I was trying some stuff on Friday night. I guess
>>>>>>>>> I've found the
>>>>>>>>> strategy on when to run the tests: when nobody else
>>>>>>>>> has jobs there
>>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some
>>>>>>>>> Falkon workers
>>>>>>>>> running, and the occasional Inca tests).
>>>>>>>>>
>>>>>>>>> In any event, the machine jumps to about 100%
>>>>>>>>> utilization at around 130
>>>>>>>>> jobs with pre-ws gram. So Mike, please set
>>>>>>>>> throttle.score.job.factor to
>>>>>>>>> 1 in swift.properties.
>>>>>>>>>
>>>>>>>>> There's still more work I need to do test-wise.
>>>>>>>>>
>>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>>>>>> Mike, you're killing tg-grid1 again. Can someone work with
>>>>>>>>>> Mike to get some swift settings that don't kill our server?
>>>>>>>>>>
>>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>>>>>>> using Swift.
>>>>>>>>>>>
>>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>>>
>>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>
>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>> 479
>>>>>>>>>>>>
>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>> real    0m26.134s
>>>>>>>>>>>> user    0m0.090s
>>>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>>>
>>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM
>>>>>>>>>>>>> host) became unresponsive and had to be rebooted.  I am now
>>>>>>>>>>>>> seeing slow response times from the Gatekeeper there again.
>>>>>>>>>>>>> Authenticating to the gatekeeper should only take a second
>>>>>>>>>>>>> or two, but it is periodically taking up to 16 seconds:
>>>>>>>>>>>>>
>>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>>>
>>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>>
>>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>> 380
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.
>>>>>>>>>>>>> Is there some throttling of the rate at which jobs are
>>>>>>>>>>>>> submitted to the gatekeeper that could be done to lighten
>>>>>>>>>>>>> this load some?  (Or has that already been done since
>>>>>>>>>>>>> earlier today?)  The current response times are not
>>>>>>>>>>>>> unacceptable, but I'm hoping to avoid having the machine
>>>>>>>>>>>>> grind to a halt as it did earlier today.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> joe.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ===================================================
>>>>>>>>>>>>> joseph a. insley                    insley at mcs.anl.gov
>>>>>>>>>>>>> mathematics & computer science division    (630) 252-5649
>>>>>>>>>>>>> argonne national laboratory           (630) 252-5986 (fax)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
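The concrete change discussed above amounts to a one-line edit in swift.properties. The property name and value are taken from the messages in this thread; the comment lines are annotation only, not part of the stock file.

```properties
# Cap Swift's job-submission throttle so the tg-grid gatekeeper
# is not overwhelmed by pre-WS GRAM jobs (see discussion above).
throttle.score.job.factor=1
```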




More information about the Swift-devel mailing list