[Swift-devel] Re: Swift jobs on UC/ANL TG
Ti Leggett
leggett at mcs.anl.gov
Mon Feb 4 10:28:38 CST 2008
Then I'd say we have very different levels of acceptable. A simple job
submission test should never take longer than 5 minutes to complete
and a load of 27 is not acceptable when the responsiveness of the
machine is impacted. And since we're having this conversation, there
is a perceived problem on our end so an adjustment to our definition
of acceptable is needed.
On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
>
> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
>> That inca tests were timing out after 5 minutes and the load on the
>> machine was ~27. How are you concluding when things aren't
>> acceptable?
>
> It's got 2 cpus. So to me an average load of under 100 and the SSH
> session being responsive looks fine.
>
> The fact that inca tests are timing out may be because inca has too
> low of a tolerance for things.
>
>>
>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
>>
>>> That's odd. Clearly if that's not acceptable from your perspective,
>>> yet I thought 130 are fine, there's a disconnect between what you
>>> think is acceptable and what I think is acceptable.
>>>
>>> What was it that prompted you to conclude things are bad?
>>>
>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
>>>> Around 80.
>>>>
>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
>>>>
>>>>>
>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
>>>>>> Sorry for killing the server. I'm pushing to get
>>>>>> results to guide the selection of compounds for
>>>>>> wet-lab testing.
>>>>>>
>>>>>> I had set the throttle.score.job.factor to 1 in the
>>>>>> swift.properties file.
>>>>>
>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
>>>>>
>>>>> Mihael
>>>>>
>>>>>>
>>>>>> I certainly appreciate everyone's efforts and
>>>>>> responsiveness.
>>>>>>
>>>>>> Let me know what to try next, before I kill again.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> So I was trying some stuff on Friday night. I guess I've found the
>>>>>>> strategy on when to run the tests: when nobody else has jobs there
>>>>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
>>>>>>> running, and the occasional Inca tests).
>>>>>>>
>>>>>>> In any event, the machine jumps to about 100% utilization at around 130
>>>>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
>>>>>>> 1 in swift.properties.
>>>>>>>
>>>>>>> There's still more work I need to do test-wise.
>>>>>>>
>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get
>>>>>>>> some swift settings that don't kill our server?
>>>>>>>>
>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>>>
>>>>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>>>>> using Swift.
>>>>>>>>>
>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>
>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
>>>>>>>>>>
>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>> 479
>>>>>>>>>>
>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>> real 0m26.134s
>>>>>>>>>> user 0m0.090s
>>>>>>>>>> sys 0m0.010s
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>
>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>>> became unresponsive and had to be rebooted. I am now seeing slow
>>>>>>>>>>> response times from the Gatekeeper there again. Authenticating to
>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>> real 0m16.096s
>>>>>>>>>>> user 0m0.060s
>>>>>>>>>>> sys 0m0.020s
>>>>>>>>>>>
>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>
>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
>>>>>>>>>>>
>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>> 380
>>>>>>>>>>>
>>>>>>>>>>> I assume that Mike is using swift to do the job submission. Is
>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>>> some? (Or has that already been done since earlier today?) The
>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> joe.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ===================================================
>>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Swift-devel mailing list
>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
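
For reference, one way to put the load figures quoted in this thread on a common footing is load per CPU. What follows is only a sketch, not anything the list participants ran: the function names are made up for illustration, and the CPU count of 2 is taken from Mihael's remark above. It deliberately does not encode any threshold for "acceptable", since that is exactly what is being debated.

```python
import re


def parse_load_averages(top_header: str) -> tuple:
    """Return the 1-, 5- and 15-minute load averages from a `top` header line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % top_header)
    return tuple(float(value) for value in match.groups())


def load_per_cpu(load: float, cpus: int) -> float:
    """Normalize a load average by the number of CPUs (hypothetical helper)."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load / cpus


if __name__ == "__main__":
    # Header line as quoted earlier in the thread; 2 CPUs is an assumption
    # based on the "It's got 2 cpus" remark.
    header = "top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94"
    for label, value in zip(("1 min", "5 min", "15 min"), parse_load_averages(header)):
        print(label, round(load_per_cpu(value, 2), 2))
```

A per-CPU load well above 1 means runnable processes are queueing for the CPUs, which fits the slow gatekeeper authentication times reported above.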