[Swift-devel] Re: Swift jobs on UC/ANL TG

Ian Foster foster at mcs.anl.gov
Mon Feb 4 10:31:59 CST 2008


It would be really wonderful if someone could try GRAM4, which we
believe addresses this problem.
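
For anyone who wants to try it: the switch should mostly be a matter of
pointing the Swift site definition at the GRAM4 (WS-GRAM) service
instead of the pre-WS gatekeeper. A rough sketch of the sites.xml entry
follows; the provider name and attributes should be checked against the
Swift docs, and every value below except tg-grid.uc.teragrid.org is a
placeholder:

  <pool handle="uc-anl-tg">
    <!-- submit through GRAM4 (WS-GRAM) via the gt4 provider;
         assumes the site's local scheduler is PBS -->
    <execution provider="gt4" jobmanager="PBS"
               url="tg-grid.uc.teragrid.org" />
    <gridftp url="gsiftp://gridftp.example.org" />
    <workdirectory>/home/username/swiftwork</workdirectory>
  </pool>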

Ian.

Ti Leggett wrote:
> Then I'd say we have very different levels of acceptable. A simple job 
> submission test should never take longer than 5 minutes to complete, 
> and a load of 27 is not acceptable when the responsiveness of the 
> machine is impacted. And since we're having this conversation, there 
> is a perceived problem on our end, so an adjustment to our definition 
> of acceptable is needed.
>
> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
>
>>
>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
>>> The inca tests were timing out after 5 minutes, and the load on the
>>> machine was ~27. How do you decide when things aren't acceptable?
>>
>> It's got 2 cpus. So to me an average load of under 100 and the SSH
>> session being responsive look fine.
>>
>> The fact that inca tests are timing out may be because inca has too
>> low a tolerance for things.
>>
>>>
>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
>>>
>>>> That's odd. If that's not acceptable from your perspective, yet I
>>>> thought 130 jobs were fine, then there's clearly a disconnect between
>>>> what you think is acceptable and what I think is acceptable.
>>>>
>>>> What was it that prompted you to conclude things are bad?
>>>>
>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
>>>>> Around 80.
>>>>>
>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
>>>>>
>>>>>>
>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
>>>>>>> Sorry for killing the server. I'm pushing to get
>>>>>>> results to guide the selection of compounds for
>>>>>>> wet-lab testing.
>>>>>>>
>>>>>>> I had set the throttle.score.job.factor to 1 in the
>>>>>>> swift.properties file.
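>>>>>>>
>>>>>>> For reference, that's the single line, in plain Java properties
>>>>>>> syntax:
>>>>>>>
>>>>>>>   throttle.score.job.factor=1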
>>>>>>
>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>>>
>>>>>>> I certainly appreciate everyone's efforts and
>>>>>>> responsiveness.
>>>>>>>
>>>>>>> Let me know what to try next, before I kill again.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>> So I was trying some stuff on Friday night. I guess I've found
>>>>>>>> the strategy for when to run the tests: when nobody else has
>>>>>>>> jobs there (besides Buzz doing gridftp tests, Ioan having some
>>>>>>>> Falkon workers running, and the occasional Inca tests).
>>>>>>>>
>>>>>>>> In any event, the machine jumps to about 100% utilization at
>>>>>>>> around 130 jobs with pre-ws gram. So Mike, please set
>>>>>>>> throttle.score.job.factor to 1 in swift.properties.
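>>>>>>>>
>>>>>>>> (Roughly, and from memory of the user guide: the scheduler lets
>>>>>>>> a site run about 2 + score * throttle.score.job.factor concurrent
>>>>>>>> jobs, with the site score capped at 100. So a factor of 1 should
>>>>>>>> keep things near 100 jobs instead of the several hundred that the
>>>>>>>> default factor allows.)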
>>>>>>>>
>>>>>>>> There's still more work I need to do test-wise.
>>>>>>>>
>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>>>>> Mike, you're killing tg-grid1 again. Can someone work with
>>>>>>>>> Mike to get some swift settings that don't kill our server?
>>>>>>>>>
>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>>>>>
>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>>
>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>> 479
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>> real    0m26.134s
>>>>>>>>>>> user    0m0.090s
>>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing
>>>>>>>>>>>> slow response times from the Gatekeeper there again.
>>>>>>>>>>>> Authenticating to the gatekeeper should only take a second or
>>>>>>>>>>>> two, but it is periodically taking up to 16 seconds:
>>>>>>>>>>>>
>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>>
>>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>>
>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>
>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>> 380
>>>>>>>>>>>>
>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> joe.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ===================================================
>>>>>>>>>>>> joseph a. insley                       insley at mcs.anl.gov
>>>>>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>>>>>> argonne national laboratory              (630) 252-5986 (fax)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ===================================================
>>>>>>>>>>> joseph a. insley                       insley at mcs.anl.gov
>>>>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>>>>> argonne national laboratory              (630) 252-5986 (fax)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>


