[Swift-devel] Re: Swift jobs on UC/ANL TG
Ian Foster
foster at mcs.anl.gov
Sun Feb 3 22:05:03 CST 2008
Mihael:
The point of my mail was to express what I think our priorities should be.
It would be useful to have a discussion of what our priorities are, and
how they differ from what I think they should be. But probably we
shouldn't do that via email.
Ian.
Mihael Hategan wrote:
> If you want to prioritize things differently, then please do so from the
> beginning instead of pointing out the priorities were wrong after a
> while. So please stop doing this. It is frustrating and it is not what I
> signed up for.
>
> Mihael
>
> On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote:
>
>> Mihael:
>>
>> The motivation for doing the tests is so that we can provide
>> appropriate advice to Mike, our super-high-priority Swift user who we
>> want to help as much and as quickly as possible. I'm concerned that we
>> don't seem to feel any sense of urgency in doing this. I'd like to
>> emphasize that the sole reason for anyone funding work on Swift is
>> because they believe us when we say that Swift can help people make
>> more effective use of high-performance computing systems (parallel and
>> grid). Mike K. is our most engaged and committed user, and if he is
>> successful, will bring us fame and fortune (and fun, I think, to
>> provide three Fs!). It shouldn't take a week for us to get back to him
>> with information on how to run his application efficiently on TG.
>>
>> Ian.
>>
>> Mihael Hategan wrote:
>>
>>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote:
>>>
>>>
>>>> Mihael:
>>>>
>>>> Is there any chance you can try GRAM4, as was requested early last
>>>> week?
>>>>
>>>>
>>> For the tests, sure. That's a big part of why I'm doing them.
>>>
>>> If we're talking about the workflow that seems to be repeatedly killing
>>> tg-grid1, then Mike Kubal would be the right person to ask.
>>>
>>>
>>>
>>>> Ian.
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> So I was trying some stuff on Friday night. I guess I've found the
>>>>> strategy on when to run the tests: when nobody else has jobs there
>>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
>>>>> running, and the occasional Inca tests).
>>>>>
>>>>> In any event, the machine jumps to about 100% utilization at around 130
>>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
>>>>> 1 in swift.properties.
>>>>>
>>>>> There's still more work I need to do test-wise.
>>>>>
>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Mike, you're killing tg-grid1 again. Can someone work with Mike to get
>>>>>> some Swift settings that don't kill our server?
>>>>>>
>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>>> using Swift.
>>>>>>>
>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>
>>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
>>>>>>>>
>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>> 479
>>>>>>>>
>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>> GRAM Authentication test successful
>>>>>>>> real 0m26.134s
>>>>>>>> user 0m0.090s
>>>>>>>> sys 0m0.010s
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>> became unresponsive and had to be rebooted. I am now seeing slow
>>>>>>>>> response times from the Gatekeeper there again. Authenticating to
>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>
>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>> GRAM Authentication test successful
>>>>>>>>> real 0m16.096s
>>>>>>>>> user 0m0.060s
>>>>>>>>> sys 0m0.020s
>>>>>>>>>
>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>
>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
>>>>>>>>>
>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>
>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>> 380
>>>>>>>>>
>>>>>>>>> I assume that Mike is using swift to do the job submission. Is
>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>> some? (Or has that already been done since earlier today?) The
>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> joe.
>>>>>>>>>
>>>>>>>>> ===================================================
>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>
>>>>>>>> ===================================================
>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> Swift-devel mailing list
>>>>>> Swift-devel at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>
>
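
Editorial sketch, not part of the original thread: the one concrete
configuration change requested above is to set throttle.score.job.factor
to 1 in swift.properties. The Python helper below shows one way such an
edit could be scripted. The function name set_property and the starting
value 4 are hypothetical, chosen only for illustration, and the helper
assumes plain key=value lines rather than the full Java properties syntax.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties-file text with `key` set to `value`.

    Replaces the first uncommented `key=...` line, or appends a new
    line if the key is absent. Simplification: only `key=value` lines
    are recognised, not the `:` or whitespace separators that Java
    properties files also allow.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and blank lines untouched.
        if not stripped or stripped.startswith("#"):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = new_line
            break
    else:
        # Key not found anywhere: append it at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical starting content; the value 4 is an illustrative placeholder,
# not a value taken from the thread.
original = "# swift.properties\nthrottle.score.job.factor=4\n"
updated = set_property(original, "throttle.score.job.factor", "1")
print(updated)
```

The resulting file would contain the line throttle.score.job.factor=1,
which is the setting Mihael asked Mike to use.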