[Swift-devel] Re: Swift jobs on UC/ANL TG

Ian Foster foster at mcs.anl.gov
Sun Feb 3 22:05:03 CST 2008


Mihael:

The point of my mail was to express what I think our priorities should be.

It would be useful to have a discussion of what our priorities are, and 
how they differ from what I think they should be. But probably we 
shouldn't do that via email.

Ian.

Mihael Hategan wrote:
> If you want to prioritize things differently, then please do so from
> the beginning, instead of pointing out after a while that the
> priorities were wrong. So please stop doing this. It is frustrating,
> and it is not what I signed up for.
>
> Mihael
>
> On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote:
>> Mihael:
>>
>> The motivation for doing the tests is so that we can provide
>> appropriate advice to Mike, our super-high-priority Swift user, whom
>> we want to help as much and as quickly as possible. I'm concerned
>> that we don't seem to feel any sense of urgency in doing this. I'd
>> like to emphasize that the sole reason anyone funds work on Swift is
>> that they believe us when we say that Swift can help people make more
>> effective use of high-performance computing systems (parallel and
>> grid). Mike K. is our most engaged and committed user, and if he is
>> successful, he will bring us fame and fortune (and fun, I think, to
>> provide three Fs!). It shouldn't take a week for us to get back to
>> him with information on how to run his application efficiently on TG.
>>
>> Ian.
>>
>> Mihael Hategan wrote: 
>>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote:
>>>> Mihael:
>>>>
>>>> Is there any chance you can try GRAM4, as was requested early last
>>>> week?
>>> For the tests, sure. That's a big part of why I'm doing them.
>>>
>>> If we're talking about the workflow that seems to be repeatedly killing
>>> tg-grid1, then Mike Kubal would be the right person to ask.
>>>
>>>> Ian.
>>>>
>>>> Mihael Hategan wrote: 
>>>>> So I was trying some stuff on Friday night. I guess I've found the
>>>>> strategy for when to run the tests: when nobody else has jobs there
>>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
>>>>> running, and the occasional Inca tests).
>>>>>
>>>>> In any event, the machine jumps to about 100% utilization at around 130
>>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
>>>>> 1 in swift.properties.
>>>>>
>>>>> There's still more work I need to do test-wise.
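
A minimal sketch of the swift.properties change suggested above, assuming
the Swift scheduler's usual score-based job throttling; the property name
is taken from Mihael's note, while the comments and placement are
illustrative rather than a tested UC/ANL TG configuration:

    # swift.properties (sketch -- adjust for your site)
    # The scheduler caps concurrent jobs per site in proportion to the
    # site's score times this factor; dropping it to 1 keeps the number
    # of simultaneous pre-WS GRAM submissions to tg-grid1 low.
    throttle.score.job.factor=1

Whether 1 is the right cap for tg-grid1 would still need to be confirmed
against the tests described above.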
>>>>>
>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
>>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get  
>>>>>> some swift settings that don't kill our server?
>>>>>>
>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>
>>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>>> using Swift.
>>>>>>>
>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>
>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>
>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>     479
>>>>>>>>
>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>> GRAM Authentication test successful
>>>>>>>> real    0m26.134s
>>>>>>>> user    0m0.090s
>>>>>>>> sys     0m0.010s
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>
>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>
>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>> GRAM Authentication test successful
>>>>>>>>> real    0m16.096s
>>>>>>>>> user    0m0.060s
>>>>>>>>> sys     0m0.020s
>>>>>>>>>
>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>
>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>
>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>
>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>    380
>>>>>>>>>
>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> joe.
>>>>>>>>>
>>>>>>>>> ===================================================
>>>>>>>>> joseph a. insley                              insley at mcs.anl.gov
>>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>>>>
>>>>>>>> ===================================================
>>>>>>>> joseph a. insley                              insley at mcs.anl.gov
>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)


