[Swift-devel] Re: Swift jobs on UC/ANL TG

Mihael Hategan hategan at mcs.anl.gov
Sun Feb 3 22:39:05 CST 2008


We cannot define priorities about things we don't know. This management
by crisis (i.e. every new thing is of utmost priority, and maybe some
older things that used to be of utmost priority may or may not still be
of utmost priority) doesn't seem to work well.

Add to that the implications that x didn't do things right (so that we
make it slightly personal), and you've got a recipe for things not
working well at all.

Repeat this a few times, and even the most resilient of people will
begin having second thoughts. And the reactions to things one cannot
control are not those of fight but those of flight.

Now, onto the problem. The tests are no easy thing. I need time to find
the right settings, the right ways to do it, and the right times to do
it (the process involves getting that machine close to the point of
crashing). And then I need some way to transform seemingly garbage-like
log files into something meaningful. So no, it's not a one-day job.

In the meantime, Mike was informed about what we believe might be
better ways to make things work (throttling parameters, trying ws-gram,
local PBS).

Mihael

On Sun, 2008-02-03 at 22:05 -0600, Ian Foster wrote:
> Mihael:
> 
> The point of my mail was to express what I think our priorities should be.
> 
> It would be useful to have a discussion of what our priorities are, and 
> how they differ from what I think they should be. But probably we 
> shouldn't do that via email.
> 
> Ian.
> 
> Mihael Hategan wrote:
> > If you want to prioritize things differently, then please do so from the
> > beginning instead of pointing out the priorities were wrong after a
> > while. So please stop doing this. It is frustrating and it is not what I
> > signed up for.
> >
> > Mihael
> >
> > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote:
> >> Mihael:
> >>
> >> The motivation for doing the tests is so that we can provide
> >> appropriate advice to Mike, our super-high-priority Swift user who we
> >> want to help as much and as quickly as possible. I'm concerned that we
> >> don't seem to feel any sense of urgency in doing this. I'd like to
> >> emphasize that the sole reason for anyone funding work on Swift is
> >> because they believe us when we say that Swift can help people make
> >> more effective use of high-performance computing systems (parallel and
> >> grid). Mike K. is our most engaged and committed user, and if he is
> >> successful, will bring us fame and fortune (and fun, I think, to
> >> provide three Fs!). It shouldn't take a week for us to get back to him
> >> with information on how to run his application efficiently on TG.
> >>
> >> Ian.
> >>
> >> Mihael Hategan wrote: 
> >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote:
> >>>> Mihael:
> >>>>
> >>>> Is there any chance you can try GRAM4, as was requested early last
> >>>> week?
> >>>
> >>> For the tests, sure. That's a big part of why I'm doing them.
> >>>
> >>> If we're talking about the workflow that seems to be repeatedly killing
> >>> tg-grid1, then Mike Kubal would be the right person to ask.
> >>>
> >>>> Ian.
> >>>>
> >>>> Mihael Hategan wrote: 
> >>>>> So I was trying some stuff on Friday night. I guess I've found the
> >>>>> strategy on when to run the tests: when nobody else has jobs there
> >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
> >>>>> running, and the occasional Inca tests).
> >>>>>
> >>>>> In any event, the machine jumps to about 100% utilization at around 130
> >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
> >>>>> 1 in swift.properties.
> >>>>>
> >>>>> There's still more work I need to do test-wise.
> >>>>>
> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
> >>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get  
> >>>>>> some swift settings that don't kill our server?
> >>>>>>
> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
> >>>>>>
> >>>>>>> Yes, I'm submitting molecular dynamics simulations
> >>>>>>> using Swift.
> >>>>>>>
> >>>>>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>>>>
> >>>>>>>> Actually, these numbers are now escalating...
> >>>>>>>>
> >>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>>>>
> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>     479
> >>>>>>>>
> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>> GRAM Authentication test successful
> >>>>>>>> real    0m26.134s
> >>>>>>>> user    0m0.090s
> >>>>>>>> sys     0m0.010s
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>>>>>
> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>>>>>> the gatekeeper should only take a second or two, but it is
> >>>>>>>>> periodically taking up to 16 seconds:
> >>>>>>>>>
> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>> GRAM Authentication test successful
> >>>>>>>>> real    0m16.096s
> >>>>>>>>> user    0m0.060s
> >>>>>>>>> sys     0m0.020s
> >>>>>>>>>
> >>>>>>>>> looking at the load on tg-grid, it is rather high:
> >>>>>>>>>
> >>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>
> >>>>>>>>> And there appear to be a large number of processes owned by kubal:
> >>>>>>>>>
> >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>    380
> >>>>>>>>>
> >>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
> >>>>>>>>> there some throttling of the rate at which jobs are submitted to
> >>>>>>>>> the gatekeeper that could be done that would lighten this load
> >>>>>>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>>>>>> current response times are not unacceptable, but I'm hoping to
> >>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> joe.
> >>>>>>>>>
> >>>>>>>>> ===================================================
> >>>>>>>>> joseph a. insley                      insley at mcs.anl.gov
> >>>>>>>>> mathematics & computer science division       (630) 252-5649
> >>>>>>>>> argonne national laboratory             (630) 252-5986 (fax)
> >>>>>>>>
> >>>>>>>> ===================================================
> >>>>>>>> joseph a. insley                       insley at mcs.anl.gov
> >>>>>>>> mathematics & computer science division       (630) 252-5649
> >>>>>>>> argonne national laboratory             (630) 252-5986 (fax)
> >>>>>>>>
> >>>>>> _______________________________________________
> >>>>>> Swift-devel mailing list
> >>>>>> Swift-devel at ci.uchicago.edu
> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>
> >
> >   
> 



