[Swift-devel] Re: Swift jobs on UC/ANL TG

Mihael Hategan hategan at mcs.anl.gov
Mon Feb 4 11:27:15 CST 2008


On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote:
> Load average is only an indication of what may be a problem. I've seen
> a load of 10000 on a machine that was still very responsive because
> the processes weren't CPU hungry. So load is only a small piece of
> determining acceptability. In this case the metric should be
> the response of the gatekeeper. For instance, the inca jobs were
> timing out getting a response from the gatekeeper after 5 minutes.
> This is unacceptable. I would say as soon as it takes more than a
> minute for the GK to respond, back off.

Excellent. Now we have a usable metric and value.
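
A minimal sketch of how a submitting client could act on that value. It
assumes the gatekeeper is probed with the same "globusrun -a -r" check
used elsewhere in this thread. The names time_command and next_delay,
the doubling/halving policy, and the delay bounds are hypothetical and
only for illustration; this is not how Swift's throttle actually works.

```python
import subprocess
import time

# Threshold from the discussion: back off once the gatekeeper
# needs more than a minute to respond.
GK_RESPONSE_LIMIT_S = 60.0


def time_command(argv, timeout_s=300.0):
    """Return wall-clock seconds the command took, or None if it
    failed, timed out, or could not be started."""
    start = time.monotonic()
    try:
        subprocess.run(argv, check=True, timeout=timeout_s,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    except (subprocess.SubprocessError, OSError):
        return None
    return time.monotonic() - start


def next_delay(current_delay_s, response_s,
               limit_s=GK_RESPONSE_LIMIT_S,
               min_delay_s=1.0, max_delay_s=600.0):
    """Hypothetical policy: double the pause between submissions when
    the probe was slow or failed, halve it when the probe was fast."""
    if response_s is None or response_s > limit_s:
        return min(max_delay_s, max(min_delay_s, current_delay_s * 2))
    return max(min_delay_s, current_delay_s / 2)


# Example use (needs a Globus client on the path, so left as a comment):
#   rt = time_command(["globusrun", "-a", "-r", "tg-grid.uc.teragrid.org"])
#   delay = next_delay(delay, rt)
```

The probe is kept separate from the policy so the one-minute limit can
be changed without touching how the response time is measured.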

> 
> On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote:
> 
> >
> > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote:
> >> Then I'd say we have very different levels of acceptable.
> >
> > Yes, that's why we're having this discussion.
> >
> >> A simple job
> >> submission test should never take longer than 5 minutes to complete
> >> and a load of 27 is not acceptable when the responsiveness of the
> >> machine is impacted. And since we're having this conversation, there
> >> is a perceived problem on our end so an adjustment to our definition
> >> of acceptable is needed.
> >
> > And we need to adjust our definition of not-acceptable. So we need to
> > meet in the middle.
> >
> > So, 25 (sustained) reasonably acceptable average load? That amounts to
> > about 13 hungry processes per cpu. Even with a 100Hz time slice, each
> > process would get 8 slices per second on average.
> >
> >>
> >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
> >>
> >>>
> >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
> >>>> The inca tests were timing out after 5 minutes and the load on the
> >>>> machine was ~27. How are you concluding when things aren't
> >>>> acceptable?
> >>>
> >>> It's got 2 cpus. So to me an average load of under 100 and the SSH
> >>> session being responsive looks fine.
> >>>
> >>> The fact that inca tests are timing out may be because inca has too
> >>> low of a tolerance for things.
> >>>
> >>>>
> >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
> >>>>
> >>>>> That's odd. Clearly if that's not acceptable from your perspective,
> >>>>> yet I thought 130 are fine, there's a disconnect between what you
> >>>>> think is acceptable and what I think is acceptable.
> >>>>>
> >>>>> What was it that prompted you to conclude things are bad?
> >>>>>
> >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
> >>>>>> Around 80.
> >>>>>>
> >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
> >>>>>>>> Sorry for killing the server. I'm pushing to get
> >>>>>>>> results to guide the selection of compounds for
> >>>>>>>> wet-lab testing.
> >>>>>>>>
> >>>>>>>> I had set the throttle.score.job.factor to 1 in the
> >>>>>>>> swift.properties file.
> >>>>>>>
> >>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
> >>>>>>>
> >>>>>>> Mihael
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I certainly appreciate everyone's efforts and
> >>>>>>>> responsiveness.
> >>>>>>>>
> >>>>>>>> Let me know what to try next, before I kill again.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> Mike
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>>>>>
> >>>>>>>>> So I was trying some stuff on Friday night. I guess
> >>>>>>>>> I've found the
> >>>>>>>>> strategy on when to run the tests: when nobody else
> >>>>>>>>> has jobs there
> >>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some
> >>>>>>>>> Falkon workers
> >>>>>>>>> running, and the occasional Inca tests).
> >>>>>>>>>
> >>>>>>>>> In any event, the machine jumps to about 100%
> >>>>>>>>> utilization at around 130
> >>>>>>>>> jobs with pre-ws gram. So Mike, please set
> >>>>>>>>> throttle.score.job.factor to
> >>>>>>>>> 1 in swift.properties.
> >>>>>>>>>
> >>>>>>>>> There's still more work I need to do test-wise.
> >>>>>>>>>
> >>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
> >>>>>>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get
> >>>>>>>>>> some swift settings that don't kill our server?
> >>>>>>>>>>
> >>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations
> >>>>>>>>>>> using Swift.
> >>>>>>>>>>>
> >>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Actually, these numbers are now escalating...
> >>>>>>>>>>>>
> >>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>>>
> >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>>> 479
> >>>>>>>>>>>>
> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>>>> real    0m26.134s
> >>>>>>>>>>>> user    0m0.090s
> >>>>>>>>>>>> sys     0m0.010s
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
> >>>>>>>>>>>>> periodically taking up to 16 seconds:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>>>>> real    0m16.096s
> >>>>>>>>>>>>> user    0m0.060s
> >>>>>>>>>>>>> sys     0m0.020s
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
> >>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>>>> 380
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
> >>>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
> >>>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
> >>>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
> >>>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> joe.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ===================================================
> >>>>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
> >>>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
> >>>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> ===================================================
> >>>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
> >>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
> >>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Swift-devel mailing list
> >>>>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> 



