[Swift-devel] Re: Swift jobs on UC/ANL TG

Mihael Hategan hategan at mcs.anl.gov
Mon Feb 4 10:47:33 CST 2008


On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote:
> Then I'd say we have very different levels of acceptable.

Yes, that's why we're having this discussion.

>  A simple job  
> submission test should never take longer than 5 minutes to complete  
> and a load of 27 is not acceptable when the responsiveness of the  
> machine is impacted. And since we're having this conversation, there  
> is a perceived problem on our end so an adjustment to our definition  
> of acceptable is needed.

And we need to adjust our definition of not-acceptable. So we need to
meet in the middle.

So, is 25 (sustained) a reasonably acceptable average load? That amounts
to about 13 hungry processes per cpu. Even with a 100Hz time slice, each
process would get 8 slices per second on average.
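
For anyone who wants to check that arithmetic, here is a rough sketch. It
assumes the load average is simply the number of runnable processes, that
they are spread evenly over the 2 cpus, and that each scheduler tick hands
out exactly one slice. The function name and parameters are made up for
illustration; a real scheduler is not this simple.

```python
def slices_per_process(load_avg, cpus, tick_hz=100):
    """Return (runnable processes per cpu, time slices per second per process).

    Simplifying assumptions: load_avg counts only runnable processes,
    they are spread evenly across cpus, and one tick equals one slice.
    """
    per_cpu = load_avg / cpus
    return per_cpu, tick_hz / per_cpu


per_cpu, slices = slices_per_process(25, 2)
print(f"{per_cpu:.1f} processes per cpu, {slices:.1f} slices/s each")
# prints: 12.5 processes per cpu, 8.0 slices/s each
```

Plugging in the load of ~27 from the inca report gives about 7.4 slices
per second instead of 8, so the two figures are not far apart.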

> 
> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
> 
> >
> > On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
> >> That inca tests were timing out after 5 minutes and the load on the
> >> machine was ~27. How are you concluding when things aren't acceptable?
> >
> > It's got 2 cpus. So to me an average load of under 100 and the SSH
> > session being responsive looks fine.
> >
> > The fact that inca tests are timing out may be because inca has too low
> > of a tolerance for things.
> >
> >>
> >> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
> >>
> >>> That's odd. Clearly if that's not acceptable from your perspective, yet
> >>> I thought 130 are fine, there's a disconnect between what you think is
> >>> acceptable and what I think is acceptable.
> >>>
> >>> What was it that prompted you to conclude things are bad?
> >>>
> >>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
> >>>> Around 80.
> >>>>
> >>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
> >>>>
> >>>>>
> >>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
> >>>>>> Sorry for killing the server. I'm pushing to get
> >>>>>> results to guide the selection of compounds for
> >>>>>> wet-lab testing.
> >>>>>>
> >>>>>> I had set the throttle.score.job.factor to 1 in the
> >>>>>> swift.properties file.
> >>>>>
> >>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
> >>>>>
> >>>>> Mihael
> >>>>>
> >>>>>>
> >>>>>> I certainly appreciate everyone's efforts and
> >>>>>> responsiveness.
> >>>>>>
> >>>>>> Let me know what to try next, before I kill again.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>>>
> >>>>>>> So I was trying some stuff on Friday night. I guess
> >>>>>>> I've found the
> >>>>>>> strategy on when to run the tests: when nobody else
> >>>>>>> has jobs there
> >>>>>>> (besides Buzz doing gridftp tests, Ioan having some
> >>>>>>> Falkon workers
> >>>>>>> running, and the occasional Inca tests).
> >>>>>>>
> >>>>>>> In any event, the machine jumps to about 100%
> >>>>>>> utilization at around 130
> >>>>>>> jobs with pre-ws gram. So Mike, please set
> >>>>>>> throttle.score.job.factor to
> >>>>>>> 1 in swift.properties.
> >>>>>>>
> >>>>>>> There's still more work I need to do test-wise.
> >>>>>>>
> >>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
> >>>>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get
> >>>>>>>> some swift settings that don't kill our server?
> >>>>>>>>
> >>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
> >>>>>>>>
> >>>>>>>>> Yes, I'm submitting molecular dynamics simulations
> >>>>>>>>> using Swift.
> >>>>>>>>>
> >>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>>>>>>
> >>>>>>>>>> Actually, these numbers are now escalating...
> >>>>>>>>>>
> >>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>
> >>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>  479
> >>>>>>>>>>
> >>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>> real    0m26.134s
> >>>>>>>>>> user    0m0.090s
> >>>>>>>>>> sys     0m0.010s
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>>>>>>>> the gatekeeper should only take a second or two, but it is
> >>>>>>>>>>> periodically taking up to 16 seconds:
> >>>>>>>>>>>
> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>>> real    0m16.096s
> >>>>>>>>>>> user    0m0.060s
> >>>>>>>>>>> sys     0m0.020s
> >>>>>>>>>>>
> >>>>>>>>>>> looking at the load on tg-grid, it is rather high:
> >>>>>>>>>>>
> >>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>>
> >>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
> >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>> 380
> >>>>>>>>>>>
> >>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
> >>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
> >>>>>>>>>>> the gatekeeper that could be done that would lighten this load
> >>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
> >>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> joe.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> ===================================================
> >>>>>>>>>>> joseph a. insley                                insley at mcs.anl.gov
> >>>>>>>>>>> mathematics & computer science division         (630) 252-5649
> >>>>>>>>>>> argonne national laboratory                     (630) 252-5986 (fax)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ===================================================
> >>>>>>>>>> joseph a. insley                                insley at mcs.anl.gov
> >>>>>>>>>> mathematics & computer science division         (630) 252-5649
> >>>>>>>>>> argonne national laboratory                     (630) 252-5986 (fax)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Swift-devel mailing list
> >>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>>
> >>>>>>>
> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> 



