[Swift-devel] Re: Swift jobs on UC/ANL TG

Mihael Hategan hategan at mcs.anl.gov
Mon Feb 4 10:48:31 CST 2008


Yes, and I will. But unless we're completely dropping support for pre-ws
GRAM, we still need to do this.


On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote:
> It would be really wonderful if someone can try GRAM4, which we believe 
> addresses this problem.
> 
> Ian.
> 
> Ti Leggett wrote:
> > Then I'd say we have very different levels of acceptable. A simple job 
> > submission test should never take longer than 5 minutes to complete 
> > and a load of 27 is not acceptable when the responsiveness of the 
> > machine is impacted. And since we're having this conversation, there 
> > is a perceived problem on our end so an adjustment to our definition 
> > of acceptable is needed.
> >
> > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
> >
> >>
> >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
> >>> That inca tests were timing out after 5 minutes and the load on the
> >>> machine was ~27. How are you concluding when things aren't acceptable?
> >>
> >> It's got 2 cpus. So to me an average load of under 100 and the SSH
> >> session being responsive looks fine.
> >>
> >> The fact that inca tests are timing out may be because inca has too low
> >> of a tolerance for things.
> >>
> >>>
> >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
> >>>
> >>>> That's odd. Clearly if that's not acceptable from your perspective,
> >>>> yet I thought 130 jobs were fine, there's a disconnect between what
> >>>> you think is acceptable and what I think is acceptable.
> >>>>
> >>>> What was it that prompted you to conclude things are bad?
> >>>>
> >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
> >>>>> Around 80.
> >>>>>
> >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
> >>>>>
> >>>>>>
> >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
> >>>>>>> Sorry for killing the server. I'm pushing to get
> >>>>>>> results to guide the selection of compounds for
> >>>>>>> wet-lab testing.
> >>>>>>>
> >>>>>>> I had set the throttle.score.job.factor to 1 in the
> >>>>>>> swift.properties file.
> >>>>>>
> >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
> >>>>>>
> >>>>>> Mihael
> >>>>>>
> >>>>>>>
> >>>>>>> I certainly appreciate everyone's efforts and
> >>>>>>> responsiveness.
> >>>>>>>
> >>>>>>> Let me know what to try next, before I kill again.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Mike
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>>>>
> >>>>>>>> So I was trying some stuff on Friday night. I guess I've found the
> >>>>>>>> strategy on when to run the tests: when nobody else has jobs there
> >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
> >>>>>>>> running, and the occasional Inca tests).
> >>>>>>>>
> >>>>>>>> In any event, the machine jumps to about 100% utilization at around 130
> >>>>>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
> >>>>>>>> 1 in swift.properties.
> >>>>>>>>
> >>>>>>>> There's still more work I need to do test-wise.
> >>>>>>>>
> >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
> >>>>>>>>> Mike, you're killing tg-grid1 again. Can someone work with Mike to get
> >>>>>>>>> some swift settings that don't kill our server?
> >>>>>>>>>
> >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
> >>>>>>>>>
> >>>>>>>>>> Yes, I'm submitting molecular dynamics simulations
> >>>>>>>>>> using Swift.
> >>>>>>>>>>
> >>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Actually, these numbers are now escalating...
> >>>>>>>>>>>
> >>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>>
> >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>>  479
> >>>>>>>>>>>
> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>>> real    0m26.134s
> >>>>>>>>>>> user    0m0.090s
> >>>>>>>>>>> sys     0m0.010s
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
> >>>>>>>>>>>> periodically taking up to 16 seconds:
> >>>>>>>>>>>>
> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>>>>>> GRAM Authentication test successful
> >>>>>>>>>>>> real    0m16.096s
> >>>>>>>>>>>> user    0m0.060s
> >>>>>>>>>>>> sys     0m0.020s
> >>>>>>>>>>>>
> >>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
> >>>>>>>>>>>>
> >>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>>>>>>>>
> >>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
> >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>>>>> 380
> >>>>>>>>>>>>
> >>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
> >>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
> >>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
> >>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
> >>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> joe.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> ===================================================
> >>>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
> >>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
> >>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> ===================================================
> >>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
> >>>>>>>>>>> mathematics & computer science division   (630) 252-5649
> >>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Swift-devel mailing list
> >>>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> 



