[Swift-devel] Re: Swift jobs on UC/ANL TG

Mihael Hategan hategan at mcs.anl.gov
Mon Feb 4 17:16:05 CST 2008


On Mon, 2008-02-04 at 16:32 -0600, Mihael Hategan wrote:
> So WS-GRAM in terms of machine load seems to work better (i.e. barely
> visible), which is to be expected. Swift does however run out of memory
> faster. Whereas I could safely (from the client side perspective) run
> 256 parallel jobs with 

... pre-WS-GRAM and...

> the default 64M of heap space, with WS-GRAM it
> dies.
> 
> I don't have an exact dependence of load vs. number of jobs yet, but
> I'll be working on that.
> 
> Mihael
> 
> On Mon, 2008-02-04 at 10:48 -0600, Mihael Hategan wrote:
> > Yes, and I will. But unless we're completely dropping support for pre-ws
> > GRAM, we still need to do this.
> > 
> > 
> > On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote:
> > > It would be really wonderful if someone can try GRAM4, which we believe 
> > > addresses this problem.
> > > 
> > > Ian.
> > > 
> > > Ti Leggett wrote:
> > > > Then I'd say we have very different levels of acceptable. A simple job 
> > > > submission test should never take longer than 5 minutes to complete 
> > > > and a load of 27 is not acceptable when the responsiveness of the 
> > > > machine is impacted. And since we're having this conversation, there 
> > > > is a perceived problem on our end so an adjustment to our definition 
> > > > of acceptable is needed.
> > > >
> > > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote:
> > > >
> > > >>
> > > >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote:
> > > > >>> The Inca tests were timing out after 5 minutes and the load on the
> > > >>> machine was ~27. How are you concluding when things aren't acceptable?
> > > >>
> > > >> It's got 2 cpus. So to me an average load of under 100 and the SSH
> > > >> session being responsive looks fine.
> > > >>
> > > >> The fact that inca tests are timing out may be because inca has too low
> > > >> of a tolerance for things.
> > > >>
> > > >>>
> > > >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote:
> > > >>>
> > > > >>>> That's odd. Clearly if that's not acceptable from your perspective, yet
> > > > >>>> I thought 130 jobs were fine, there's a disconnect between what you think
> > > > >>>> is acceptable and what I think is acceptable.
> > > >>>>
> > > > >>>> What was it that prompted you to conclude things are bad?
> > > >>>>
> > > >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote:
> > > >>>>> Around 80.
> > > >>>>>
> > > >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote:
> > > >>>>>
> > > >>>>>>
> > > >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote:
> > > >>>>>>> Sorry for killing the server. I'm pushing to get
> > > >>>>>>> results to guide the selection of compounds for
> > > >>>>>>> wet-lab testing.
> > > >>>>>>>
> > > >>>>>>> I had set the throttle.score.job.factor to 1 in the
> > > >>>>>>> swift.properties file.
> > > >>>>>>
> > > >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill?
> > > >>>>>>
> > > >>>>>> Mihael
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> I certainly appreciate everyone's efforts and
> > > >>>>>>> responsiveness.
> > > >>>>>>>
> > > >>>>>>> Let me know what to try next, before I kill again.
> > > >>>>>>>
> > > >>>>>>> Cheers,
> > > >>>>>>>
> > > >>>>>>> Mike
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > >>>>>>>
> > > > >>>>>>>> So I was trying some stuff on Friday night. I guess I've found the
> > > > >>>>>>>> strategy on when to run the tests: when nobody else has jobs there
> > > > >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers
> > > > >>>>>>>> running, and the occasional Inca tests).
> > > > >>>>>>>>
> > > > >>>>>>>> In any event, the machine jumps to about 100% utilization at around 130
> > > > >>>>>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to
> > > > >>>>>>>> 1 in swift.properties.
> > > >>>>>>>>
> > > >>>>>>>> There's still more work I need to do test-wise.
> > > >>>>>>>>
> > > >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote:
> > > > >>>>>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get
> > > > >>>>>>>>> some swift settings that don't kill our server?
> > > >>>>>>>>>
> > > >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
> > > >>>>>>>>>
> > > > >>>>>>>>>> Yes, I'm submitting molecular dynamics simulations
> > > > >>>>>>>>>> using Swift.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Actually, these numbers are now escalating...
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> > > > >>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > >>>>>>>>>>>  479
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > >>>>>>>>>>> GRAM Authentication test successful
> > > >>>>>>>>>>> real    0m26.134s
> > > >>>>>>>>>>> user    0m0.090s
> > > >>>>>>>>>>> sys     0m0.010s
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > > > >>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> > > > >>>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> > > > >>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
> > > > >>>>>>>>>>>> periodically taking up to 16 seconds:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > > >>>>>>>>>>>> GRAM Authentication test successful
> > > > >>>>>>>>>>>> real    0m16.096s
> > > > >>>>>>>>>>>> user    0m0.060s
> > > > >>>>>>>>>>>> sys     0m0.020s
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> > > > >>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
> > > > >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > > >>>>>>>>>>>> 380
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
> > > > >>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
> > > > >>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
> > > > >>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
> > > > >>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
> > > > >>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>> joe.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>> ===================================================
> > > >>>>>>>>>>>> joseph a.
> > > >>>>>>>>>>>> insley
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> insley at mcs.anl.gov
> > > >>>>>>>>>>>> mathematics & computer science division
> > > >>>>>>>>>>> (630) 252-5649
> > > >>>>>>>>>>>> argonne national laboratory
> > > >>>>>>>>>>>    (630)
> > > >>>>>>>>>>>> 252-5986 (fax)
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>> ===================================================
> > > >>>>>>>>>>> joseph a. insley
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> insley at mcs.anl.gov
> > > >>>>>>>>>>> mathematics & computer science division
> > > >>>>>>>> (630)
> > > >>>>>>>>>>> 252-5649
> > > >>>>>>>>>>> argonne national laboratory
> > > >>>>>>>>>>>  (630)
> > > >>>>>>>>>>> 252-5986 (fax)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>> ____________________________________________________________________________________ 
> > > >>>>>>>
> > > >>>>>>>>>> Be a better friend, newshound, and
> > > >>>>>>>>>> know-it-all with Yahoo! Mobile.  Try it now.
> > > >>>>>>>>
> > > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > > >>>>>>>>> _______________________________________________
> > > > >>>>>>>>> Swift-devel mailing list
> > > > >>>>>>>>> Swift-devel at ci.uchicago.edu
> > > > >>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> _______________________________________________
> > > >>>>>>>> Swift-devel mailing list
> > > >>>>>>>> Swift-devel at ci.uchicago.edu
> > > >>>>>>>>
> > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > > 
> > 
> > 
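
As a rough aid to the "what is acceptable" discussion quoted above, here is a minimal sketch (not something posted in the thread) that normalizes a load average by the number of CPUs. The function name load_per_cpu is hypothetical, the CPU count of 2 is taken from the "It's got 2 cpus" remark, and the load figures are the ones quoted from top and from the Inca report. It says nothing about where the acceptable threshold should be.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# Assumption: tg-grid1 has 2 CPUs, as stated in the quoted discussion.
CPUS = 2

# Load figures quoted in the thread (1-minute averages from top, plus the
# approximate load reported when the Inca tests were timing out).
samples = [
    ("Inca timeouts", 27.0),
    ("top at 16:55", 89.59),
    ("top at 17:18", 149.02),
]

for label, load in samples:
    print(f"{label}: load {load} -> {load_per_cpu(load, CPUS):.2f} per CPU")
```

Dividing by CPU count only puts the numbers on a common scale; whether a given per-CPU value is tolerable still depends on what else the host has to serve.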



