[Swift-devel] Support request: Swift jobs flooding uc-teragrid?
Mihael Hategan
hategan at mcs.anl.gov
Tue Jan 29 20:47:32 CST 2008
That and/or try using ws-gram:
<jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org" major="4"
minor="0" patch="0"/>
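
For reference, the ~100 vs. ~400 figures in the quoted suggestion below are consistent with a per-site cap of 2 + score * throttle.score.job.factor, with the site score topping out at 100. That formula is an assumption taken from the Swift user guide of this era; it is not stated anywhere in this thread. A minimal Python sketch of the arithmetic, with function and constant names made up for illustration:

```python
# Sketch only: assumes the per-site cap is 2 + score * factor, with the
# site score topping out at 100. These names are illustrative, not Swift APIs.
MAX_SITE_SCORE = 100.0  # assumed upper bound on a site's score
BASE_JOBS = 2           # assumed minimum number of concurrent jobs per site


def max_concurrent_jobs(job_factor, score=MAX_SITE_SCORE):
    """Return the assumed cap on concurrent jobs for one site."""
    return int(BASE_JOBS + score * job_factor)


if __name__ == "__main__":
    for factor in (4, 1):
        print(f"throttle.score.job.factor={factor}: "
              f"up to {max_concurrent_jobs(factor)} concurrent jobs")
```

Under that assumption the default factor of 4 allows roughly 400 concurrent jobs, which is in the same range as the 380 to 479 kubal-owned processes reported on tg-grid1 further down; a factor of 1 would bring the ceiling to roughly 100.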
On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
> You may want to try to lower throttle.score.job.factor from 4 to 1. That
> will cap the number of jobs at ~100 instead of ~400.
>
> Mihael
>
> On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
> > sorry, long day : )
> >
> >
> > --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > >
> > > On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> > > > MikeK, no attachment.
> > > >
> > > > I've narrowed the cc list, and need to read back through the email
> > > > thread on this to see what Mihael observed.
> > >
> > > Let me summarize: too many gt2 gram jobs running concurrently = too
> > > many job manager processes = high load on gram node. Not a new issue.
> > >
> > > >
> > > > - MikeW
> > > >
> > > > On 1/29/08 8:00 PM, Mike Kubal wrote:
> > > > > The attachment contains the swift script, tc file,
> > > > > sites file and swift.properties file.
> > > > >
> > > > > I didn't provide any additional command line
> > > > > arguments.
> > > > >
> > > > > MikeK
> > > > >
> > > > >
> > > > > --- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > >
> > > > >> [ was Re: Swift jobs on UC/ANL TG ]
> > > > >>
> > > > >> Hi. I'm at O'Hare and will be flying soon.
> > > > >> Ben or Mihael, if you are online, can you
> > > > >> investigate?
> > > > >>
> > > > >> Yes, there are significant throttles turned on by default, and the
> > > > >> system opens those very gradually.
> > > > >>
> > > > >> MikeK, can you post to the swift-devel list your swift.properties file,
> > > > >> command line options, and your swift source code?
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> MikeW
> > > > >>
> > > > >>
> > > > >> On 1/29/08 8:11 AM, Ti Leggett wrote:
> > > > >>> The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs?
> > > > >>> You shouldn't be doing fork jobs at all. Mike W, I thought there were
> > > > >>> throttles in place in Swift to prevent this type of overrun? Mike K,
> > > > >>> I'll need you to either stop these types of jobs until Mike W can verify
> > > > >>> throttling or only submit a few 10s of jobs at a time.
> > > > >>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> > > > >>>> Yes, I'm submitting molecular dynamics simulations
> > > > >>>> using Swift.
> > > > >>>>
> > > > >>>> Is there a default wall-time limit for jobs on tg-uc?
> > > > >>>>
> > > > >>>>
> > > > >>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> > > > >>>>
> > > > >>>>> Actually, these numbers are now escalating...
> > > > >>>>>
> > > > >>>>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
> > > > >>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
> > > > >>>>>
> > > > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > > >>>>> 479
> > > > >>>>>
> > > > >>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > > >>>>> GRAM Authentication test successful
> > > > >>>>> real 0m26.134s
> > > > >>>>> user 0m0.090s
> > > > >>>>> sys 0m0.010s
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > > > >>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > > > >>>>>> became unresponsive and had to be rebooted. I am now seeing slow
> > > > >>>>>> response times from the Gatekeeper there again. Authenticating to
> > > > >>>>>> the gatekeeper should only take a second or two, but it is
> > > > >>>>>> periodically taking up to 16 seconds:
> > > > >>>>>>
> > > > >>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > > >>>>>> GRAM Authentication test successful
> > > > >>>>>> real 0m16.096s
> > > > >>>>>> user 0m0.060s
> > > > >>>>>> sys 0m0.020s
> > > > >>>>>>
> > > > >>>>>> looking at the load on tg-grid, it is rather high:
> > > > >>>>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
> > > > >>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
> > > > >>>>>> And there appear to be a large number of processes owned by kubal:
> > > > >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > > >>>>>> 380
> > > > >>>>>>
> > > > >>>>>> I assume that Mike is using swift to do the job submission. Is
> > > > >>>>>> there some throttling of the rate at which jobs are submitted to
> > > > >>>>>> the gatekeeper that could be done that would lighten this load
> > > > >>>>>> some? (Or has that already been done since earlier today?) The
> > > > >>>>>> current response times are not unacceptable, but I'm hoping to
> > > > >>>>>> avoid having the machine grind to a halt as it did earlier today.
> > > > >>>>>> Thanks,
> > > > >>>>>> joe.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> ===================================================
> > > > >>>>>> joseph a. insley                          insley at mcs.anl.gov
> > > > >>>>>> mathematics & computer science division   (630) 252-5649
> > > > >>>>>> argonne national laboratory               (630) 252-5986 (fax)
> > > > >>>>>>
> > > > >>>>>>
> > >
> > === message truncated ===
> >
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>