[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 29 20:06:10 CST 2008


On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> MikeK, no attachment.
> 
> I've narrowed the cc list, and need to read back through the email thread
> on this to see what Mihael observed.

Let me summarize: too many GT2 GRAM jobs running concurrently = too many
job manager processes = high load on the GRAM node. Not a new issue.
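
To make the chain above concrete, here is a minimal Python sketch of a
concurrency throttle. It is only an illustration, not how Swift or GRAM
implement throttling: run_throttled and max_concurrent are hypothetical
names, and each running job merely stands in for one job manager process
on the GRAM node. Capping the number of jobs in flight caps the number
of those processes.

```python
import threading
import time


def run_throttled(jobs, max_concurrent):
    """Run every callable in `jobs`, allowing at most `max_concurrent`
    to execute at the same time. Returns the peak concurrency observed.

    Hypothetical helper for illustration only; not a Swift or GRAM API.
    """
    slots = threading.BoundedSemaphore(max_concurrent)  # the throttle
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def worker(job):
        with slots:  # blocks while max_concurrent jobs are already running
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                job()
            finally:
                with lock:
                    state["active"] -= 1

    threads = [threading.Thread(target=worker, args=(job,)) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


# 50 short stand-in "jobs", never more than 5 running at once.
peak = run_throttled([lambda: time.sleep(0.01)] * 50, max_concurrent=5)
print("peak concurrent jobs:", peak)
```

Without the semaphore all 50 stand-in jobs would start at once, which is
the situation described above: one job manager process per running job.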

> 
> - MikeW
> 
> On 1/29/08 8:00 PM, Mike Kubal wrote:
> > The attachment contains the swift script, tc file,
> > sites file and swift.properties file.
> > 
> > I didn't provide any additional command line
> > arguments.
> > 
> > MikeK
> > 
> > 
> > --- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > 
> >> [ was Re: Swift jobs on UC/ANL TG ]
> >>
> > >> Hi. I'm at O'Hare and will be flying soon.
> >> Ben or Mihael, if you are online, can you
> >> investigate?
> >>
> > >> Yes, there are significant throttles turned on by default, and the
> > >> system opens those very gradually.
> > >>
> > >> MikeK, can you post to the swift-devel list your swift.properties file,
> > >> command line options, and your swift source code?
> >>
> >> Thanks,
> >>
> >> MikeW
> >>
> >>
> >> On 1/29/08 8:11 AM, Ti Leggett wrote:
> > >>> The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs?
> > >>> You shouldn't be doing fork jobs at all. Mike W, I thought there were
> > >>> throttles in place in Swift to prevent this type of overrun? Mike K,
> > >>> I'll need you to either stop these types of jobs until Mike W can verify
> > >>> throttling or only submit a few 10s of jobs at a time.
> > >>>
> > >>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> > >>>> Yes, I'm submitting molecular dynamics simulations
> > >>>> using Swift.
> > >>>>
> > >>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>
> >>>>
> >>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>
> >>>>> Actually, these numbers are now escalating...
> >>>>>
> > >>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> > >>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>
> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>     479
> >>>>>
> > >>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>> GRAM Authentication test successful
> >>>>> real    0m26.134s
> >>>>> user    0m0.090s
> >>>>> sys     0m0.010s
> >>>>>
> >>>>>
> > >>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > >>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > >>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> > >>>>>> response times from the Gatekeeper there again.  Authenticating to
> > >>>>>> the gatekeeper should only take a second or two, but it is
> > >>>>>> periodically taking up to 16 seconds:
> >>>>>>
> > >>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>> GRAM Authentication test successful
> >>>>>> real    0m16.096s
> >>>>>> user    0m0.060s
> >>>>>> sys     0m0.020s
> >>>>>>
> > >>>>>> looking at the load on tg-grid, it is rather high:
> > >>>>>>
> > >>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> > >>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> > >>>>>>
> > >>>>>> And there appear to be a large number of processes owned by kubal:
> > >>>>>>
> > >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > >>>>>>    380
> >>>>>>
> > >>>>>> I assume that Mike is using swift to do the job submission.  Is
> > >>>>>> there some throttling of the rate at which jobs are submitted to
> > >>>>>> the gatekeeper that could be done that would lighten this load
> > >>>>>> some?  (Or has that already been done since earlier today?)  The
> > >>>>>> current response times are not unacceptable, but I'm hoping to
> > >>>>>> avoid having the machine grind to a halt as it did earlier today.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> joe.
> >>>>>>
> >>>>>>
> >>>>>>
> >> ===================================================
> >>>>>> joseph a.
> >>>>>> insley
> >>>>>> insley at mcs.anl.gov
> >>>>>> mathematics & computer science division
> >>>>> (630) 252-5649
> >>>>>> argonne national laboratory
> >>>>>       (630)
> >>>>>> 252-5986 (fax)
> >>>>>>
> >>>>>>
> >>>>>
> >> ===================================================
> >>>>> joseph a. insley
> >>>>>
> >>>>> insley at mcs.anl.gov
> >>>>> mathematics & computer science division      
> >> (630)
> >>>>> 252-5649
> >>>>> argonne national laboratory
> >>>>>     (630)
> >>>>> 252-5986 (fax)
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>      
> >>>>
> > ____________________________________________________________________________________
> >>>> Be a better friend, newshound, and
> >>>> know-it-all with Yahoo! Mobile.  Try it now.  
> >>>>
> > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> >>>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >>
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>



