[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 29 13:39:30 CST 2008


On Tue, 2008-01-29 at 13:31 -0600, Mihael Hategan wrote:
> Ah, I was seeing the 16 second submission on Teraport (a cluster at UC),
> right after an upgrade of sorts. I can ask more about this upgrade...

So teraport uses VDT. Which makes it odd. Whatever change triggers this
is present in both VDT and that SDCTTTRWSC thing teragrid uses.
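For anyone digging into the throttle question below: the knobs involved live
in swift.properties. A sketch only; the property names and values here are
from memory and may differ between Swift releases, so check the
etc/swift.properties shipped with your install for the authoritative set:

    # Hypothetical swift.properties excerpt (values illustrative, not defaults
    # guaranteed for any particular release)

    # cap on simultaneous job submissions across all sites
    throttle.submit=4

    # cap on simultaneous submissions to any single site
    throttle.host.submit=2

    # controls how fast a site's score (and thus its allowed load) ramps up
    throttle.score.job.factor=4

    # caps on concurrent file transfers and file operations
    throttle.transfers=4
    throttle.file.operations=8

Lowering throttle.submit and throttle.host.submit is the usual first step
when a gatekeeper is being flooded.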

> 
> On Tue, 2008-01-29 at 13:15 -0600, Ian Foster wrote:
> > Hi,
> > 
> > I've CCed Stuart Martin--I'd greatly appreciate some insights into what 
> > is causing this. I assume that you are using GRAM4 (aka WS-GRAM)?
> > 
> > Ian.
> > 
> > Michael Wilde wrote:
> > > [ was Re: Swift jobs on UC/ANL TG ]
> > >
> > > Hi. I'm at O'Hare and will be flying soon.
> > > Ben or Mihael, if you are online, can you investigate?
> > >
> > > Yes, there are significant throttles turned on by default, and the 
> > > system opens those very gradually.
> > >
> > > MikeK, can you post to the swift-devel list your swift.properties 
> > > file, command line options, and your swift source code?
> > >
> > > Thanks,
> > >
> > > MikeW
> > >
> > >
> > > On 1/29/08 8:11 AM, Ti Leggett wrote:
> > >> The default walltime is 15 minutes. Are you doing fork jobs or pbs 
> > >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought 
> > >> there were throttles in place in Swift to prevent this type of 
> > >> overrun? Mike K, I'll need you to either stop these types of jobs 
> > >> until Mike W can verify throttling or only submit a few 10s of jobs 
> > >> at a time.
> > >>
> > >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> > >>
> > >>> Yes, I'm submitting molecular dynamics simulations
> > >>> using Swift.
> > >>>
> > >>> Is there a default wall-time limit for jobs on tg-uc?
> > >>>
> > >>>
> > >>>
> > >>> --- joseph insley <insley at mcs.anl.gov> wrote:
> > >>>
> > >>>> Actually, these numbers are now escalating...
> > >>>>
> > >>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> > >>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> > >>>>
> > >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > >>>>     479
> > >>>>
> > >>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > >>>> GRAM Authentication test successful
> > >>>> real    0m26.134s
> > >>>> user    0m0.090s
> > >>>> sys     0m0.010s
> > >>>>
> > >>>>
> > >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > >>>>
> > >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > >>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> > >>>>> response times from the Gatekeeper there again.  Authenticating to
> > >>>>> the gatekeeper should only take a second or two, but it is
> > >>>>> periodically taking up to 16 seconds:
> > >>>>>
> > >>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > >>>>> GRAM Authentication test successful
> > >>>>> real    0m16.096s
> > >>>>> user    0m0.060s
> > >>>>> sys     0m0.020s
> > >>>>>
> > >>>>> looking at the load on tg-grid, it is rather high:
> > >>>>>
> > >>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> > >>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> > >>>>>
> > >>>>> And there appear to be a large number of processes owned by kubal:
> > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > >>>>>    380
> > >>>>>
> > >>>>> I assume that Mike is using swift to do the job submission.  Is
> > >>>>> there some throttling of the rate at which jobs are submitted to
> > >>>>> the gatekeeper that could be done that would lighten this load
> > >>>>> some?  (Or has that already been done since earlier today?)  The
> > >>>>> current response times are not unacceptable, but I'm hoping to
> > >>>>> avoid having the machine grind to a halt as it did earlier today.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> joe.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> ===================================================
> > >>>>> joseph a. insley                      insley at mcs.anl.gov
> > >>>>> mathematics & computer science division    (630) 252-5649
> > >>>>> argonne national laboratory                (630) 252-5986 (fax)
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> ===================================================
> > >>>> joseph a. insley                      insley at mcs.anl.gov
> > >>>> mathematics & computer science division    (630) 252-5649
> > >>>> argonne national laboratory                (630) 252-5986 (fax)
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >



