[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 29 13:31:53 CST 2008


Ah, I was seeing the 16 second submission on Teraport (a cluster at UC),
right after an upgrade of sorts. I can ask more about this upgrade...

On Tue, 2008-01-29 at 13:15 -0600, Ian Foster wrote:
> Hi,
> 
> I've CCed Stuart Martin--I'd greatly appreciate some insights into what 
> is causing this. I assume that you are using GRAM4 (aka WS-GRAM)?
> 
> Ian.
> 
> Michael Wilde wrote:
> > [ was Re: Swift jobs on UC/ANL TG ]
> >
> > Hi. I'm at O'Hare and will be flying soon.
> > Ben or Mihael, if you are online, can you investigate?
> >
> > Yes, there are significant throttles turned on by default, and the 
> > system opens those very gradually.
> >
> > MikeK, can you post to the swift-devel list your swift.properties 
> > file, command line options, and your swift source code?
> >
> > Thanks,
> >
> > MikeW
> >
> >
> > On 1/29/08 8:11 AM, Ti Leggett wrote:
> >> The default walltime is 15 minutes. Are you doing fork jobs or pbs 
> >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought 
> >> there were throttles in place in Swift to prevent this type of 
> >> overrun? Mike K, I'll need you to either stop these types of jobs 
> >> until Mike W can verify throttling or only submit a few 10s of jobs 
> >> at a time.
> >>
> >> On Jan 28, 2008, at 07:13 PM, Mike Kubal wrote:
> >>
> >>> Yes, I'm submitting molecular dynamics simulations
> >>> using Swift.
> >>>
> >>> Is there a default wall-time limit for jobs on tg-uc?
> >>>
> >>>
> >>>
> >>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>
> >>>> Actually, these numbers are now escalating...
> >>>>
> >>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>
> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>     479
> >>>>
> >>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>> GRAM Authentication test successful
> >>>> real    0m26.134s
> >>>> user    0m0.090s
> >>>> sys     0m0.010s
> >>>>
> >>>>
> >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>
> >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>> the gatekeeper should only take a second or two, but it is
> >>>>> periodically taking up to 16 seconds:
> >>>>>
> >>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>> GRAM Authentication test successful
> >>>>> real    0m16.096s
> >>>>> user    0m0.060s
> >>>>> sys     0m0.020s
> >>>>>
> >>>>> looking at the load on tg-grid, it is rather high:
> >>>>>
> >>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>
> >>>>> And there appear to be a large number of processes owned by kubal:
> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>    380
> >>>>>
> >>>>> I assume that Mike is using swift to do the job submission.  Is
> >>>>> there some throttling of the rate at which jobs are submitted to
> >>>>> the gatekeeper that could be done that would lighten this load
> >>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>> current response times are not unacceptable, but I'm hoping to
> >>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>
> >>>>> Thanks,
> >>>>> joe.
> >>>>>
> >>>>>
> >>>>>
> >>>>> ===================================================
> >>>>> joseph a. insley                              insley at mcs.anl.gov
> >>>>> mathematics & computer science division       (630) 252-5649
> >>>>> argonne national laboratory                   (630) 252-5986 (fax)
> >>>>>
> >>>>>
> >>>>
> >>>> ===================================================
> >>>> joseph a. insley                              insley at mcs.anl.gov
> >>>> mathematics & computer science division       (630) 252-5649
> >>>> argonne national laboratory                   (630) 252-5986 (fax)
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> 



