[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mike Kubal mikekubal at yahoo.com
Tue Jan 29 20:31:59 CST 2008


sorry, long day : )


--- Mihael Hategan <hategan at mcs.anl.gov> wrote:

> 
> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> > MikeK, no attachment.
> > 
> > I've narrowed the cc list, and need to read back through the email
> > thread on this to see what Mihael observed.
> 
> Let me summarize: too many gt2 gram jobs running concurrently = too many
> job manager processes = high load on the gram node. Not a new issue.
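
[A sketch of the swift.properties knobs behind this, for context. The property
names are the ones documented in the Swift user guide of roughly this vintage;
the values are illustrative placeholders, not taken from MikeK's configuration:

    # swift.properties (sketch; values are placeholders)
    # Score-based cap on concurrently running jobs per site. Each running
    # gt2 job keeps a job manager process alive on the gram node, so this
    # is the main bound on that load.
    throttle.score.job.factor=4
    # Caps on job submissions in flight, overall and per site.
    throttle.submit=4
    throttle.host.submit=2

Exact defaults may differ between Swift releases.]
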
> 
> > 
> > - MikeW
> > 
> > On 1/29/08 8:00 PM, Mike Kubal wrote:
> > > The attachment contains the swift script, tc file,
> > > sites file and swift.properties file.
> > > 
> > > I didn't provide any additional command line
> > > arguments.
> > > 
> > > MikeK
> > > 
> > > 
> > > --- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > 
> > >> [ was Re: Swift jobs on UC/ANL TG ]
> > >>
> > >> Hi. I'm at O'Hare and will be flying soon.
> > >> Ben or Mihael, if you are online, can you
> > >> investigate?
> > >>
> > >> Yes, there are significant throttles turned on by default, and the
> > >> system opens those very gradually.
> > >>
> > >> MikeK, can you post to the swift-devel list your swift.properties
> > >> file, command line options, and your swift source code?
> > >>
> > >> Thanks,
> > >>
> > >> MikeW
> > >>
> > >>
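
[The "opens those very gradually" behaviour MikeW describes is the score-based
site throttle: each site starts with a low score, so only a handful of jobs run
at first, and the allowance grows as jobs complete successfully. Assuming the
karajan profile keys were already supported in the Swift release in use, a
per-site override would look roughly like this (values illustrative, handle is
a placeholder):

    <pool handle="UC-ANL-TG">
      <!-- Smaller jobThrottle = fewer concurrent jobs allowed on this site. -->
      <profile namespace="karajan" key="jobThrottle">0.1</profile>
      <!-- A large initialScore skips the gradual ramp-up; leaving it low or
           unset keeps the cautious opening described above. -->
      <profile namespace="karajan" key="initialScore">1</profile>
      ...
    </pool>

The key names should be checked against the documentation for the version
actually deployed.]
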
> > >> On 1/29/08 8:11 AM, Ti Leggett wrote:
> > >>> The default walltime is 15 minutes. Are you doing fork jobs or pbs
> > >>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought
> > >>> there were throttles in place in Swift to prevent this type of
> > >>> overrun? Mike K, I'll need you to either stop these types of jobs
> > >>> until Mike W can verify throttling or only submit a few 10s of
> > >>> jobs at a time.
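
[Whether the jobs land on fork or PBS is decided by the job manager named in
the sites entry for the UC/ANL TG site. A sketch of the distinction in the
pool-style sites.xml syntax; the handle, gridftp host, and paths below are
placeholders, not MikeK's real catalog:

    <pool handle="UC-ANL-TG">
      <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
      <!-- jobmanager-pbs hands jobs to the batch system; jobmanager-fork
           would run them directly on the gram host, which is what Ti is
           asking to avoid. -->
      <jobmanager universe="vanilla" url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" />
      <workdirectory>/home/kubal/swiftwork</workdirectory>
    </pool>

Only the job manager named in the url differs between the two cases.]
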
> > >>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> > >>>> Yes, I'm submitting molecular dynamics simulations using Swift.
> > >>>>
> > >>>> Is there a default wall-time limit for jobs on tg-uc?
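
[Per Ti's reply above, the site default is 15 minutes, so longer MD runs have
to request more wall time explicitly. One place to do that is a globus profile
on the application's tc.data entry; the transformation name, path, and time
value below are made up, and the accepted time format should be checked against
the Swift user guide:

    # tc.data (sketch): site  transformation  path  type  sysinfo  profiles
    UC-ANL-TG  moldyn  /home/kubal/bin/moldyn  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"

Whether the requested time is honoured also depends on the queue limits at the
site.]
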
> > >>>>
> > >>>>
> > >>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> > >>>>
> > >>>>> Actually, these numbers are now escalating...
> > >>>>>
> > >>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> > >>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> > >>>>>
> > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > >>>>>     479
> > >>>>>
> > >>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > >>>>> GRAM Authentication test successful
> > >>>>> real    0m26.134s
> > >>>>> user    0m0.090s
> > >>>>> sys     0m0.010s
> > >>>>>
> > >>>>>
> > >>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > >>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > >>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> > >>>>>> response times from the Gatekeeper there again.  Authenticating to
> > >>>>>> the gatekeeper should only take a second or two, but it is
> > >>>>>> periodically taking up to 16 seconds:
> > >>>>>>
> > >>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > >>>>>> GRAM Authentication test successful
> > >>>>>> real    0m16.096s
> > >>>>>> user    0m0.060s
> > >>>>>> sys     0m0.020s
> > >>>>>>
> > >>>>>> looking at the load on tg-grid, it is rather high:
> > >>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> > >>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> > >>>>>> And there appear to be a large number of processes owned by kubal:
> > >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > >>>>>>    380
> > >>>>>>
> > >>>>>> I assume that Mike is using swift to do the job submission.  Is
> > >>>>>> there some throttling of the rate at which jobs are submitted to
> > >>>>>> the gatekeeper that could be done that would lighten this load
> > >>>>>> some?  (Or has that already been done since earlier today?)  The
> > >>>>>> current response times are not unacceptable, but I'm hoping to
> > >>>>>> avoid having the machine grind to a halt as it did earlier today.
> > >>>>>> Thanks,
> > >>>>>> joe.
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> ===================================================
> > >>>>>> joseph a. insley                          insley at mcs.anl.gov
> > >>>>>> mathematics & computer science division   (630) 252-5649
> > >>>>>> argonne national laboratory               (630) 252-5986 (fax)
> > >>>>>>
> > >>>>>
> > >>>>> ===================================================
> > >>>>> joseph a. insley                          insley at mcs.anl.gov
> > >>>>> mathematics & computer science division   (630) 252-5649
> > >>>>> argonne national laboratory               (630) 252-5986 (fax)
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>      
> 
=== message truncated ===


-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift_stuff.tar
Type: application/x-tar
Size: 30720 bytes
Desc: 382151955-swift_stuff.tar
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080129/38597e86/attachment.tar>

