[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 29 22:26:39 CST 2008


Gotta love the use of such fuzzy terms as "vastly better", "greatly
reduces", "roughly comparable", "virtually painless" (last one isn't
from the paper).

Well, try it out. What I remember is that it eats more memory per job on
the client side, so you probably need to:

export COG_OPTS="-Xmx512M"

... or more.
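
For reference, a minimal sketch of the two client-side settings that come up
in this thread: the lower throttle.score.job.factor (quoted further down) and
the larger client heap. The SWIFT_PROPS variable and its default path are
assumptions made up for illustration; check where your installation actually
reads swift.properties from. 512M is only the starting value suggested above
and may well need to be higher.

```shell
# Sketch only. SWIFT_PROPS and its default path are hypothetical;
# point it at the swift.properties file your installation really uses.
props="${SWIFT_PROPS:-./swift.properties}"

# Set throttle.score.job.factor to 1, replacing any earlier setting.
touch "$props"
grep -v '^throttle\.score\.job\.factor=' "$props" > "$props.tmp" || true
echo 'throttle.score.job.factor=1' >> "$props.tmp"
mv "$props.tmp" "$props"

# Give the client JVM more heap; an already exported COG_OPTS is kept as-is.
export COG_OPTS="${COG_OPTS:--Xmx512M}"
```

Rewriting the property rather than appending it keeps a single setting in the
file, so repeated runs do not pile up conflicting lines.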

On Tue, 2008-01-29 at 21:44 -0600, Ioan Raicu wrote:
> Here is a paper from TG07 that compares GRAM2 with GRAM4.  The
> conclusions of the paper are (copied and pasted from the paper at
> http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison.pdf):
>       * GRAM4 provides vastly better functionality than GRAM2, in
>         numerous respects.
>       * GRAM4 provides better scalability than GRAM2, in terms of the
>         number of concurrent jobs that can be supported. It also
>         greatly reduces load on service nodes, and permits management
>         of that load.
>       * GRAM4 performance is roughly comparable to that of GRAM2. (We
>         still need to improve sequential submission and file staging
>         performance, and we have plans for doing that, and also for
>         other performance optimizations.)
> You can draw your own conclusions once you read the paper.  I also bet
> Stu has more numbers than were reported in this paper.  From what I
> heard, GRAM2 will be optional in GT4.2, and will be phased out
> completely in GT4.4, so the upgrade to GRAM4 is inevitable.
> 
> Ioan
> 
> 
> Mihael Hategan wrote: 
> > I'm becoming confused now. Last time I spoke to Yong about WS-GRAM, it
> > was less scalable and slower (although that varied) than gt2 gram.
> > 
> > So unless I see some numbers, I personally won't believe either of the
> > statements.
> > 
> > On Tue, 2008-01-29 at 21:25 -0600, Ioan Raicu wrote:
> >   
> > > Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4) 
> > > on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM.  
> > > If I am not mistaken, all TG sites support WS-GRAM.
> > > 
> > > Ioan
> > > 
> > > Michael Wilde wrote:
> > >     
> > > > MikeK, this may be obvious but just in case:
> > > > 
> > > > On 1/29/08 8:47 PM, Mihael Hategan wrote:
> > > >       
> > > > > That and/or try using ws-gram:
> > > > > <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org" major="4"
> > > > > minor="0" patch="0"/>
> > > > >         
> > > > (this goes in the sites.xml file)
> > > > 
> > > > Q for the group: is ws-gram supported on uc.teragrid?
> > > > 
> > > >       
> > > > > On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
> > > > >         
> > > > > > > You may want to try to lower throttle.score.job.factor from 4 to 1.
> > > > > > > That will cap the number of jobs at ~100 instead of ~400.
> > > > > > 
> > > > > > Mihael
> > > > > >           
> > > > for info on setting Swift properties, see "Swift Engine Configuration" 
> > > > in the users guide at:
> > > > 
> > > > http://www.ci.uchicago.edu/swift/guides/userguide.php#properties
> > > > 
> > > > - MikeW
> > > > 
> > > >       
> > > > > > On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
> > > > > >           
> > > > > > > sorry, long day : )
> > > > > > > 
> > > > > > > 
> > > > > > > --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > > > 
> > > > > > >             
> > > > > > > > On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde
> > > > > > > > wrote:
> > > > > > > >               
> > > > > > > > > MikeK, no attachment.
> > > > > > > > > 
> > > > > > > > > > I've narrowed the cc list, and need to read back through the email thread
> > > > > > > > > > on this to see what Mihael observed.
> > > > > > > > > >
> > > > > > > > Let me summarize: too many gt2 gram jobs running
> > > > > > > > concurrently = too many
> > > > > > > > job manager processes = high load on gram node. Not
> > > > > > > > a new issue.
> > > > > > > > 
> > > > > > > >               
> > > > > > > > > - MikeW
> > > > > > > > > 
> > > > > > > > > On 1/29/08 8:00 PM, Mike Kubal wrote:
> > > > > > > > >                 
> > > > > > > > > > > The attachment contains the swift script, tc file,
> > > > > > > > > > > sites file and swift.properties file.
> > > > > > > > > > 
> > > > > > > > > > I didn't provide any additional command line
> > > > > > > > > > arguments.
> > > > > > > > > > 
> > > > > > > > > > MikeK
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --- Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > > > > > > > 
> > > > > > > > > >                   
> > > > > > > > > > > [ was Re: Swift jobs on UC/ANL TG ]
> > > > > > > > > > > 
> > > > > > > > > > > > Hi. I'm at O'Hare and will be flying soon.
> > > > > > > > > > > > Ben or Mihael, if you are online, can you
> > > > > > > > > > > > investigate?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, there are significant throttles turned on by
> > > > > > > > > > > > default, and the system opens those very gradually.
> > > > > > > > > > > >
> > > > > > > > > > > > MikeK, can you post to the swift-devel list your
> > > > > > > > > > > > swift.properties file, command line options, and your swift source code?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > 
> > > > > > > > > > > MikeW
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > On 1/29/08 8:11 AM, Ti Leggett wrote:
> > > > > > > > > > >                     
> > > > > > > > > > > > > The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs?
> > > > > > > > > > > > > You shouldn't be doing fork jobs at all. Mike W, I thought there were
> > > > > > > > > > > > > throttles in place in Swift to prevent this type of overrun? Mike K,
> > > > > > > > > > > > > I'll need you to either stop these types of jobs until Mike W can verify
> > > > > > > > > > > > > throttling or only submit a few 10s of jobs at a time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, I'm submitting molecular dynamics simulations
> > > > > > > > > > > > > > using Swift.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is there a default wall-time limit for jobs on tg-uc?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > --- joseph insley <insley at mcs.anl.gov> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >                         
> > > > > > > > > > > > > > > Actually, these numbers are now escalating...
> > > > > > > > > > > > > > > top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> > > > > > > > > > > > > > > Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > > > > > > > > > > > > > 479
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > > > > > > > > > > > > GRAM Authentication test successful
> > > > > > > > > > > > > > real    0m26.134s
> > > > > > > > > > > > > > user    0m0.090s
> > > > > > > > > > > > > > sys     0m0.010s
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > > > > > > > > > > > > > > > became unresponsive and had to be rebooted.  I am now seeing slow
> > > > > > > > > > > > > > > > response times from the Gatekeeper there again.  Authenticating to
> > > > > > > > > > > > > > > > the gatekeeper should only take a second or two, but it is
> > > > > > > > > > > > > > > > periodically taking up to 16 seconds:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > > > > > > > > > > > > > > > GRAM Authentication test successful
> > > > > > > > > > > > > > > > real    0m16.096s
> > > > > > > > > > > > > > > > user    0m0.060s
> > > > > > > > > > > > > > > > sys     0m0.020s
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > looking at the load on tg-grid, it is rather high:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> > > > > > > > > > > > > > > > Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > And there appear to be a large number of processes owned by kubal:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > > > > > > > > > > > > > > > 380
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I assume that Mike is using swift to do the job submission.  Is
> > > > > > > > > > > > > > > > there some throttling of the rate at which jobs are submitted to
> > > > > > > > > > > > > > > > the gatekeeper that could be done that would lighten this load
> > > > > > > > > > > > > > > > some?  (Or has that already been done since earlier today?)  The
> > > > > > > > > > > > > > > > current response times are not unacceptable, but I'm hoping to
> > > > > > > > > > > > > > > > avoid having the machine grind to a halt as it did earlier today.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > joe.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ===================================================
> > > > > > > > > > > > > > > > joseph a. insley                         insley at mcs.anl.gov
> > > > > > > > > > > > > > > > mathematics & computer science division  (630) 252-5649
> > > > > > > > > > > > > > > > argonne national laboratory              (630) 252-5986 (fax)
> > > > > > > === message truncated ===
> > > > > > > 
> > > > > > > 
> > > > > > >       
> > > > > > > ____________________________________________________________________________________ 
> > > > > > > 
> > > > > > > Looking for last minute shopping deals?  Find them fast with Yahoo! 
> > > > > > > Search.  
> > > > > > > http://tools.search.yahoo.com/newsearch/category.php?category=shopping
> > > > > > >             
> > > > > > _______________________________________________
> > > > > > Swift-devel mailing list
> > > > > > Swift-devel at ci.uchicago.edu
> > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > > 
> > > > > >           
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > 
> > > > > 
> > > > >         
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > 
> > > >       
> > 
> > 
> >   
> 
> -- 
> ==================================================
> Ioan Raicu
> Ph.D. Candidate
> ==================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ==================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> ==================================================
> ==================================================
> 