[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 29 14:32:24 CST 2008


On Tue, 2008-01-29 at 14:06 -0600, Stuart Martin wrote:
> This is the classic GRAM2 scaling issue, caused by each job polling
> the LRM for status.  condor-g does all sorts of things to make GRAM2
> scale in that scenario.  If Swift is not using condor-g and not doing
> the condor-g tricks, then I'd recommend that Swift switch to GRAM4.

Swift should work with GRAM4 as it is. One needs, however, to specify
that in the sites.xml file.
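For reference, a GRAM4 (WS-GRAM) site entry would look roughly like the
sketch below. This is only a hypothetical example: the pool handle, host
names, job manager, and work directory are placeholders, not the actual
uc-teragrid values.

```xml
<!-- Hypothetical sites.xml entry selecting the GT4 (WS-GRAM) execution
     provider instead of the pre-WS "gt2" gatekeeper.  The handle, URLs,
     jobmanager, and workdirectory below are placeholders. -->
<pool handle="uc-teragrid">
  <execution provider="gt4" jobmanager="PBS"
             url="tg-grid.uc.teragrid.org"/>
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"/>
  <workdirectory>/home/someuser/swiftwork</workdirectory>
</pool>
```

With provider="gt4" the submissions go through the WS-GRAM service rather
than spawning one pre-WS globus-job-manager (and its polling perl scripts)
per job on the gatekeeper host.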

> 
> -Stu
> 
> On Jan 29, 2008, at 1:57 PM, joseph insley wrote:
> 
> > I was seeing Mike's jobs show up in the queue, and running on the  
> > backend nodes, and the processes I was seeing on tg-grid appeared to  
> > be gram and not some other application, so it would seem that it was  
> > indeed using PBS.
> >
> > However, it appears to be using PRE-WS GRAM.... I still had some of  
> > the 'ps | grep kubal' output in my scrollback:
> >
> > insley at tg-grid1:~> ps -ef | grep kubal
> > kubal    16981     1  0 16:41 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal    18390     1  0 16:42 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal    18891     1  0 16:43 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal    18917     1  0 16:43 ?
> > kubal    18917     1  0 16:43 ?
> >
> > [snip]
> >
> > kubal    28200 25985  0 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_iwEHrc -c poll
> > kubal    28201 26954  1 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_lQaIPe -c poll
> > kubal    28202 19438  1 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_SPsdme -c poll
> >
> >
> > On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:
> >
> >> Can someone double check that the jobs are using PBS (and not FORK)  
> >> in GRAM?  If you are using FORK, then the high load is being caused  
> >> by the applications running on the GRAM host.  If it is PBS, then I  
> >> don't know, others might have more insight.
> >>
> >> Ioan
> >>
> >> Ian Foster wrote:
> >>> Hi,
> >>>
> >>> I've CCed Stuart Martin--I'd greatly appreciate some insights into
> >>> what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)?
> >>>
> >>> Ian.
> >>>
> >>> Michael Wilde wrote:
> >>>> [ was Re: Swift jobs on UC/ANL TG ]
> >>>>
> >>>> Hi. I'm at O'Hare and will be flying soon.
> >>>> Ben or Mihael, if you are online, can you investigate?
> >>>>
> >>>> Yes, there are significant throttles turned on by default, and  
> >>>> the system opens those very gradually.
> >>>>
> >>>> MikeK, can you post to the swift-devel list your swift.properties  
> >>>> file, command line options, and your swift source code?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> MikeW
> >>>>
> >>>>
> >>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
> >>>>> The default walltime is 15 minutes. Are you doing fork jobs or  
> >>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I  
> >>>>> thought there were throttles in place in Swift to prevent this  
> >>>>> type of overrun? Mike K, I'll need you to either stop these  
> >>>>> types of jobs until Mike W can verify throttling or only submit  
> >>>>> a few 10s of jobs at a time.
> >>>>>
> >>>>> On Jan 28, 2008, at 07:13 PM, Mike Kubal wrote:
> >>>>>
> >>>>>> Yes, I'm submitting molecular dynamics simulations
> >>>>>> using Swift.
> >>>>>>
> >>>>>> Is there a default wall-time limit for jobs on tg-uc?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
> >>>>>>
> >>>>>>> Actually, these numbers are now escalating...
> >>>>>>>
> >>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
> >>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
> >>>>>>>
> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>     479
> >>>>>>>
> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>> GRAM Authentication test successful
> >>>>>>> real    0m26.134s
> >>>>>>> user    0m0.090s
> >>>>>>> sys     0m0.010s
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>>>>>
> >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
> >>>>>>>> response times from the Gatekeeper there again.  Authenticating to
> >>>>>>>> the gatekeeper should only take a second or two, but it is
> >>>>>>>> periodically taking up to 16 seconds:
> >>>>>>>>
> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>>>>>> GRAM Authentication test successful
> >>>>>>>> real    0m16.096s
> >>>>>>>> user    0m0.060s
> >>>>>>>> sys     0m0.020s
> >>>>>>>>
> >>>>>>>> looking at the load on tg-grid, it is rather high:
> >>>>>>>>
> >>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
> >>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
> >>>>>>>>
> >>>>>>>> And there appear to be a large number of processes owned by kubal:
> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>>>>>>    380
> >>>>>>>>
> >>>>>>>> I assume that Mike is using Swift to do the job submission.  Is
> >>>>>>>> there some throttling of the rate at which jobs are submitted to
> >>>>>>>> the gatekeeper that could be done that would lighten this load
> >>>>>>>> some?  (Or has that already been done since earlier today?)  The
> >>>>>>>> current response times are not unacceptable, but I'm hoping to
> >>>>>>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> joe.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ===================================================
> >>>>>>>> joseph a. insley                        insley at mcs.anl.gov
> >>>>>>>> mathematics & computer science division       (630) 252-5649
> >>>>>>>> argonne national laboratory              (630) 252-5986 (fax)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>
> >>>
> >>
> >> -- 
> >> ==================================================
> >> Ioan Raicu
> >> Ph.D. Candidate
> >> ==================================================
> >> Distributed Systems Laboratory
> >> Computer Science Department
> >> University of Chicago
> >> 1100 E. 58th Street, Ryerson Hall
> >> Chicago, IL 60637
> >> ==================================================
> >> Email: iraicu at cs.uchicago.edu
> >> Web:   http://www.cs.uchicago.edu/~iraicu
> >> http://dev.globus.org/wiki/Incubator/Falkon
> >> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> >> ==================================================
> >> ==================================================
> >>
> >>
> >
> >
> >
> 
> 



