[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

joseph insley insley at mcs.anl.gov
Tue Jan 29 13:57:40 CST 2008


I was seeing Mike's jobs show up in the queue and running on the
backend nodes, and the processes I was seeing on tg-grid appeared to
be GRAM and not some other application, so it would seem that it was
indeed using PBS.

However, it appears to be using pre-WS GRAM. I still had some of
the 'ps -ef | grep kubal' output in my scrollback:

insley@tg-grid1:~> ps -ef | grep kubal
kubal    16981     1  0 16:41 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal    18390     1  0 16:42 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal    18891     1  0 16:43 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal    18917     1  0 16:43 ?

[snip]

kubal    28200 25985  0 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_iwEHrc -c poll
kubal    28201 26954  1 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_lQaIPe -c poll
kubal    28202 19438  1 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_SPsdme -c poll
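
The check above (PBS rather than fork, pre-WS rather than WS GRAM) can be sketched in Python. This is a minimal, hypothetical helper and not part of Swift or Globus: the function name `classify_gram_processes` and the two sample lines are assumptions for illustration, and it assumes a pre-WS install can be recognized by "prews-gram" in its path, as in the listing from tg-grid1.

```python
import re
from collections import Counter

# Hypothetical sample: two lines in the shape of the ps -ef output above.
SAMPLE = """\
kubal    16981     1  0 16:41 ?        00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal    28200 25985  0 16:50 ?        00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_iwEHrc -c poll
"""


def classify_gram_processes(ps_output, user):
    """Count a user's job-manager processes by (scheduler type, GRAM flavor)."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -ef puts the owning user in the first column.
        if not fields or fields[0] != user:
            continue
        if "globus-job-manager" not in line:
            continue
        # The job manager is started with "-type pbs", its helper script with "-m pbs".
        match = re.search(r"\s(?:-type|-m)\s+(\w+)", line)
        scheduler = match.group(1) if match else "unknown"
        # Assumption: a pre-WS GRAM install has "prews-gram" in its path.
        flavor = "pre-ws" if "prews-gram" in line else "other"
        counts[(scheduler, flavor)] += 1
    return counts


if __name__ == "__main__":
    for (scheduler, flavor), n in classify_gram_processes(SAMPLE, "kubal").items():
        print(f"{scheduler:8s} {flavor:8s} {n}")
```

In practice the `ps_output` argument would be the captured text of `ps -ef` on the GRAM host; a "fork" scheduler type in the tally would point to jobs running on the host itself.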


On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:

> Can someone double check that the jobs are using PBS (and not FORK)  
> in GRAM?  If you are using FORK, then the high load is being caused  
> by the applications running on the GRAM host.  If it is PBS, then I  
> don't know, others might have more insight.
>
> Ioan
>
> Ian Foster wrote:
>> Hi,
>>
>> I've CCed Stuart Martin--I'd greatly appreciate some insights into
>> what is causing this. I assume that you are using GRAM4 (aka
>> WS-GRAM)?
>>
>> Ian.
>>
>> Michael Wilde wrote:
>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>
>>> Hi. I'm at O'Hare and will be flying soon.
>>> Ben or Mihael, if you are online, can you investigate?
>>>
>>> Yes, there are significant throttles turned on by default, and  
>>> the system opens those very gradually.
>>>
>>> MikeK, can you post to the swift-devel list your swift.properties  
>>> file, command line options, and your swift source code?
>>>
>>> Thanks,
>>>
>>> MikeW
>>>
>>>
>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>> The default walltime is 15 minutes. Are you doing fork jobs or  
>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I  
>>>> thought there were throttles in place in Swift to prevent this  
>>>> type of overrun? Mike K, I'll need you to either stop these  
>>>> types of jobs until Mike W can verify throttling or only submit  
>>>> a few tens of jobs at a time.
>>>>
>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>
>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>> using Swift.
>>>>>
>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>
>>>>>
>>>>>
>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>
>>>>>> Actually, these numbers are now escalating...
>>>>>>
>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>
>>>>>> insley@tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>     479
>>>>>>
>>>>>> insley@tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>> GRAM Authentication test successful
>>>>>> real    0m26.134s
>>>>>> user    0m0.090s
>>>>>> sys     0m0.010s
>>>>>>
>>>>>>
>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>
>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>> periodically taking up to 16 seconds:
>>>>>>>
>>>>>>> insley@tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>> GRAM Authentication test successful
>>>>>>> real    0m16.096s
>>>>>>> user    0m0.060s
>>>>>>> sys     0m0.020s
>>>>>>>
>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>
>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>
>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>> insley@tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>    380
>>>>>>>
>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> joe.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ===================================================
>>>>>>> joseph a. insley                              insley at mcs.anl.gov
>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>
> -- 
> ==================================================
> Ioan Raicu
> Ph.D. Candidate
> ==================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ==================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> ==================================================
> ==================================================
>
>

===================================================
joseph a. insley                              insley at mcs.anl.gov
mathematics & computer science division       (630) 252-5649
argonne national laboratory                   (630) 252-5986 (fax)

