[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Ian Foster foster at mcs.anl.gov
Tue Jan 29 14:01:17 CST 2008


I think that using WS-GRAM is key here: it was created, and 
extensively tested, explicitly to address these concerns.

joseph insley wrote:
> I was seeing Mike's jobs show up in the queue, and running on the 
> backend nodes, and the processes I was seeing on tg-grid appeared to 
> be gram and not some other application, so it would seem that it was 
> indeed using PBS.   
>
> However, it appears to be using PRE-WS GRAM.... I still had some of 
> the 'ps | grep kubal' output in my scrollback:
>
> insley at tg-grid1:~> ps -ef | grep kubal        
> kubal    16981     1  0 16:41 ?        00:00:00 globus-job-manager 
> -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs 
> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> kubal    18390     1  0 16:42 ?        00:00:00 globus-job-manager 
> -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs 
> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> kubal    18891     1  0 16:43 ?        00:00:00 globus-job-manager 
> -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs 
> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> kubal    18917     1  0 16:43 ?
>
> [snip]
>
> kubal    28200 25985  0 16:50 ?        00:00:00 /usr/bin/perl 
> /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs 
> -f /tmp/gram_iwEHrc -c poll
> kubal    28201 26954  1 16:50 ?        00:00:00 /usr/bin/perl 
> /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs 
> -f /tmp/gram_lQaIPe -c poll
> kubal    28202 19438  1 16:50 ?        00:00:00 /usr/bin/perl 
> /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs 
> -f /tmp/gram_SPsdme -c poll
>
>
> On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:
>
>> Can someone double check that the jobs are using PBS (and not FORK) 
>> in GRAM?  If you are using FORK, then the high load is being caused 
>> by the applications running on the GRAM host.  If it is PBS, then I 
>> don't know, others might have more insight.
>>
>> Ioan
>>
>> Ian Foster wrote:
>>> Hi,
>>>
>>> I've CCed Stuart Martin--I'd greatly appreciate some insights into 
>>> what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)?
>>>
>>> Ian.
>>>
>>> Michael Wilde wrote:
>>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>>
>>>> Hi. I'm at O'Hare and will be flying soon.
>>>> Ben or Mihael, if you are online, can you investigate?
>>>>
>>>> Yes, there are significant throttles turned on by default, and the 
>>>> system opens those very gradually.
>>>>
>>>> MikeK, can you post to the swift-devel list your swift.properties 
>>>> file, command line options, and your swift source code?
>>>>
>>>> Thanks,
>>>>
>>>> MikeW
>>>>
>>>>
>>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>>> The default walltime is 15 minutes. Are you doing fork jobs or pbs 
>>>>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought 
>>>>> there were throttles in place in Swift to prevent this type of 
>>>>> overrun? Mike K, I'll need you to either stop these types of jobs 
>>>>> until Mike W can verify throttling or only submit a few tens of 
>>>>> jobs at a time.
>>>>>
>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>
>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>> using Swift.
>>>>>>
>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> Actually, these numbers are now escalating...
>>>>>>>
>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>
>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>     479
>>>>>>>
>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>> GRAM Authentication test successful
>>>>>>> real    0m26.134s
>>>>>>> user    0m0.090s
>>>>>>> sys     0m0.010s
>>>>>>>
>>>>>>>
>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>
>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>
>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>> GRAM Authentication test successful
>>>>>>>> real    0m16.096s
>>>>>>>> user    0m0.060s
>>>>>>>> sys     0m0.020s
>>>>>>>>
>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>
>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>
>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>    380
>>>>>>>>
>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> joe.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ===================================================
>>>>>>>> joseph a. insley
>>>>>>>> insley at mcs.anl.gov <mailto:insley at mcs.anl.gov>
>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ===================================================
>>>>>>> joseph a. insley
>>>>>>> insley at mcs.anl.gov <mailto:insley at mcs.anl.gov>
>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu <mailto:Swift-devel at ci.uchicago.edu>
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>>>
>>
>> -- 
>> ==================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ==================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ==================================================
>> Email: iraicu at cs.uchicago.edu <mailto:iraicu at cs.uchicago.edu>
>> Web:   http://www.cs.uchicago.edu/~iraicu 
>> <http://www.cs.uchicago.edu/%7Eiraicu>
>> http://dev.globus.org/wiki/Incubator/Falkon
>> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
>> ==================================================
>> ==================================================
>>
>>
>
> ===================================================
> joseph a. insley
> insley at mcs.anl.gov <mailto:insley at mcs.anl.gov>
> mathematics & computer science division       (630) 252-5649
> argonne national laboratory                   (630) 252-5986 (fax)
>
>
>