[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Ti Leggett leggett at mcs.anl.gov
Wed Jan 30 06:26:03 CST 2008


As a site admin I would rather you ramp up than throttle down. Starting
high and working down means you could kill the machine many times before
you find the lower bound of what a site can handle. Starting slowly and
ramping up means you find that lower bound once. From my point of view, a
user who consistently kills the resource can be turned off to prevent
denial of service to all other users *until* they can prove they won't
kill it again. So I prefer the conservative approach.
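
For what it's worth, the knobs that control this on the Swift side live in
swift.properties. A deliberately conservative starting point might look
something like the sketch below; the key names are from the Swift user guide
of roughly this vintage, but the exact set and the defaults vary by release,
so treat the numbers as placeholders to tune per site, not recommendations:

  # limit how many job submissions Swift has in flight at once
  throttle.submit=4
  # limit concurrent submissions to any single site
  throttle.host.submit=2
  # limit how many concurrent jobs a site's score can translate into;
  # keeping this small keeps per-site load low even after the score grows
  throttle.score.job.factor=1
  # limits on concurrent file transfers and file operations
  throttle.transfers=4
  throttle.file.operations=8

As I understand it, Swift only raises a site's score as jobs succeed, so
small values here give exactly the start-low, ramp-up-slowly behaviour
described above.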

On Jan 29, 2008, at 9:35 PM, Mihael Hategan wrote:

> So I've been thinking about this...
> Our current throttling parameters are the result of long discussions and
> trials (pretty heated at times; the discussions, that is). Obviously
> they are not always appropriate. But that's not the problem. The problem,
> I think, is the lack of consensus on (and sometimes even the ability to
> articulate) what is ok and what isn't.
>
> Currently our process of determining this for a site is to maximize
> performance while avoiding failures (which may imply high utilization on
> both the client side and the service side), and to tone down when the
> site admins complain. I'm not sure how reasonable this is for our users.
>
> The other strategies I've seen are:
>
> 1. Condor: Make it slow but safe. This works as long as users don't have
> a frame of reference to judge how slow things are. My bosses don't seem
> to like this one (nor do I for that matter), but it is a decent
> strategy: users get their job done (albeit slowly) and sites don't
> complain much.
>
> 2. LEAD: Lobby every consequential body and push for the services to be
> made scalable enough to address the specific requirements of that
> project (as much as is possible given that LEAD does not have
> exclusivity). I've expressed my opinion on this one.
>
> So how do we figure out the metrics (e.g. how many total concurrent
> jobs, the rate of submissions, etc.) and how can we reach some consensus
> on the numbers? Can we build some automated system that would allow
> clients and services to negotiate such parameters?
>
> Mihael
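
One crude approximation of that negotiation can be done purely from the
client side, using the same probe Joe uses further down this thread: time an
authentication-only globusrun against the gatekeeper and hold off while it is
slow. The sketch below is only to illustrate the idea, not something Swift
does today; the host is the one from this thread, and the 15-second threshold
and 60-second wait are arbitrary:

  #!/bin/bash
  # Hold new submissions while the gatekeeper is slow to authenticate.
  # Illustration only: threshold and sleep interval are arbitrary.
  HOST=tg-grid.uc.teragrid.org
  THRESHOLD=15    # seconds of auth latency we are willing to tolerate
  while : ; do
      secs=$( { time -p globusrun -a -r "$HOST" >/dev/null 2>&1 ; } 2>&1 \
              | awk '/^real/ { print int($2) }' )
      [ "${secs:-999}" -lt "$THRESHOLD" ] && break
      echo "gatekeeper auth took ${secs:-?}s, waiting before submitting more" >&2
      sleep 60
  done
  # ...submit (or release) the next batch of jobs here...

A real version of this would belong in the client or in a feedback service,
but even this much would have noticed today's 16- and 26-second auth times
before the load average hit 149.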
>
> On Tue, 2008-01-29 at 14:06 -0600, Stuart Martin wrote:
>> This is the classic GRAM2 scaling issue: each job polls the LRM for
>> status.  condor-g does all sorts of things to make GRAM2 scale in that
>> scenario.  If Swift is not using condor-g and not doing the condor-g
>> tricks, then I'd recommend that Swift switch to using GRAM4.
>>
>> -Stu
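
For Swift, switching would mostly be a sites.xml change: point the execution
provider at the WS-GRAM (GT4) service instead of the pre-WS gatekeeper.
Roughly along these lines; the exact attribute names and URL form depend on
the Swift/CoG release, so this is a sketch to check against the sites.xml
documentation, not a drop-in entry:

  <!-- current (pre-WS GRAM / GT2) -->
  <execution provider="gt2" url="tg-grid.uc.teragrid.org/jobmanager-pbs"/>

  <!-- WS GRAM (GT4) equivalent, approximately -->
  <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org"/>

The rest of the site definition (gridftp, work directory, throttles) would
stay the same; only the submission path changes.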
>>
>> On Jan 29, 2008, at 1:57 PM, joseph insley wrote:
>>
>>> I was seeing Mike's jobs show up in the queue and run on the backend
>>> nodes, and the processes I was seeing on tg-grid appeared to be GRAM
>>> and not some other application, so it would seem that it was indeed
>>> using PBS.
>>>
>>> However, it appears to be using PRE-WS GRAM.... I still had some of
>>> the 'ps | grep kubal' output in my scrollback:
>>>
>>> insley at tg-grid1:~> ps -ef | grep kubal
>>> kubal    16981     1  0 16:41 ?        00:00:00 globus-job-manager -
>>> conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
>>> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
>>> kubal    18390     1  0 16:42 ?        00:00:00 globus-job-manager -
>>> conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
>>> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
>>> kubal    18891     1  0 16:43 ?        00:00:00 globus-job-manager -
>>> conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
>>> -rdn jobmanager-pbs -machine-type unknown -publish-jobs
>>> kubal    18917     1  0 16:43 ?
>>>
>>> [snip]
>>>
>>> kubal    28200 25985  0 16:50 ?        00:00:00 /usr/bin/perl /soft/
>>> prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /
>>> tmp/gram_iwEHrc -c poll
>>> kubal    28201 26954  1 16:50 ?        00:00:00 /usr/bin/perl /soft/
>>> prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /
>>> tmp/gram_lQaIPe -c poll
>>> kubal    28202 19438  1 16:50 ?        00:00:00 /usr/bin/perl /soft/
>>> prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /
>>> tmp/gram_SPsdme -c poll
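
That listing is consistent with the per-job polling Stu describes above: each
GRAM2 job keeps a globus-job-manager alive, and each of those periodically
forks a perl script to poll PBS. A quick way to see how much of the process
count is just polling machinery (generic ps/grep commands, run on tg-grid as
in the listings in this thread):

  # long-lived job managers: roughly one per active GRAM2 job
  ps -ef | grep kubal | grep -c 'globus-job-manager -conf'
  # poll scripts in flight right now: one perl process per status poll
  ps -ef | grep kubal | grep -c 'globus-job-manager-script.pl.*-c poll'

Comparing those counts against the 380 and 479 totals quoted further down
gives a sense of how much of the load is GRAM2 bookkeeping rather than the
applications themselves.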
>>>
>>>
>>> On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:
>>>
>>>> Can someone double check that the jobs are using PBS (and not FORK)
>>>> in GRAM?  If you are using FORK, then the high load is being caused
>>>> by the applications running on the GRAM host.  If it is PBS, then I
>>>> don't know, others might have more insight.
>>>>
>>>> Ioan
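
Two quick checks settle that (which is essentially what Joe did above): the
job-manager type is visible in the process arguments on the GRAM host, and
genuine PBS jobs also show up in the batch queue. These are generic commands,
nothing Swift-specific:

  # on tg-grid: count GRAM job managers by the -type they were started with
  ps -ef | grep kubal | grep -o ' -type [a-z]*' | sort | uniq -c
  # and confirm the jobs are actually queued/running under PBS
  qstat -u kubal

In this case the scrollback above already shows '-type pbs', so fork
submission does not look like the culprit.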
>>>>
>>>> Ian Foster wrote:
>>>>> Hi,
>>>>>
>>>>> I've CCed Stuart Martin--I'd greatly appreciate some insights into
>>>>> what is causing this. I assume that you are using GRAM4 (aka WS-
>>>>> GRAM)?
>>>>>
>>>>> Ian.
>>>>>
>>>>> Michael Wilde wrote:
>>>>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>>>>
>>>>>> Hi. I'm at O'Hare and will be flying soon.
>>>>>> Ben or Mihael, if you are online, can you investigate?
>>>>>>
>>>>>> Yes, there are significant throttles turned on by default, and
>>>>>> the system opens those very gradually.
>>>>>>
>>>>>> MikeK, can you post to the swift-devel list your swift.properties
>>>>>> file, command line options, and your swift source code?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> MikeW
>>>>>>
>>>>>>
>>>>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>>>>> The default walltime is 15 minutes. Are you doing fork jobs or
>>>>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I
>>>>>>> thought there were throttles in place in Swift to prevent this
>>>>>>> type of overrun? Mike K, I'll need you either to stop these
>>>>>>> types of jobs until Mike W can verify the throttling, or to
>>>>>>> submit only a few tens of jobs at a time.
>>>>>>>
>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>>>>
>>>>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>>>>> using Swift.
>>>>>>>>
>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>
>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>
>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>    479
>>>>>>>>>
>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>> GRAM Authentication test successful
>>>>>>>>> real    0m26.134s
>>>>>>>>> user    0m0.090s
>>>>>>>>> sys     0m0.010s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>
>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>>
>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>> real    0m16.096s
>>>>>>>>>> user    0m0.060s
>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>
>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>
>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>
>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>   380
>>>>>>>>>>
>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> joe.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>> -- 
>>>> ==================================================
>>>> Ioan Raicu
>>>> Ph.D. Candidate
>>>> ==================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ==================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
>>>> ==================================================
>>>> ==================================================
>>>>
>>>>
>>>
>>> ===================================================
>>> joseph a. insley                              insley at mcs.anl.gov
>>> mathematics & computer science division       (630) 252-5649
>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>
>>>
>>
>



