[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Ioan Raicu iraicu at cs.uchicago.edu
Tue Jan 29 21:44:10 CST 2008


Here is a paper from TG07 that compares GRAM2 with GRAM4.  The 
conclusions of the paper are (copied and pasted from the paper at 
http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison.pdf):

    * GRAM4 provides vastly better functionality than GRAM2, in numerous
      respects.
    * GRAM4 provides better scalability than GRAM2, in terms of the
      number of concurrent jobs that can be supported. It also greatly
      reduces load on service nodes, and permits management of that load.
    * GRAM4 performance is roughly comparable to that of GRAM2. (We
      still need to improve sequential submission and file staging
      performance, and we have plans for doing that, and also for other
      performance optimizations.)

You can draw your own conclusions once you read the paper.  I also bet 
Stu has more numbers than were reported in this paper.  From what I 
heard, GRAM2 will be optional in GT4.2, and will be phased out 
completely in GT4.4, so the upgrade to GRAM4 is inevitable.

Ioan


Mihael Hategan wrote:
> I'm becoming confused now. Last time I spoke to Yong about WS-GRAM, it
> was less scalable and slower (although that varied) than gt2 gram.
>
> So unless I see some numbers, I personally won't believe either of the
> statements.
>
> On Tue, 2008-01-29 at 21:25 -0600, Ioan Raicu wrote:
>   
>> Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4) 
>> on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM.  
>> If I am not mistaken, all TG sites support WS-GRAM.
>>
>> Ioan
>>
>> Michael Wilde wrote:
>>     
>>> MikeK, this may be obvious but just in case:
>>>
>>> On 1/29/08 8:47 PM, Mihael Hategan wrote:
>>>> That and/or try using ws-gram:
>>>> <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org" major="4" minor="0" patch="0"/>
>>> (this goes in the sites.xml file)
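
For reference, a full sites.xml pool entry built around that jobmanager
element might look roughly like the sketch below.  The pool handle,
gridftp host, and work directory are illustrative placeholders (not
MikeK's actual values), and the surrounding elements are from memory of
the sites.xml schema in the user guide, so check the attribute names there:

  <pool handle="UC-ANL-TG">
    <!-- gridftp host and workdirectory are placeholders, not real settings -->
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"/>
    <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org"
                major="4" minor="0" patch="0"/>
    <workdirectory>/home/kubal/swiftwork</workdirectory>
  </pool>
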
>>>
>>> Q for the group: is ws-gram supported on uc.teragrid?
>>>
>>>> On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
>>>>> You may want to try to lower throttle.score.job.factor from 4 to 1.
>>>>> That will cap the number of jobs at ~100 instead of ~400.
>>>>>
>>>>> Mihael
>>>
>>> for info on setting Swift properties, see "Swift Engine Configuration" 
>>> in the users guide at:
>>>
>>> http://www.ci.uchicago.edu/swift/guides/userguide.php#properties
>>>
>>> - MikeW
>>>
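
Concretely, Mihael's suggestion is a one-line change in swift.properties;
the comments below just restate the numbers from his message:

  # Lower the score-based job throttle from its default of 4 to 1.
  # At 4, Swift ramps up to roughly 400 concurrent jobs; at 1 it
  # caps out at roughly 100.
  throttle.score.job.factor=1
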
>>>>> On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
>>>>>> sorry, long day : )
>>>>>>
>>>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
>>>>>>>> MikeK, no attachment.
>>>>>>>>
>>>>>>>> Ive narrowed the cc list, and need to read back through the email
>>>>>>>> thread on this to see what Mihael observed.
>>>>>>>
>>>>>>> Let me summarize: too many gt2 gram jobs running concurrently = too
>>>>>>> many job manager processes = high load on gram node. Not a new issue.
>>>>>>>
>>>>>>>> - MikeW
>>>>>>>>
>>>>>>>> On 1/29/08 8:00 PM, Mike Kubal wrote:
>>>>>>>>> The attachment contains the swift script, tc file,
>>>>>>>>> sites file and swift.properties file.
>>>>>>>>>
>>>>>>>>> I didn't provide any additional command line arguments.
>>>>>>>>>
>>>>>>>>> MikeK
>>>>>>>>>
>>>>>>>>> --- Michael Wilde <wilde at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>>>>>>>>
>>>>>>>>>> Hi. Im at OHare and will be flying soon.
>>>>>>>>>> Ben or Mihael, if you are online, can you investigate?
>>>>>>>>>>
>>>>>>>>>> Yes, there are significant throttles turned on by default, and
>>>>>>>>>> the system opens those very gradually.
>>>>>>>>>>
>>>>>>>>>> MikeK, can you post to the swift-devel list your swift.properties
>>>>>>>>>> file, command line options, and your swift source code?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> MikeW
>>>>>>>>>>
>>>>>>>>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>>>>>>>>> The default walltime is 15 minutes. Are you doing fork jobs or
>>>>>>>>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I
>>>>>>>>>>> thought there were throttles in place in Swift to prevent this
>>>>>>>>>>> type of overrun? Mike K, I'll need you to either stop these types
>>>>>>>>>>> of jobs until Mike W can verify throttling or only submit a few
>>>>>>>>>>> 10s of jobs at a time.
>>>>>>>>>>>
>>>>>>>>>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
>>>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>>>>
>>>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>>>>
>>>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>>
>>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>>     479
>>>>>>>>>>>>>
>>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>>> real    0m26.134s
>>>>>>>>>>>>> user    0m0.090s
>>>>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>>>    380
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> joe.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ===================================================
>>>>>>>>>>>>>> joseph a. insley                       insley at mcs.anl.gov
>>>>>>>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ===================================================
>>>>>>>>>>>>> joseph a. insley                       insley at mcs.anl.gov
>>>>>>>>>>>>> mathematics & computer science division       (630) 252-5649
>>>>>>>>>>>>> argonne national laboratory                   (630) 252-5986 (fax)
>>>>>> === message truncated ===
>
>
>   

-- 
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================



