[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Ioan Raicu iraicu at cs.uchicago.edu
Tue Jan 29 21:25:49 CST 2008


Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4) 
on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM.  
If I am not mistaken, all TG sites support WS-GRAM.
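
If you want a quick sanity check that WS-GRAM is up before pointing Swift
at it, something like the following from a TG login node should do
(assuming the GT4 client tools are on your path; the Fork factory here is
only used for the test):

  globusrun-ws -submit -s -F tg-grid1.uc.teragrid.org -Ft Fork -c /bin/date

That is a rough GRAM4 counterpart of the "globusrun -a -r" test quoted
below.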

Ioan

Michael Wilde wrote:
> MikeK, this may be obvious but just in case:
>
> On 1/29/08 8:47 PM, Mihael Hategan wrote:
>> That and/or try using ws-gram:
>> <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org" major="4"
>> minor="0" patch="0"/>
>
> (this goes in the sites.xml file)
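>
> For context, a full pool entry for that would look roughly like the
> following (the handle, GridFTP host, and work directory are placeholders,
> not MikeK's actual values; check the user guide for the exact schema):
>
>   <pool handle="uc-teragrid">
>     <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"/>
>     <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org"
>      major="4" minor="0" patch="0"/>
>     <workdirectory>/home/username/swiftwork</workdirectory>
>   </pool>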
>
> Q for the group: is ws-gram supported on uc.teragrid?
>
>>
>>
>> On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
>>> You may want to try to lower throttle.score.job.factor from 4 to 1.
>>> That will cap the number of jobs at ~100 instead of ~400.
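>>>
>>> In swift.properties that is a one-line change, along the lines of:
>>>
>>>   # cap concurrent jobs per site at roughly 100
>>>   # (the default factor of 4 allows roughly 400)
>>>   throttle.score.job.factor=1
>>>
>>> The other throttle.* settings are covered in the same properties
>>> section of the user guide that MikeW points to below.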
>>>
>>> Mihael
>
> for info on setting Swift properties, see "Swift Engine Configuration" 
> in the users guide at:
>
> http://www.ci.uchicago.edu/swift/guides/userguide.php#properties
>
> - MikeW
>
>>>
>>> On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
>>>> sorry, long day : )
>>>>
>>>>
>>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>>
>>>>> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
>>>>>> MikeK, no attachment.
>>>>>>
>>>>>> I've narrowed the cc list, and need to read back through the email
>>>>>> thread on this to see what Mihael observed.
>>>>> Let me summarize: too many gt2 gram jobs running concurrently = too
>>>>> many job manager processes = high load on gram node. Not a new issue.
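>>>>>
>>>>> A quick way to gauge that on the gram node is to count the jobmanager
>>>>> processes, something like:
>>>>>
>>>>>   ps -ef | grep globus-job-manager | grep -v grep | wc -l
>>>>>
>>>>> (assuming the gt2 jobmanagers show up under the usual
>>>>> globus-job-manager process name there).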
>>>>>
>>>>>> - MikeW
>>>>>>
>>>>>> On 1/29/08 8:00 PM, Mike Kubal wrote:
>>>>>>> The attachment contains the swift script, tc file, sites file,
>>>>>>> and swift.properties file.
>>>>>>>
>>>>>>> I didn't provide any additional command line
>>>>>>> arguments.
>>>>>>>
>>>>>>> MikeK
>>>>>>>
>>>>>>>
>>>>>>> --- Michael Wilde <wilde at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>>>>>>
>>>>>>>> Hi. I'm at O'Hare and will be flying soon. Ben or Mihael, if you
>>>>>>>> are online, can you investigate?
>>>>>>>>
>>>>>>>> Yes, there are significant throttles turned on by default, and the
>>>>>>>> system opens those very gradually.
>>>>>>>>
>>>>>>>> MikeK, can you post to the swift-devel list your swift.properties
>>>>>>>> file, command line options, and your swift source code?
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> MikeW
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>>>>>>> The default walltime is 15 minutes. Are you doing fork jobs or pbs
>>>>>>>>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought
>>>>>>>>> there were throttles in place in Swift to prevent this type of
>>>>>>>>> overrun? Mike K, I'll need you to either stop these types of jobs
>>>>>>>>> until Mike W can verify throttling or only submit a few 10s of
>>>>>>>>> jobs at a time.
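>>>>>>>>>
>>>>>>>>> (With pre-WS GRAM, making sure jobs go through PBS rather than the
>>>>>>>>> fork jobmanager means naming the batch jobmanager in the sites.xml
>>>>>>>>> contact, roughly:
>>>>>>>>>
>>>>>>>>>   <jobmanager universe="vanilla"
>>>>>>>>>    url="tg-grid1.uc.teragrid.org/jobmanager-pbs"
>>>>>>>>>    major="2" minor="4" patch="3"/>
>>>>>>>>>
>>>>>>>>> where the version numbers are only illustrative and the jobmanager
>>>>>>>>> suffix is whatever the site configures; without a suffix the
>>>>>>>>> request typically lands on the fork jobmanager.)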
>>>>>>>>> On Jan 28, 2008, at 07:13 PM, Mike Kubal wrote:
>>>>>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>>>>>
>>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>>
>>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>>>
>>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>     479
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>> real    0m26.134s
>>>>>>>>>>> user    0m0.090s
>>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing
>>>>>>>>>>>> slow response times from the Gatekeeper there again.
>>>>>>>>>>>> Authenticating to the gatekeeper should only take a second or
>>>>>>>>>>>> two, but it is periodically taking up to 16 seconds:
>>>>>>>>>>>>
>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>>
>>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>>
>>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>>
>>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>>
>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>>    380
>>>>>>>>>>>>
>>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> joe.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ===================================================
>>>>>>>>>>>> joseph a. insley                           insley at mcs.anl.gov
>>>>>>>>>>>> mathematics & computer science division    (630) 252-5649
>>>>>>>>>>>> argonne national laboratory                 (630) 252-5986 (fax)
>>>>>>>>>>>>
>>>>>>>>>>> ===================================================
>>>>>>>>>>> joseph a. insley                           insley at mcs.anl.gov
>>>>>>>>>>> mathematics & computer science division    (630) 252-5649
>>>>>>>>>>> argonne national laboratory                 (630) 252-5986 (fax)
>>>>>>>>>>
>>>>>>>>>>      
>>>> === message truncated ===
>>>>
>>>>
>>>>       
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================




