[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Michael Wilde wilde at mcs.anl.gov
Tue Jan 29 21:04:17 CST 2008


MikeK, this may be obvious but just in case:

On 1/29/08 8:47 PM, Mihael Hategan wrote:
> That and/or try using ws-gram:
> <jobmanager universe="vanilla" url="tg-grid1.uc.teragrid.org" major="4"
> minor="0" patch="0"/>

(this goes in the sites.xml file)

Q for the group: is ws-gram supported on uc.teragrid?

> 
> 
> On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
>> You may want to try to lower throttle.score.job.factor from 4 to 1. That
>> will cap the number of jobs at ~100 instead of ~400.
>>
>> Mihael

For info on setting Swift properties, see "Swift Engine Configuration"
in the user guide at:

http://www.ci.uchicago.edu/swift/guides/userguide.php#properties
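
A minimal sketch of the change Mihael suggests, assuming the property is set in your swift.properties file with the usual key=value syntax (the guide above is the authority on where that file lives and how it is read):

```properties
# swift.properties
# Lower the job throttle factor from 4 to 1, per Mihael's suggestion.
# Per his estimate this caps concurrent jobs at ~100 instead of ~400.
throttle.score.job.factor=1
```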

- MikeW

>>
>> On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
>>> sorry, long day : )
>>>
>>>
>>> --- Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>>
>>>> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
>>>>> MikeK, no attachment.
>>>>>
>>>>> I've narrowed the cc list, and need to read back through the email
>>>>> thread on this to see what Mihael observed.
>>>> Let me summarize: too many gt2 gram jobs running
>>>> concurrently = too many
>>>> job manager processes = high load on gram node. Not
>>>> a new issue.
>>>>
>>>>> - MikeW
>>>>>
>>>>> On 1/29/08 8:00 PM, Mike Kubal wrote:
>>>>>> The attachment contains the swift script, tc file,
>>>>>> sites file and swift.properties file.
>>>>>>
>>>>>> I didn't provide any additional command line
>>>>>> arguments.
>>>>>>
>>>>>> MikeK
>>>>>>
>>>>>>
>>>>>> --- Michael Wilde <wilde at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>>>>>
>>>>>>> Hi. I'm at O'Hare and will be flying soon.
>>>>>>> Ben or Mihael, if you are online, can you investigate?
>>>>>>>
>>>>>>> Yes, there are significant throttles turned on by default, and the
>>>>>>> system opens those very gradually.
>>>>>>>
>>>>>>> MikeK, can you post to the swift-devel list your swift.properties file,
>>>>>>> command line options, and your swift source code?
>>>>>>> Thanks,
>>>>>>>
>>>>>>> MikeW
>>>>>>>
>>>>>>>
>>>>>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>>>>>> The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs?
>>>>>>>> You shouldn't be doing fork jobs at all. Mike W, I thought there were
>>>>>>>> throttles in place in Swift to prevent this type of overrun? Mike K,
>>>>>>>> I'll need you to either stop these types of jobs until Mike W can verify
>>>>>>>> throttling or only submit a few 10s of jobs at a time.
>>>>>>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
>>>>>>>>> Yes, I'm submitting molecular dynamics simulations using Swift.
>>>>>>>>>
>>>>>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>>>>>
>>>>>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Actually, these numbers are now escalating...
>>>>>>>>>> top - 17:18:54 up  2:29,  1 user,  load average: 149.02, 123.63, 91.94
>>>>>>>>>> Tasks: 469 total,   4 running, 465 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>
>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>     479
>>>>>>>>>>
>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>> real    0m26.134s
>>>>>>>>>> user    0m0.090s
>>>>>>>>>> sys     0m0.010s
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>>>>>> became unresponsive and had to be rebooted.  I am now seeing slow
>>>>>>>>>>> response times from the Gatekeeper there again.  Authenticating to
>>>>>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>>>>>> periodically taking up to 16 seconds:
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>>>>>> GRAM Authentication test successful
>>>>>>>>>>> real    0m16.096s
>>>>>>>>>>> user    0m0.060s
>>>>>>>>>>> sys     0m0.020s
>>>>>>>>>>>
>>>>>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>>>>>
>>>>>>>>>>> top - 16:55:26 up  2:06,  1 user,  load average: 89.59, 78.69, 62.92
>>>>>>>>>>> Tasks: 398 total,  20 running, 378 sleeping,   0 stopped,   0 zombie
>>>>>>>>>>>
>>>>>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>>>>>>
>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>>>>>>    380
>>>>>>>>>>>
>>>>>>>>>>> I assume that Mike is using swift to do the job submission.  Is
>>>>>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>>>>>> some?  (Or has that already been done since earlier today?)  The
>>>>>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> joe.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ===================================================
>>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
>>>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> ===================================================
>>>>>>>>>> joseph a. insley                          insley at mcs.anl.gov
>>>>>>>>>> mathematics & computer science division   (630) 252-5649
>>>>>>>>>> argonne national laboratory               (630) 252-5986 (fax)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      
>>> === message truncated ===
>>>
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>


