[Swift-devel] Support request: Swift jobs flooding uc-teragrid?
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Jan 29 13:38:03 CST 2008
Can someone double check that the jobs are using PBS (and not FORK) in
GRAM? If you are using FORK, then the high load is being caused by the
applications running on the GRAM host. If it is PBS, then I don't know,
others might have more insight.
Ioan
Ian Foster wrote:
> Hi,
>
> I've CCed Stuart Martin--I'd greatly appreciate some insights into
> what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)?
>
> Ian.
>
> Michael Wilde wrote:
>> [ was Re: Swift jobs on UC/ANL TG ]
>>
>> Hi. I'm at O'Hare and will be flying soon.
>> Ben or Mihael, if you are online, can you investigate?
>>
>> Yes, there are significant throttles turned on by default, and the
>> system opens those very gradually.
>>
>> MikeK, can you post to the swift-devel list your swift.properties
>> file, command line options, and your swift source code?
>>
>> Thanks,
>>
>> MikeW
>>
>>
>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>> The default walltime is 15 minutes. Are you doing fork jobs or pbs
>>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought
>>> there were throttles in place in Swift to prevent this type of
>>> overrun? Mike K, I'll need you to either stop these types of jobs
>>> until Mike W can verify throttling or only submit a few 10s of jobs
>>> at a time.
>>>
>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
>>>
>>>> Yes, I'm submitting molecular dynamics simulations
>>>> using Swift.
>>>>
>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>
>>>>
>>>>
>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>
>>>>> Actually, these numbers are now escalating...
>>>>>
>>>>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
>>>>>
>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>> 479
>>>>>
>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>> GRAM Authentication test successful
>>>>> real 0m26.134s
>>>>> user 0m0.090s
>>>>> sys 0m0.010s
>>>>>
>>>>>
>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>
>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>> became unresponsive and had to be rebooted. I am now seeing slow
>>>>>> response times from the Gatekeeper there again. Authenticating to
>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>> periodically taking up to 16 seconds:
>>>>>>
>>>>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>> GRAM Authentication test successful
>>>>>> real 0m16.096s
>>>>>> user 0m0.060s
>>>>>> sys 0m0.020s
>>>>>>
>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>
>>>>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
>>>>>>
>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>> 380
>>>>>>
>>>>>> I assume that Mike is using swift to do the job submission. Is
>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>> some? (Or has that already been done since earlier today?) The
>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>
>>>>>> Thanks,
>>>>>> joe.
>>>>>>
>>>>>>
>>>>>>
>>>>>> ===================================================
>>>>>> joseph a. insley
>>>>>> insley at mcs.anl.gov
>>>>>> mathematics & computer science division (630) 252-5649
>>>>>> argonne national laboratory (630) 252-5986 (fax)
>>>>>>
>>>>>>
>>>>>
>>>>> ===================================================
>>>>> joseph a. insley
>>>>> insley at mcs.anl.gov
>>>>> mathematics & computer science division (630) 252-5649
>>>>> argonne national laboratory (630) 252-5986 (fax)
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>
--
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================