[Swift-devel] Support request: Swift jobs flooding uc-teragrid?
joseph insley
insley at mcs.anl.gov
Tue Jan 29 13:57:40 CST 2008
I was seeing Mike's jobs show up in the queue and run on the
backend nodes, and the processes I was seeing on tg-grid appeared to
be GRAM job managers and not some other application, so it would seem
that it was indeed using PBS.
However, it appears to be using pre-WS GRAM. I still had some of
the 'ps -ef | grep kubal' output in my scrollback:
insley@tg-grid1:~> ps -ef | grep kubal
kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18917 1 0 16:43 ?
[snip]
kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_iwEHrc -c poll
kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_lQaIPe -c poll
kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_SPsdme -c poll
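
For anyone who wants to repeat this check without eyeballing the listing, here is a rough Python sketch that tallies one user's processes from saved 'ps -ef' output by kind and scheduler type. The function name and result keys are made up for illustration. It assumes the job manager reports its scheduler with '-type' and the Perl helper with '-m', as in the output above, and it does not try to tell pre-WS from WS GRAM (here that is only visible from the site-specific /soft/prews-gram-4.0.1-r3 install path).

```python
import re
from collections import Counter

# Sample lines copied from the ps output above (unwrapped).
SAMPLE_PS = """\
kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
kubal 18917 1 0 16:43 ?
kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_iwEHrc -c poll
kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_lQaIPe -c poll
kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_SPsdme -c poll
"""


def classify_gram_processes(ps_output, user):
    """Count a user's processes by kind ("manager", "script", "other").

    Hypothetical helper: the "kind:scheduler" keys are an illustration,
    not anything GRAM itself reports.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # The first column of `ps -ef` is the owner; skip everyone else
        # (this also drops the `grep` process of whoever ran the check).
        if not fields or fields[0] != user:
            continue
        if "globus-job-manager-script.pl" in line:
            # Perl helper: the scheduler follows a standalone "-m".
            match = re.search(r"\s-m\s+(\S+)", line)
            counts["script:" + (match.group(1) if match else "unknown")] += 1
        elif "globus-job-manager" in line:
            # Job manager: the scheduler follows a standalone "-type"
            # (the leading \s avoids matching inside "-machine-type").
            match = re.search(r"\s-type\s+(\S+)", line)
            counts["manager:" + (match.group(1) if match else "unknown")] += 1
        else:
            # Truncated or unrelated lines end up here.
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    for kind, count in sorted(classify_gram_processes(SAMPLE_PS, "kubal").items()):
        print(kind, count)
```

A fork job manager would show up under a "manager:fork" key instead, which is the case Ioan was asking about.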
On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:
> Can someone double check that the jobs are using PBS (and not FORK)
> in GRAM? If you are using FORK, then the high load is being caused
> by the applications running on the GRAM host. If it is PBS, then I
> don't know, others might have more insight.
>
> Ioan
>
> Ian Foster wrote:
>> Hi,
>>
>> I've CCed Stuart Martin--I'd greatly appreciate some insights into
>> what is causing this. I assume that you are using GRAM4 (aka
>> WS-GRAM)?
>>
>> Ian.
>>
>> Michael Wilde wrote:
>>> [ was Re: Swift jobs on UC/ANL TG ]
>>>
>>> Hi. I'm at O'Hare and will be flying soon.
>>> Ben or Mihael, if you are online, can you investigate?
>>>
>>> Yes, there are significant throttles turned on by default, and
>>> the system opens those very gradually.
>>>
>>> MikeK, can you post to the swift-devel list your swift.properties
>>> file, command line options, and your swift source code?
>>>
>>> Thanks,
>>>
>>> MikeW
>>>
>>>
>>> On 1/29/08 8:11 AM, Ti Leggett wrote:
>>>> The default walltime is 15 minutes. Are you doing fork jobs or
>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I
>>>> thought there were throttles in place in Swift to prevent this
>>>> type of overrun? Mike K, I'll need you to either stop these
>>>> types of jobs until Mike W can verify throttling or only submit
>>>> a few 10s of jobs at a time.
>>>>
>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote:
>>>>
>>>>> Yes, I'm submitting molecular dynamics simulations
>>>>> using Swift.
>>>>>
>>>>> Is there a default wall-time limit for jobs on tg-uc?
>>>>>
>>>>>
>>>>>
>>>>> --- joseph insley <insley at mcs.anl.gov> wrote:
>>>>>
>>>>>> Actually, these numbers are now escalating...
>>>>>>
>>>>>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
>>>>>>
>>>>>> insley@tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>> 479
>>>>>>
>>>>>> insley@tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>> GRAM Authentication test successful
>>>>>> real 0m26.134s
>>>>>> user 0m0.090s
>>>>>> sys 0m0.010s
>>>>>>
>>>>>>
>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>>>>>>
>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
>>>>>>> became unresponsive and had to be rebooted. I am now seeing slow
>>>>>>> response times from the Gatekeeper there again. Authenticating to
>>>>>>> the gatekeeper should only take a second or two, but it is
>>>>>>> periodically taking up to 16 seconds:
>>>>>>>
>>>>>>> insley@tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
>>>>>>> GRAM Authentication test successful
>>>>>>> real 0m16.096s
>>>>>>> user 0m0.060s
>>>>>>> sys 0m0.020s
>>>>>>>
>>>>>>> looking at the load on tg-grid, it is rather high:
>>>>>>>
>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
>>>>>>>
>>>>>>> And there appear to be a large number of processes owned by kubal:
>>>>>>> insley@tg-grid1:~> ps -ef | grep kubal | wc -l
>>>>>>> 380
>>>>>>>
>>>>>>> I assume that Mike is using swift to do the job submission. Is
>>>>>>> there some throttling of the rate at which jobs are submitted to
>>>>>>> the gatekeeper that could be done that would lighten this load
>>>>>>> some? (Or has that already been done since earlier today?) The
>>>>>>> current response times are not unacceptable, but I'm hoping to
>>>>>>> avoid having the machine grind to a halt as it did earlier today.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> joe.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ===================================================
>>>>>>> joseph a. insley                         insley at mcs.anl.gov
>>>>>>> mathematics & computer science division  (630) 252-5649
>>>>>>> argonne national laboratory              (630) 252-5986 (fax)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>
> --
> ==================================================
> Ioan Raicu
> Ph.D. Candidate
> ==================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ==================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> ==================================================
> ==================================================
>
>
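
To make the throttling question in the quoted thread a bit more concrete, here is a minimal Python sketch of a submission throttle that opens gradually. It is not Swift's actual throttle (I have not looked at how that is implemented); the class, method names, and limits are all hypothetical. It starts with a small cap on concurrent submissions, raises the cap by one for each job that finishes cleanly, and halves it after a failure.

```python
class SlowStartThrottle:
    """Illustrative cap on in-flight job submissions (hypothetical, not Swift's code)."""

    def __init__(self, initial_limit=2, max_limit=20):
        self.limit = initial_limit      # current cap on concurrent submissions
        self.max_limit = max_limit      # the cap never grows past this
        self.in_flight = 0              # submissions not yet finished

    def try_submit(self):
        """Return True and reserve a slot if we are under the current cap."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def job_finished(self, ok=True):
        """Release a slot; open the throttle on success, back off on failure."""
        self.in_flight -= 1
        if ok:
            self.limit = min(self.max_limit, self.limit + 1)
        else:
            self.limit = max(1, self.limit // 2)


# Toy run: 30 jobs, one completion per step, every job succeeds.
throttle = SlowStartThrottle(initial_limit=2, max_limit=8)
pending = 30
peak = 0
while pending or throttle.in_flight:
    # Submit as many jobs as the current cap allows.
    while pending and throttle.try_submit():
        pending -= 1
    peak = max(peak, throttle.in_flight)
    throttle.job_finished(ok=True)
print("peak concurrent submissions:", peak)
```

With something along these lines the gatekeeper would never see more than max_limit job managers from one client at a time, whatever the size of the workflow.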
===================================================
joseph a. insley                         insley at mcs.anl.gov
mathematics & computer science division  (630) 252-5649
argonne national laboratory              (630) 252-5986 (fax)