[Swift-devel] hprof profiling of coaster services
Michael Wilde
wilde at mcs.anl.gov
Fri Jun 26 07:43:00 CDT 2009
Allan,
It's not clear to me that you have really reproduced the problem yet.
My understanding was that Ranger sysadmins observed two processes, owned
by you and Zhao, burning 100% CPU on login3, which is also the GRAM
gatekeeper host. This was when you were running "mock BLAST" and Zhao
was running AMPL. I know that the AMPL tasks were running 1-2 hours, so
one would expect negligible overhead from the coaster service in that
case. Your BLAST tasks were, I thought, about 60 seconds in length, for
which I would also think that the coaster service would have low overhead.
So I don't think that running 2000 "touch" processes is a good way to
reproduce the problem. If you don't see high coaster service overhead
with something like "sleep 60" (or, better yet, just use your same
mock-BLAST run), then I think the problem has not yet been reproduced.
The original thread where the Ranger sysadmin complained is below.
Looking back at it, do we have clear evidence that it was really coaster
services, rather than, say, GRAM jobmanagers, that were causing the load?
Is it possible that coaster settings caused too many GRAM jobs to be run?
I think we should do this:
- review the evidence to see if high coaster service CPU % was really
observed
- if so, run the BLAST test elsewhere, and see if it causes such overhead
- run several tests with large numbers of sleep jobs, and record the
coaster service CPU utilization. You could have the coaster service log
its own CPU and memory utilization to the logfile every 60 seconds, say
(a rough sketch of what that logging could look like is below). You
could overload the number of coasters per host (say, to 16 or higher,
since they are sleep jobs), so you can readily do this on teraport.
From such a plot, we could more scientifically see if we have a coaster
service overhead problem or not.
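
To be concrete about the self-reported utilization idea: here is a
minimal sketch of such a logging thread, using only the standard JMX
beans and the log4j logger the service already has on its classpath.
The class and category names are illustrative, not existing coaster
code, and it assumes a Sun JVM for the per-process CPU counter:

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryMXBean;
  import java.lang.management.MemoryUsage;
  import org.apache.log4j.Logger;

  // Sketch only: a daemon thread that logs the JVM's own CPU and heap
  // usage once a minute, so the numbers end up in the coaster log.
  public class UtilizationLogger extends Thread {
      private static final Logger logger =
          Logger.getLogger("coaster.utilization");
      private static final long INTERVAL = 60 * 1000;

      public static void startLogging() {
          UtilizationLogger t = new UtilizationLogger();
          t.setDaemon(true);
          t.start();
      }

      public void run() {
          // Sun-JVM-specific bean; gives this process' total CPU time in ns
          com.sun.management.OperatingSystemMXBean os =
              (com.sun.management.OperatingSystemMXBean)
                  ManagementFactory.getOperatingSystemMXBean();
          MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
          long lastCpu = os.getProcessCpuTime();
          long lastWall = System.currentTimeMillis();
          while (true) {
              try {
                  Thread.sleep(INTERVAL);
              }
              catch (InterruptedException e) {
                  return;
              }
              long cpu = os.getProcessCpuTime();
              long wall = System.currentTimeMillis();
              // percent of one CPU used by this JVM over the last interval
              double pct = 100.0 * (cpu - lastCpu)
                  / ((wall - lastWall) * 1000000.0);
              MemoryUsage heap = mem.getHeapMemoryUsage();
              logger.warn("cpu=" + pct + "% heapUsedMB="
                  + (heap.getUsed() >> 20) + " heapMaxMB="
                  + (heap.getMax() >> 20));
              lastCpu = cpu;
              lastWall = wall;
          }
      }
  }

Call startLogging() once when the service boots, grep the "cpu=" lines
out of the log afterwards, and the plot is essentially free.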
In other words, the Ranger sysadmin did not say "your coaster process is
consuming CPU"; he just said that your jobs were making the login3 host
slow for other users.
I think the only evidence that points to coasters is your observations,
Allan, when you and Zhao were running. Is it possible that one of you
was running the Swift command on login3 (in addition to its role as a
gatekeeper host)?
Let's go back and review this carefully: this sysadmin complaint has
essentially shut down production coaster usage, which is bad, and we
need to determine the real cause of the situation that led to the
complaint.
- Mike
On 6/24/09 6:30 PM, Mihael Hategan wrote:
> Try the following:
>
> In the source tree, edit
> cog/modules/provider-coaster/resources/log4j.properties and change the
> INFO categories to WARN.
>
> Then re-compile and see if the usage is still high.
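
(For reference, that file uses standard log4j 1.x syntax, so the change
amounts to lowering the level on the coaster categories, along these
lines; the actual category names in the file may differ:

  # in cog/modules/provider-coaster/resources/log4j.properties
  # change entries like
  log4j.category.<coaster.package>=INFO
  # to
  log4j.category.<coaster.package>=WARN
)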
>
> On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
>> ok this looks like a good replicable case.
>>
>> the workflow is 2000 invocations of touch using 066-many.swift
>>
>> run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
>> utilization averages to 99-100% . i ran this for five trials and got
>> the same results.
>>
>> run05.tar.gz - same run with profiling information. in java.hprof.bin
>>
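(For anyone wanting to reproduce that kind of dump: java.hprof.bin is
output from the JVM's built-in HPROF agent, so starting the service JVM
with roughly the following option should give a comparable file; the
exact options Allan used aren't in the logs, so treat this as an
illustration:

  java -agentlib:hprof=cpu=samples,depth=10,file=java.hprof.bin ...

cpu=samples keeps the profiling overhead low, which matters here since
we are trying to observe the high CPU usage rather than perturb it
away.)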
>> 2009/6/19 Mihael Hategan <hategan at mcs.anl.gov>:
>>> On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
>>>> here's a script recording without profiling of top:
>>>>
>>>> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>>>>
>>>> it still consumes some cpu. but does not spike to 200% utilization.
>>> Right, 10% != 200%.
>>>
>>> Now, as you're probably already guessing, I would need:
>>> 1. a situation (workflow/site/etc.) in which the usage does go crazy
>>> without the profiler (as in what triggered you getting kicked off
>>> ranger); repeatable
>>> 2. a profiler dump of a run in such a situation
>>>
>>> Btw, the ranger issue, where was swift running?
>> on communicado.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel