[Swift-devel] hprof profiling of coaster services
Michael Wilde
wilde at mcs.anl.gov
Fri Jun 26 07:43:00 CDT 2009
Allan,
It's not clear to me that you have really reproduced the problem yet.
My understanding was that Ranger sysadmins observed two processes, owned
by you and Zhao, burning 100% CPU on login3, which is also the GRAM
gatekeeper host. This was when you were running "mock BLAST" and Zhao
was running AMPL. I know that the AMPL tasks were running 1-2 hours, so
one would expect negligible overhead from the coaster service in that
case. Your BLAST tasks were, I thought, about 60 seconds in length, for
which I would also think that the coaster service would have low overhead.
So I don't think that running 2000 "touch" processes is a good way to
reproduce the problem. If you don't see high coaster service overhead
with something like "sleep 60" (or, better yet, just use your same
mock-BLAST run), then I think the problem has not yet been reproduced.
The original thread where the Ranger sysadmin complained is below.
Looking back at it, do we have clear evidence that it was really coaster
services, rather than, say, GRAM jobmanagers, that were causing the load?
Is it possible that coaster settings caused too many GRAM jobs to be run?
I think we should do this:
- review the evidence to see if high coaster service CPU % was really
observed
- if so, run the BLAST test elsewhere, and see if it causes such overhead
- run several tests with large numbers of sleep jobs, and record the
coaster service CPU utilization. You could have the coaster service log
its own CPU and memory utilization to the logfile every 60 seconds, say
(a rough sketch of what that logging could look like is below). You
could overload the number of coasters per host (say, to 16 or higher,
since they are sleep jobs), so you can readily do this on teraport.
From such a plot, we could more scientifically see if we have a coaster
service overhead problem or not.
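
To be concrete about the self-reported utilization idea: here is a
minimal sketch of such a logging thread, using only the standard JMX
beans and the log4j logger the service already has on its classpath.
The class and category names are illustrative, not existing coaster
code, and it assumes a Sun JVM for the per-process CPU counter:

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryMXBean;
  import java.lang.management.MemoryUsage;
  import org.apache.log4j.Logger;

  // Sketch only: a daemon thread that logs the JVM's own CPU and heap
  // usage once a minute, so the numbers end up in the coaster log.
  public class UtilizationLogger extends Thread {
      private static final Logger logger =
          Logger.getLogger("coaster.utilization");
      private static final long INTERVAL = 60 * 1000;

      public static void startLogging() {
          UtilizationLogger t = new UtilizationLogger();
          t.setDaemon(true);
          t.start();
      }

      public void run() {
          // Sun-JVM-specific bean; gives this process' total CPU time in ns
          com.sun.management.OperatingSystemMXBean os =
              (com.sun.management.OperatingSystemMXBean)
                  ManagementFactory.getOperatingSystemMXBean();
          MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
          long lastCpu = os.getProcessCpuTime();
          long lastWall = System.currentTimeMillis();
          while (true) {
              try {
                  Thread.sleep(INTERVAL);
              }
              catch (InterruptedException e) {
                  return;
              }
              long cpu = os.getProcessCpuTime();
              long wall = System.currentTimeMillis();
              // percent of one CPU used by this JVM over the last interval
              double pct = 100.0 * (cpu - lastCpu)
                  / ((wall - lastWall) * 1000000.0);
              MemoryUsage heap = mem.getHeapMemoryUsage();
              logger.warn("cpu=" + pct + "% heapUsedMB="
                  + (heap.getUsed() >> 20) + " heapMaxMB="
                  + (heap.getMax() >> 20));
              lastCpu = cpu;
              lastWall = wall;
          }
      }
  }

Call startLogging() once when the service boots, grep the "cpu=" lines
out of the log afterwards, and the plot is essentially free.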
In other words, the Ranger sysadmin did not say "your coaster process is
consuming CPU"; he just said that your jobs were making the login3 host
slow for other users.
I think the only evidence that points to coasters is your observations,
Allan, when you and Zhao were running. Is it possible that one of you
was running the Swift command on login3 (in addition to its role as a
gatekeeper host)?
Let's go back and review this carefully: this sysadmin complaint has
essentially shut down production coaster usage, which is bad, and we
need to determine the real cause of the situation that led to the
complaint.
- Mike
On 6/24/09 6:30 PM, Mihael Hategan wrote:
> Try the following:
>
> In the source tree, edit
> cog/modules/provider-coaster/resources/log4j.properties and change the
> INFO categories to WARN.
>
> Then re-compile and see if the usage is still high.
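
(For reference, that file uses standard log4j 1.x syntax, so the change
amounts to lowering the level on the coaster categories, along these
lines; the actual category names in the file may differ:

  # in cog/modules/provider-coaster/resources/log4j.properties
  # change entries like
  log4j.category.<coaster.package>=INFO
  # to
  log4j.category.<coaster.package>=WARN
)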
>
> On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
>> ok this looks like a good replicable case.
>>
>> the workflow is 2000 invocations of touch using 066-many.swift
>>
>> run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
>> utilization averages to 99-100% . i ran this for five trials and got
>> the same results.
>>
>> run05.tar.gz - same run with profiling information. in java.hprof.bin
>>
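(For anyone wanting to reproduce that kind of dump: java.hprof.bin is
output from the JVM's built-in HPROF agent, so starting the service JVM
with roughly the following option should give a comparable file; the
exact options Allan used aren't in the logs, so treat this as an
illustration:

  java -agentlib:hprof=cpu=samples,depth=10,file=java.hprof.bin ...

cpu=samples keeps the profiling overhead low, which matters here since
we are trying to observe the high CPU usage rather than perturb it
away.)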
>> 2009/6/19 Mihael Hategan <hategan at mcs.anl.gov>:
>>> On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
>>>> here's a script recording without profiling of top:
>>>>
>>>> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>>>>
>>>> it still consumes some cpu. but does not spike to 200% utilization.
>>> Right, 10% != 200%.
>>>
>>> Now, as you're probably already guessing, I would need:
>>> 1. a situation (workflow/site/etc.) in which the usage does go crazy
>>> without the profiler (as in what triggered you getting kicked off
>>> ranger); repeatable
>>> 2. a profiler dump of a run in such a situation
>>>
>>> Btw, the ranger issue, where was swift running?
>> on communicado.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel