[Swift-devel] hprof profiling of coaster services

Fri Jun 26 10:55:44 CDT 2009

Below is the response from the Ranger Sysadmin.

Allan, can you both resume running Zhao's AMPL script on Ranger? I would 
do this:

- watch the AMPL script closely, see what the overhead is for both the 
Swift JVM and the coaster service JVM (I'd watch CPU and memory, using 
"top" if that's easiest, although a tool that gives a plot of the 
resources over time would be nice; I suspect such tools exist). You 
could sample /proc and plot if nothing else.

- if you don't see the problem, let's simply keep an eye on things.

- if you do see the problem, I suggest we go back to measuring on a less 
visible machine, like teraport.
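
For the "sample /proc and plot" option, here is a minimal sketch, assuming 
Linux and Python 3. The function names, the 5-second interval and the output 
format are my own placeholders, not anything in Swift or the coaster code. 
It reads CPU ticks from /proc/<pid>/stat and resident memory from 
/proc/<pid>/status, and writes one line per sample.

```python
import os
import sys
import time


def cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may contain spaces, so only
    # split the part that follows the last ')'.
    fields = stat_text[stat_text.rindex(")") + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def cpu_percent(prev_ticks, cur_ticks, interval_s, ticks_per_s):
    """CPU used over one interval, as a percentage of a single core."""
    return 100.0 * (cur_ticks - prev_ticks) / (ticks_per_s * interval_s)


def sample(pid, interval_s=5.0, out=sys.stdout):
    """Write 'epoch_seconds cpu_percent rss_kb' lines until the process exits."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    prev = None
    while True:
        try:
            with open(f"/proc/{pid}/stat") as f:
                cur = cpu_ticks(f.read())
            with open(f"/proc/{pid}/status") as f:
                mem = rss_kb(f.read())
        except OSError:
            break  # the process has gone away
        if prev is not None:
            pct = cpu_percent(prev, cur, interval_s, ticks_per_s)
            out.write(f"{int(time.time())} {pct:.1f} {mem}\n")
            out.flush()
        prev = cur
        time.sleep(interval_s)


# Hypothetical usage: python sample_proc.py <pid> > coaster.log
if __name__ == "__main__" and len(sys.argv) > 1:
    sample(int(sys.argv[1]))
```

Run one copy against the Swift JVM pid and one against the coaster service 
pid. The resulting columns can be plotted with gnuplot or any similar tool.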

I'm also curious as to what the overhead for logging is, and whether to 
start back on Ranger using the original log settings or the log level 
reduced to WARN as Mihael suggested.

It is certainly possible, as John suggests, that the excessive load occurred 
when the coaster server encountered a problem, rather than during normal 
operation. That would be very visible in a plot of its CPU usage over 
time - we'd see it spike up after running OK for a while.
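
If eyeballing the plot is not enough, the same check can be sketched in a 
few lines of Python. This assumes a list of per-sample CPU percentages for 
the coaster service; the 90% threshold and the run length of 3 samples are 
arbitrary guesses, not measured values.

```python
def first_sustained_spike(cpu_percents, threshold=90.0, min_run=3):
    """Return the index where CPU first stays at or above `threshold` for
    `min_run` consecutive samples, or None if that never happens."""
    run_start = None
    for i, value in enumerate(cpu_percents):
        if value >= threshold:
            if run_start is None:
                run_start = i  # a possible spike begins here
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None  # a brief burst ended; start over
    return None
```

A single busy sample is ignored; only a run of busy samples counts, which 
is the "spinning out of control" pattern John describes.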

Other/better suggestions welcome.

- Mike

-------- Original Message --------
Subject: Re: Ranger @ TACC - Jobs Running On Head Node creating heavy load
Date: Fri, 26 Jun 2009 11:41:03 -0400
From: John Lockman <jlockman at tacc.utexas.edu>
To: Michael Wilde <wilde at mcs.anl.gov>
References: <1245269425.13629.23.camel at lockman-d630.tacc.utexas.edu>	 
<4A44DD70.2080207 at mcs.anl.gov>

Mike,

It is still unclear as to why your java processes were chewing up so
much time and CPU resources, we have a couple of other folks who do
similar activities monitoring jobs and they don't seem to trigger such a
load.

I have a feeling your code may not be cleaning up the processes after
something maybe goes wrong and then the java process goes spinning out
of control.

If you would like to begin testing again, it will be okay on login3 for
now.

Also, we are investigating adding additional system resources to Ranger
to better support these types of activities and move some of the globus
workload off of the login nodes.

Cheers!

-- 
John Lockman III
High Performance Computing
+1.512.471.4097
ROC 1.428
Texas Advanced Computing Center
The University of Texas at Austin

On 6/26/09 9:33 AM, Mihael Hategan wrote:
> On Fri, 2009-06-26 at 09:29 -0500, Michael Wilde wrote:
>> I will try, and cc the list. It's not clear that John knows more than he 
>> reported - most likely he saw processes owned by Zhao and Allan at the 
>> top of "top", and concluded that they were running application tests on 
>> the login hosts. But worth a try.
> 
> Well, we keep guessing while there was a specific thing that triggered
> the email.