[Swift-devel] Coaster CPU-time consumption issue

Mon Jul 13 12:28:54 CDT 2009

On 7/13/09 12:04 PM, Mihael Hategan wrote:
> On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote:
>> I thought I wrote an email on this, but can't find it, so I will try to 
>> recall what I saw.
>>
>> Sarah tried a test run to re-create the problem of "excessive overhead 
>> from coasters on the head node". This was spurred by another complaint 
>> from the Ranger sysadmins. The complaint had about the same level of 
>> detail as the first: it was a voice mail saying "your processes are 
>> causing too much overhead on the login node".
>>
>> So we tried to do a test to isolate and quantify what was happening. We 
>> did not get far enough, but got some initial observations.
>>
>> Submitting from gwynn.bsd.uchicago.edu (I think) Sarah ran a workflow of 
>> 50 sleep 300 jobs (approx).
>>
>> This was around 7PM Thu night Jul 9. Sarah, are these logs still there? 
>> Can you copy the coaster and swift logs to the CI where we can look at them?
>>
>> What I saw in top (-b -d) and ps was:
>>
>> - two Java processes were created on login3 (headnode) with her ID
>> - one was about 275MB virt mem and burning 100% CPU time, continuously
>> - one was about 1GB virt mem and not burning much time
>> - tailing the coaster log in Sarah's home directory showed repetitive 
>> activity, seemingly about every second, a burst of "polling-like" messages
>> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, 
>> which would be good, I think (in that it seems like jobs were allocated 
>> in blocks).
>>
>> At the time we did not have a chance to gather detailed evidence, but I 
>> was surprised by two things:
>>
>> - that there were two Java processes and that one was so big. (Or, most 
>> likely, was the active process just a child thread of the main process?)
> 
> One java process is the bootstrap process (it downloads the coaster
> jars, sets up the environment and runs the coaster service). It has
> always been like this. Did you happen to capture the output of ps to a
> file? That would be useful, because from what you are suggesting, it
> appears that the bootstrap process is eating 100% CPU. That process
> should only be sleeping after the service is started.

I *thought* I captured the output of "top -u sarahs'id -b -d" but I can't 
locate it.

As best as I can recall it showed the larger memory-footprint process to 
be relatively idle, and the smaller footprint process (about 275MB) to 
be burning 100% of a CPU.  Allan will try to get a snapshot of this shortly.

If this observation is correct, what's the best way to find out where it's 
spinning? Profiling? Debug logging? Can you get profiling data from a 
JVM that doesn't exit?
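
One option that should not need the JVM to exit, assuming a Linux host and 
a HotSpot JVM: "top -H -p <pid>" lists per-thread CPU usage with decimal 
thread IDs, and a thread dump (from "jstack <pid>" or "kill -QUIT <pid>") 
tags each thread header with the same ID in hex as nid=0x.... The sketch 
below only does the matching step; the function name find_hot_thread is 
made up for illustration and nothing here comes from the actual run.

```python
import re

def find_hot_thread(dump_text, lwp_id):
    """Return the thread-dump header line whose nid equals lwp_id.

    dump_text: text of a HotSpot thread dump (assumed to contain
               'nid=0x<hex>' on each thread header line).
    lwp_id:    decimal thread ID as shown by `top -H`.
    Returns None when no thread header matches.
    """
    for line in dump_text.splitlines():
        match = re.search(r"\bnid=0x([0-9a-fA-F]+)\b", line)
        if match and int(match.group(1), 16) == lwp_id:
            return line
    return None
```

Taking two or three dumps a few seconds apart and checking whether the 
matched thread stays in the same stack frames would hint at where it spins.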

- Mike

> 
>> - that there was continual log activity
> 
> By some very odd definition of "continual". The schedule is re-computed
> periodically. The messages also tell you how much time it takes to
> re-compute the schedule, which divided by the pause interval should give
> you the maximum CPU usage for the process for a time period, other
> things ignored. In the idle state, this takes around 1ms (0.1% CPU
> usage).
> 
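The bound described above is just the re-compute time divided by the pause 
interval. A minimal sketch of that arithmetic; the 1-second pause is an 
assumption (it matches the "about every second" observation and the 
1ms -> 0.1% figure), not a value read from the coaster code.

```python
def max_cpu_fraction(recompute_seconds, pause_seconds):
    """Upper bound on CPU share spent re-computing the schedule,
    other things ignored: time per re-computation / pause interval."""
    return recompute_seconds / pause_seconds

# Idle case from the thread: ~1 ms per re-computation,
# with an assumed pause of 1 second between re-computations.
print(f"{max_cpu_fraction(0.001, 1.0):.1%}")
```

By the same arithmetic, a process pinned at 100% CPU would need each 
re-computation to take about as long as the pause itself, which the timing 
messages in the coaster log would show directly.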
>>  while the 50 jobs were sleeping. 
>> But I don't have solid evidence that the 50 jobs were actually running 
>> and sleeping.
>>
>> I think if we correlate the swift log and the coaster log here we might 
>> learn more.
>>
>> I don't know if this was using Mihael's latest code with a reduced 
>> logging level or not.
>>
>> Allan, this seems like it should be straightforward to reproduce now, so 
>> please go ahead and try to do that, and capture everything, including 
>> ideally the profile info that Mihael was trying to explain to Zhao how 
>> to capture.
>>
>> - Mike