[Swift-devel] Coaster CPU-time consumption issue
Michael Wilde
wilde at mcs.anl.gov
Mon Jul 13 12:28:54 CDT 2009
On 7/13/09 12:04 PM, Mihael Hategan wrote:
> On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote:
>> I thought I wrote an email on this, but can't find it, so I will try to
>> recall what I saw.
>>
>> Sarah tried a test run to re-create the problem of "excessive overhead
>> from coasters on the head node". This was spurred by another complaint
>> from the Ranger sysadmins. The complaint had about the same level of
>> detail as the first: it was a voice mail saying "your processes are
>> causing too much overhead on the login node".
>>
>> So we tried to do a test to isolate and quantify what was happening. We
>> did not get far enough, but got some initial observations.
>>
>> Submitting from gwynn.bsd.uchicago.edu (I think), Sarah ran a workflow of
>> approximately 50 "sleep 300" jobs.
>>
>> This was around 7PM Thu night Jul 9. Sarah, are these logs still there?
>> Can you copy the coaster and swift logs to the CI where we can look at them?
>>
>> What I saw in top (-b -d) and ps was:
>>
>> - two Java processes were created on login3 (headnode) with her ID
>> - one was about 275MB virt mem and burning 100% CPU time, continuously
>> - one was about 1GB virt mem and not burning much time
>> - tailing the coaster log in Sarah's home directory showed repetitive
>> activity, seemingly about every second, a burst of "polling-like" messages
>> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs,
>> which would be good, I think (in that it seems like jobs were allocated
>> in blocks).
>>
>> At the time we did not have a chance to gather detailed evidence, but I
>> was surprised by two things:
>>
>> - that there were two Java processes and that one was so big. (Or, most
>> likely, was the active process just a child thread of the main process?)
>
> One java process is the bootstrap process (it downloads the coaster
> jars, sets up the environment and runs the coaster service). It has
> always been like this. Did you happen to capture the output of ps to a
> file? That would be useful, because from what you are suggesting, it
> appears that the bootstrap process is eating 100% CPU. That process
> should only be sleeping after the service is started.
I *thought* I captured the output of "top -u <Sarah's id> -b -d" but I can't
locate it.
As best as I can recall it showed the larger memory-footprint process to
be relatively idle, and the smaller footprint process (about 275MB) to
be burning 100% of a CPU. Allan will try to get a snapshot of this shortly.
If this observation is correct, what's the best way to find out where it's
spinning? Profiling? Debug logging? Can you get profiling data from a
JVM that doesn't exit?
- Mike
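
One common way to see where a running JVM is spinning, without stopping it, is to take a thread dump (for example with "jstack <pid>", or "kill -QUIT <pid>" on a HotSpot JVM) and match it against the per-thread CPU figures from "top -H -p <pid>". top reports thread ids in decimal, while the dump shows them as hexadecimal "nid=0x..." values. The following is a minimal sketch of that matching step, assuming a HotSpot-style dump; the helper names and the sample dump are made up for illustration and are not output from the Ranger run.

```python
import re

def tid_to_nid(tid):
    """Convert a decimal Linux thread id (as shown by `top -H`) to the hex nid form."""
    return hex(tid)

def find_thread(dump_text, tid):
    """Return the thread-dump stanza whose header carries nid=<hex tid>, or None."""
    nid = tid_to_nid(tid)
    # Thread stanzas in a HotSpot-style dump are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", dump_text):
        if re.search(r"\bnid=" + re.escape(nid) + r"\b", stanza):
            return stanza.strip()
    return None

# Hypothetical sample dump, for illustration only.
SAMPLE = '''"main" prio=10 tid=0x0000000040115000 nid=0x1a2b runnable [0x0000000041234000]
   java.lang.Thread.State: RUNNABLE
        at Example.spin(Example.java:10)

"Worker" prio=10 tid=0x0000000040116000 nid=0x1a2c waiting on condition [0x0000000041335000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
'''

# Suppose `top -H` showed thread 6699 at 100% CPU (6699 == 0x1a2b).
print(find_thread(SAMPLE, 6699))
```

Taking two or three dumps a few seconds apart and checking whether the busy thread shows the same stack each time usually narrows down the loop.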
>
>> - that there was continual log activity
>
> By some very odd definition of "continual". The schedule is re-computed
> periodically. The messages also tell you how much time it takes to
> re-compute the schedule, which divided by the pause interval should give
> you the maximum CPU usage for the process for a time period, other
> things ignored. In the idle state, this takes around 1ms (0.1% CPU
> usage).
>
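
The estimate described above (re-compute time divided by the pause interval) can be written out as a small calculation. This is only a sketch: the function name is made up, and the 1-second pause interval is an assumption, chosen because it is consistent with the quoted 1 ms and 0.1% figures and with the roughly once-per-second log bursts.

```python
def max_cpu_fraction(recompute_seconds, pause_seconds):
    """Upper bound on the CPU share of periodic work: busy time per pause interval."""
    if pause_seconds <= 0:
        raise ValueError("pause_seconds must be positive")
    return recompute_seconds / pause_seconds

# Idle-state figure from the message: about 1 ms per re-computation.
# The 1 s pause interval is an assumption, not a value read from the code.
idle = max_cpu_fraction(0.001, 1.0)
print(f"{idle:.1%}")  # prints 0.1%
```

By this estimate, a process sitting at 100% CPU cannot be explained by the periodic schedule re-computation alone, unless the logged re-compute times are close to the pause interval itself.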
>> while the 50 jobs were sleeping.
>> But I don't have solid evidence that the 50 jobs were actually running
>> and sleeping.
>>
>> I think if we correlate the swift log and the coaster log here we might
>> learn more.
>>
>> I don't know if this was using Mihael's latest code with a reduced
>> logging level or not.
>>
>> Allan, this seems like it should be straightforward to reproduce now, so
>> please go ahead and try to do that, and capture everything, including
>> ideally the profile info that Mihael was trying to explain to Zhao how
>> to capture.
>>
>> - Mike
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>