[Swift-devel] Coaster CPU-time consumption issue

Mihael Hategan hategan at mcs.anl.gov
Mon Jul 13 12:04:02 CDT 2009


On Mon, 2009-07-13 at 11:45 -0500, Michael Wilde wrote:
> I thought I wrote an email on this, but can't find it, so I will try to
> recall what I saw.
> 
> Sarah tried a test run to re-create the problem of "excessive overhead 
> from coasters on the head node". This was spurred by another complaint 
> from the Ranger sysadmins. The complaint had about the same level of 
> detail as the first: it was a voice mail saying "your processes are
> causing too much overhead on the login node".
> 
> So we tried to do a test to isolate and quantify what was happening. We 
> did not get far enough, but got some initial observations.
> 
> Submitting from gwynn.bsd.uchicago.edu (I think), Sarah ran a workflow of
> approximately 50 "sleep 300" jobs.
> 
> This was around 7PM Thu night Jul 9. Sarah, are these logs still there? 
> Can you copy the coaster and swift logs to the CI where we can look at them?
> 
> What I saw in top (-b -d) and ps was:
> 
> - two Java processes were created on login3 (headnode) with her ID
> - one was about 275MB virt mem and burning 100% CPU time, continuously
> - one was about 1GB virt mem and not burning much time
> - tailing the coaster log in Sarah's home directory showed repetitive 
> activity, seemingly about every second, a burst of "polling-like" messages
> - seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, 
> which would be good, I think (in that it seems like jobs were allocated 
> in blocks).
> 
> At the time we did not have a chance to gather detailed evidence, but I 
> was surprised by two things:
> 
> - that there were two Java processes and that one was so big. (Or most
> likely the active process was just a child thread of the main process?)

One java process is the bootstrap process (it downloads the coaster
jars, sets up the environment and runs the coaster service). It has
always been like this. Did you happen to capture the output of ps to a
file? That would be useful, because from what you are suggesting, it
appears that the bootstrap process is eating 100% CPU. That process
should only be sleeping after the service is started.
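
For the next run, a snapshot along these lines would show which of the two
Java processes is the one using CPU. This is only a rough sketch: the helper
names (parse_ps, java_processes, snapshot) and the output file name are
made up for illustration, and it assumes a ps that accepts the standard
-e and -o pid,ppid,pcpu,vsz,args options.

```python
import subprocess

# Columns requested from ps; args comes last because it may contain spaces.
PS_FIELDS = "pid,ppid,pcpu,vsz,args"


def parse_ps(text):
    """Parse `ps -o pid,ppid,pcpu,vsz,args` output into a list of dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)     # at most 5 fields, args kept whole
        if len(parts) < 5:
            continue
        pid, ppid, pcpu, vsz, args = parts
        rows.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "pcpu": float(pcpu),
            "vsz_kb": int(vsz),
            "args": args,
        })
    return rows


def java_processes(rows):
    """Keep only rows whose executable name contains 'java'."""
    return [r for r in rows if "java" in r["args"].split()[0]]


def snapshot(path="ps-snapshot.txt"):
    """Append one raw ps listing to a file (hypothetical name) and parse it."""
    out = subprocess.run(
        ["ps", "-e", "-o", PS_FIELDS],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(path, "a") as f:
        f.write(out)
    return parse_ps(out)
```

Calling java_processes(snapshot()) every few seconds during a run, and
comparing the pid/ppid pairs, should make it clear whether the busy process
is the bootstrap process or the service it started.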

> 
> - that there was continual log activity

By some very odd definition of "continual". The schedule is re-computed
periodically. The messages also tell you how much time it takes to
re-compute the schedule, which divided by the pause interval should give
you the maximum CPU usage for the process for a time period, other
things ignored. In the idle state, this takes around 1ms (0.1% CPU
usage).
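
To spell out the arithmetic: the sketch below divides the re-computation
time by the pause interval. The function name is made up, and the 1 second
pause interval is an assumption inferred from the 1ms / 0.1% figures above,
not a value read from the coaster configuration.

```python
def max_cpu_fraction(recompute_seconds, pause_seconds):
    """Upper bound on CPU share spent re-computing the schedule.

    Everything else the process does is ignored, as in the estimate above.
    """
    if pause_seconds <= 0:
        raise ValueError("pause interval must be positive")
    return recompute_seconds / pause_seconds


# Idle case: about 1 ms of work per (assumed) 1 s pause interval.
idle = max_cpu_fraction(0.001, 1.0)
print(f"{idle:.1%}")  # prints 0.1%
```

Plugging in the times reported in the coaster log for the test run gives
the share of CPU the scheduler itself can account for.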

>  while the 50 jobs were sleeping. 
> But I don't have solid evidence that the 50 jobs were actually running
> and sleeping.
>
> I think if we correlate the swift log and the coaster log here we might
> learn more.
>
> I don't know if this was using Mihael's latest code with a reduced
> logging level or not.
> 
> Allan, this seems like it should be straightforward to reproduce now, so 
> please go ahead and try to do that, and capture everything, including 
> ideally the profile info that Mihael was trying to explain to Zhao how 
> to capture.
> 
> - Mike



