[Swift-devel] Coaster CPU-time consumption issue

Michael Wilde wilde at mcs.anl.gov
Mon Jul 13 11:45:59 CDT 2009


I thought I wrote an email on this, but can't find it, so I will try to 
recall what I saw.

Sarah tried a test run to re-create the problem of "excessive overhead 
from coasters on the head node". This was spurred by another complaint 
from the Ranger sysadmins. The complaint had about the same level of 
detail as the first: it was a voice mail saying "your processes are 
causing too much overhead on the login node".

So we tried to do a test to isolate and quantify what was happening. We 
did not get far enough, but got some initial observations.

Submitting from gwynn.bsd.uchicago.edu (I think), Sarah ran a workflow of 
approximately 50 "sleep 300" jobs.

This was around 7PM Thu night Jul 9. Sarah, are these logs still there? 
Can you copy the coaster and swift logs to the CI where we can look at them?

What I saw in top (-b -d) and ps was:

- two Java processes were created on login3 (headnode) with her ID
- one was about 275MB virt mem and burning 100% CPU time, continuously
- one was about 1GB virt mem and not burning much time
- tailing the coaster log in Sarah's home directory showed repetitive 
activity, seemingly about every second, a burst of "polling-like" messages
- seems like there were about 3-4 GRAM jobmanagers for the 50 jobs, 
which would be good, I think (in that it seems like jobs were allocated 
in blocks).
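
One way to put a number on the "about every second" burst is to count coaster log lines per second. This is only a minimal sketch: it assumes each log line starts with a log4j-style timestamp such as "2009-07-09 19:02:11,345", which may not match the real coaster log, and the function names are made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: lines begin with "YYYY-MM-DD HH:MM:SS"; adjust to the real log format.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def lines_per_second(lines):
    """Count log lines in each one-second bucket, keyed by the timestamp string."""
    counts = Counter()
    for line in lines:
        m = TS.match(line)
        if m:  # lines without a timestamp (e.g. stack traces) are skipped
            counts[m.group(1)] += 1
    return counts

def busiest(counts, n=5):
    """Return the n seconds with the most log lines, busiest first."""
    return counts.most_common(n)
```

To use it, open the coaster log and pass the file object to lines_per_second(); a steady count every second while the jobs are asleep would support the polling theory.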

At the time we did not have a chance to gather detailed evidence, but I 
was surprised by two things:

- that there were two Java processes and that one was so big. (Or, most 
likely, was the active process just a child thread of the main process?)
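
Whether the two Java entries are threads or separate processes can be checked on Linux from /proc/<pid>/status, which reports Pid, Tgid (thread group ID) and PPid. Threads of one process share an address space, so they would normally show the same virtual size; the differing sizes seen in top hint at two separate processes, but that is not proof. A minimal sketch, with hypothetical helper names, that compares two status files:

```python
def parse_status(text):
    """Parse the 'Key:<tab>value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def relationship(status_a, status_b):
    """Describe how two tasks relate, given the text of their status files."""
    a, b = parse_status(status_a), parse_status(status_b)
    if a["Tgid"] == b["Tgid"]:
        return "threads of the same process"
    if a["PPid"] == b["Pid"]:
        return "first is a child process of second"
    if b["PPid"] == a["Pid"]:
        return "second is a child process of first"
    return "separate, not parent and child"
```

The inputs would be the contents of /proc/<pid>/status for the two PIDs shown by top on login3.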

- that there was continual log activity while the 50 jobs were sleeping. 
But I don't have solid evidence that the 50 jobs were actually running 
and sleeping.

I think if we correlate the swift log and the coaster log here we might 
learn more.
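
A rough sketch of such a correlation is to interleave the two logs by timestamp. It assumes both logs use the same fixed-width "YYYY-MM-DD HH:MM:SS,mmm" prefix (so plain string comparison orders them correctly) and that the clocks and time zones on the submit host and the login node agree; neither assumption has been checked against the real logs, and the names below are illustrative only.

```python
import heapq
import re

# ASSUMPTION: log4j-style prefix with milliseconds; adjust to the real logs.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def stamped(lines, label):
    """Yield (timestamp, label, line); untimestamped lines inherit the previous stamp."""
    last = ""
    for line in lines:
        m = TS.match(line)
        if m:
            last = m.group(1)
        yield (last, label, line.rstrip("\n"))

def merge_logs(swift_lines, coaster_lines):
    """Interleave two time-ordered logs into one list sorted by timestamp."""
    return list(heapq.merge(
        stamped(swift_lines, "swift"),
        stamped(coaster_lines, "coaster"),
        key=lambda record: record[0],
    ))
```

Printing the merged records with their labels shows which coaster messages fall between job submission and job completion in the swift log.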

I don't know if this was using Mihael's latest code with a reduced 
logging level or not.

Allan, this seems like it should be straightforward to reproduce now, so 
please go ahead and try to do that, and capture everything, including 
ideally the profile info that Mihael was trying to explain to Zhao how 
to capture.

- Mike





