<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Here is something that might help Swift determine when the GRAM host is
under heavy load, before things start to fail.<br>

<br>

Could a simple service be run in the same container as the GRAM4
service to expose low-level information such as CPU utilization,
machine load, free memory, swap usage, disk I/O, and network I/O? If
this were a standard service that exposed this information as resource
properties (RPs), or even through a simple status-information WS
operation, it could be used to determine the load on the machine where
GRAM is running. The tricky part is getting this kind of low-level
information in a platform-independent fashion, but it might be worth
the effort.<br>
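The following minimal Python sketch illustrates only the data-gathering side of such a service. It does not cover the RP or WS wiring. It reports machine load using the standard library. The function name load_snapshot and the per-CPU normalization are illustrative assumptions. os.getloadavg() is only available on Unix-like systems, which is the platform-independence problem mentioned above. Free memory, swap, disk I/O, and network I/O would need further platform-specific sources.<br>
<pre>
```python
import os


def load_snapshot() -> dict:
    """Return a small load summary for the local host (hypothetical helper).

    os.getloadavg() exists only on Unix-like systems, which is one reason
    a platform-independent status service takes extra work."""
    one, five, fifteen = os.getloadavg()  # 1-, 5- and 15-minute averages
    cpus = os.cpu_count() or 1            # fall back to 1 if unknown
    return {
        "load1": one,
        "load5": five,
        "load15": fifteen,
        "cpus": cpus,
        # Load per CPU: values that stay well above 1.0 suggest more
        # runnable (or, on Linux, I/O-blocked) tasks than CPUs.
        "load1_per_cpu": one / cpus,
    }
```
</pre>
A status WS operation could return these values directly, leaving the caller (e.g. Swift) to decide what counts as "too much".<br>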

<br>

BTW, I have done exactly this in the context of Falkon, to monitor the
state of the machine where the Falkon service runs. I start "vmstat"
and scrape its output at regular intervals to get the needed
information, and it works quite well on the few Linux distributions I
tried it on: RH8, SuSE 9, and SuSE 10.<br>
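Here is a minimal Python sketch of the vmstat-scraping approach. It is not the Falkon code, and the function names are hypothetical. It assumes the Linux procps vmstat layout: a group banner line, then a column-name line, then one line per sample. The first vmstat report gives averages since boot, so the sketch asks for two reports and keeps the last one.<br>
<pre>
```python
import subprocess


def parse_vmstat(output: str) -> dict:
    """Map vmstat column names (r, b, free, si, so, us, id, ...) to the
    integer values on the last sample line of `output`."""
    lines = [line.split() for line in output.strip().splitlines()]
    # Line 0 is the group banner ("procs ---memory--- ..."),
    # line 1 holds the column names, the remaining lines are samples.
    header = lines[1]
    last = lines[-1]
    return dict(zip(header, (int(v) for v in last)))


def sample_vmstat(interval: int = 1) -> dict:
    """Run `vmstat INTERVAL 2` and parse the second report.

    The first report shows averages since boot; the second one reflects
    the most recent INTERVAL seconds."""
    out = subprocess.run(
        ["vmstat", str(interval), "2"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vmstat(out)
```
</pre>
Calling sample_vmstat() from a timer gives the periodic sampling described above. Columns such as "so" (swapped out) and "id" (CPU idle) are natural candidates for an overload signal.<br>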

<br>

Ioan<br>

<br>

Ben Clifford wrote:

<blockquote

 cite="mid:Pine.LNX.4.64.0801301342300.6302@dildano.hawaga.org.uk"

 type="cite">

  <pre wrap="">

On Wed, 30 Jan 2008, Ti Leggett wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">As a site admin I would rather you ramp up and not throttle down. Starting

high and working to a lower number means you could kill the machine many times

before you find the lower bound of what a site can handle. Starting slowly and

ramping up means you find that lower bound once. From my point of view, one

user consistently killing the resource can be turned off to prevent denial of

service to all other users *until* they can prove they won't kill the

resource. So I prefer the conservative.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The code does ramp up at the moment, starting with 6 simultaneous jobs by 

default.

What doesn't happen very well at the moment is automated detection of 'too 

much' in order to stop ramping up - the only really good feedback at the 

moment (not just in this particular case but in other cases before) seems 

to be a human being sitting in the feedback loop tweaking stuff.

Two things we should work on are:

 i) making it easier for the human who is sitting in that loop

and

 ii) figuring out a better way to get automated feedback.

>From a TG-UC perspective, for example, what is a good way to know 'too 

much'? Is it OK to keep submitting jobs until they start failing? Or 

should there be some lower point at which we stop?

  </pre>

</blockquote>
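Ben's ramp-up question can be made concrete with additive-increase/multiplicative-decrease (AIMD), one common way to add an automated stop. It is sketched below in Python. This is not Swift's actual throttling code. The class name, the 2.0 load-per-CPU threshold, and the 256-job ceiling are made-up illustrative values. Only the starting point of 6 simultaneous jobs comes from the quoted message. The load sample would have to come from something like the status service proposed at the top of this message.<br>
<pre>
```python
class RampUpThrottle:
    """Hypothetical AIMD job-submission throttle: start small, grow the
    concurrency limit while the host looks healthy, back off when not."""

    def __init__(self, initial=6, maximum=256, overload_per_cpu=2.0):
        self.limit = initial                      # max simultaneous jobs now
        self.maximum = maximum                    # hard ceiling, never exceeded
        self.overload_per_cpu = overload_per_cpu  # assumed 'too much' threshold

    def update(self, load_per_cpu: float) -> int:
        """Adjust the limit from one load sample and return the new limit."""
        if load_per_cpu >= self.overload_per_cpu:
            # Host looks overloaded: halve the limit, keeping at least 1.
            self.limit = max(1, self.limit // 2)
        else:
            # Host looks fine: allow one more simultaneous job.
            self.limit = min(self.maximum, self.limit + 1)
        return self.limit
```
</pre>
Halving on overload means the site is pushed past its limit only briefly before the submitter backs off, which fits Ti's preference for ramping up conservatively. The threshold itself would still need per-site tuning.<br>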

<br>

<pre class="moz-signature" cols="72">-- 

==================================================

Ioan Raicu

Ph.D. Candidate

==================================================

Distributed Systems Laboratory

Computer Science Department

University of Chicago

1100 E. 58th Street, Ryerson Hall

Chicago, IL 60637

==================================================

Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>

Web:   <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>

<a class="moz-txt-link-freetext" href="http://dev.globus.org/wiki/Incubator/Falkon">http://dev.globus.org/wiki/Incubator/Falkon</a>

<a class="moz-txt-link-freetext" href="http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS">http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS</a>

==================================================

</pre>

</body>

</html>