[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Wed Jan 30 09:09:57 CST 2008


We still need to agree on which numbers are OK and which are not.

We should perhaps figure that out first and then talk about the
implementation.

On Wed, 2008-01-30 at 08:40 -0600, Ioan Raicu wrote:
> Here is something that might help Swift determine when the GRAM host
> is under heavy load, before things start to fail.
> 
> Could a simple service be made to run in the same container as the
> GRAM4 service that exposes certain low-level information, such as
> CPU utilization, machine load, free memory, swap used, disk I/O,
> network I/O, etc.? If this were a standard service exposing that
> information as a resource property (RP), or even through a simple
> status WS operation, it could be used to determine the load on the
> machine where GRAM is running.  The tricky part is getting this kind
> of low-level information in a platform-independent fashion, but it
> might be worth the effort.
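> 
> Just to make that concrete, the payload such a status operation might
> return could be as simple as the following rough sketch; the names are
> invented for illustration and this is not an existing Globus interface:
> 
>   // Hypothetical snapshot of host load, as it might be returned by a
>   // status call or published as a resource property.
>   public class HostLoadSnapshot implements java.io.Serializable {
>       public double cpuUtilization;   // fraction of CPU in use, 0.0-1.0
>       public double loadAverage1Min;  // 1-minute load average
>       public long   freeMemoryKb;     // free physical memory (kB)
>       public long   swapUsedKb;       // swap currently in use (kB)
>       public long   diskIoKbPerSec;   // recent disk I/O rate
>       public long   netIoKbPerSec;    // recent network I/O rate
>       public long   timestampMillis;  // when the sample was taken
>   }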
> 
> BTW, I have done exactly this in the context of Falkon, to monitor the
> state of the machine where the Falkon service runs.  I start
> "vmstat" and scrape its output at regular intervals to get the needed
> information, and it works quite well on the few Linux distributions I
> have tried it on: RH8, SuSE 9, and SuSE 10.
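> 
> For what it's worth, the scraping loop can be done along these lines
> (a simplified Java sketch, not the actual Falkon code; the column
> positions assume the classic 16-column vmstat output and can vary a
> little between versions):
> 
>   import java.io.BufferedReader;
>   import java.io.InputStreamReader;
> 
>   // Runs "vmstat 5" and parses each sample line for free memory,
>   // swap in/out, block I/O and CPU idle.
>   public class VmstatMonitor {
>       public static void main(String[] args) throws Exception {
>           Process p = Runtime.getRuntime().exec(new String[] {"vmstat", "5"});
>           BufferedReader in =
>               new BufferedReader(new InputStreamReader(p.getInputStream()));
>           String line;
>           while ((line = in.readLine()) != null) {
>               String[] f = line.trim().split("\\s+");
>               // Skip the two header lines; data lines start with a digit.
>               if (f.length < 16 || !f[0].matches("\\d+")) continue;
>               long freeKb   = Long.parseLong(f[3]);    // free memory (kB)
>               long swapIn   = Long.parseLong(f[6]);    // si
>               long swapOut  = Long.parseLong(f[7]);    // so
>               long blockIn  = Long.parseLong(f[8]);    // bi
>               long blockOut = Long.parseLong(f[9]);    // bo
>               int  cpuIdle  = Integer.parseInt(f[14]); // id
>               System.out.println("free=" + freeKb + "kB cpuBusy="
>                   + (100 - cpuIdle) + "% si=" + swapIn + " so=" + swapOut
>                   + " bi=" + blockIn + " bo=" + blockOut);
>           }
>       }
>   }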
> 
> Ioan
> 
> Ben Clifford wrote: 
> > On Wed, 30 Jan 2008, Ti Leggett wrote:
> > 
> >   
> > > As a site admin I would rather you ramp up than throttle down. Starting
> > > high and working down to a lower number means you could kill the machine many
> > > times before you find the limit of what a site can handle. Starting slowly and
> > > ramping up means you find that limit once. From my point of view, a user who
> > > consistently kills the resource can be turned off to prevent denial of
> > > service to all other users *until* they can prove they won't kill the
> > > resource. So I prefer the conservative approach.
> > >     
> > 
> > The code does ramp up at the moment, starting with 6 simultaneous jobs by 
> > default.
> > 
> > What doesn't happen very well at the moment is automated detection of 'too 
> > much' so that we can stop ramping up - the only really good feedback so far 
> > (not just in this particular case but in other cases before) seems 
> > to be a human being sitting in the feedback loop tweaking stuff.
> > 
> > Two things we should work on are:
> >  i) making it easier for the human who is sitting in that loop
> > and
> >  ii) figuring out a better way to get automated feedback.
> > 
> > From a TG-UC perspective, for example, what is a good way to know 'too 
> > much'? Is it OK to keep submitting jobs until they start failing? Or 
> > should there be some lower point at which we stop?
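> > 
> > Just to make "automated feedback" concrete, one shape it could take is
> > an additive-increase/multiplicative-decrease throttle, roughly like the
> > sketch below; the constants are invented and not a recommendation for
> > TG-UC:
> > 
> >   // Illustrative AIMD throttle on concurrent submissions: grow slowly
> >   // while jobs succeed, back off sharply when they start failing.
> >   public class SubmitThrottle {
> >       private int limit = 6;              // start small, as Swift does
> >       private static final int MAX = 256; // invented upper bound
> >       public synchronized int currentLimit() { return limit; }
> >       public synchronized void jobSucceeded() {
> >           if (limit < MAX) limit += 1;    // additive increase
> >       }
> >       public synchronized void jobFailed() {
> >           limit = Math.max(1, limit / 2); // multiplicative decrease
> >       }
> >   }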
> > 
> >   
> 
> -- 
> ==================================================
> Ioan Raicu
> Ph.D. Candidate
> ==================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ==================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> ==================================================
> ==================================================
> 



