[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Wed Jan 30 08:40:24 CST 2008

Here is something that might help Swift detect when the GRAM host is
under heavy load, before things start to fail.

Could a simple service run in the same container as the GRAM4 service
and expose low-level information such as CPU utilization, machine load,
free memory, swap used, disk I/O, and network I/O? If this were a
standard service that exposed this information as resource properties
(RPs), or even as a simple status WS function, it could be used to
determine the load on the machine where GRAM is running. The tricky
part is getting this kind of low-level information in a
platform-independent fashion, but it might be worth the effort.

BTW, I have done exactly this in the context of Falkon, to monitor the
state of the machine where the Falkon service runs. I start "vmstat"
and scrape its output at regular intervals to get the needed
information, and it works quite well on the few Linux distributions I
tried it on: RH8, SuSE 9, and SuSE 10.
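
To make the idea concrete, here is a minimal sketch of vmstat scraping
in Python. It is not the Falkon code; the function names and the
thresholds in looks_overloaded() are hypothetical, and it assumes the
usual Linux vmstat layout (a row of column names followed by rows of
integers).

```python
import subprocess


def parse_vmstat(output):
    """Return the most recent vmstat sample as {column name: int}."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    # The column-name row is the one that contains both "free" and "id";
    # the banner row above it ("procs ---memory--- ...") does not.
    header = next(cols for cols in rows if "free" in cols and "id" in cols)
    # The last row is the newest sample.
    values = rows[-1]
    return dict(zip(header, (int(v) for v in values)))


def sample_vmstat(interval=1):
    """Run "vmstat <interval> 2" and parse the second (current) sample.

    The first data row vmstat prints is an average since boot, so two
    samples are requested and only the last one is kept.
    """
    result = subprocess.run(
        ["vmstat", str(interval), "2"],
        capture_output=True, text=True, check=True,
    )
    return parse_vmstat(result.stdout)


def looks_overloaded(sample, max_runnable=8, min_idle_pct=10):
    """Hypothetical check: too many runnable processes or too little idle CPU."""
    return sample["r"] > max_runnable or sample["id"] < min_idle_pct
```

A monitor would call sample_vmstat() in a loop and publish the result
(or just the looks_overloaded() flag) for the submitting client to read.
The right thresholds depend on the site and would need tuning.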

Ioan

Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ti Leggett wrote:
>
>   
>> As a site admin I would rather you ramp up and not throttle down. Starting
>> high and working to a lower number means you could kill the machine many times
>> before you find the lower bound of what a site can handle. Starting slowly and
>> ramping up means you find that lower bound once. From my point of view, one
>> user consistently killing the resource can be turned off to prevent denial of
>> service to all other users *until* they can prove they won't kill the
>> resource. So I prefer the conservative.
>>     
>
> The code does ramp up at the moment, starting with 6 simultaneous jobs by 
> default.
>
> What doesn't happen very well at the moment is automated detection of 'too 
> much' in order to stop ramping up - the only really good feedback at the 
> moment (not just in this particular case but in other cases before) seems 
> to be a human being sitting in the feedback loop tweaking stuff.
>
> Two things we should work on are:
>  i) making it easier for the human who is sitting in that loop
> and
>  ii) figuring out a better way to get automated feedback.
>
> From a TG-UC perspective, for example, what is a good way to know 'too 
> much'? Is it OK to keep submitting jobs until they start failing? Or 
> should there be some lower point at which we stop?
>
>   

-- 
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================
