[Swift-devel] Support request: Swift jobs flooding uc-teragrid?
Stuart Martin
smartin at mcs.anl.gov
Wed Jan 30 11:21:35 CST 2008
All,
I wanted to chime in on a number of things being discussed here.
There is a GRAM-RFT-Core reliability tiger team focused on ensuring
the GRAM service stays up and functional in spite of an onslaught
from a client:
http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
The ultimate goal here is that a client may get a timeout, and that
would be the signal to back off a bit.
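As a concrete illustration of that behaviour, a minimal client-side
backoff loop might look like the sketch below. The submit() callable
and the SubmitTimeout exception are hypothetical stand-ins, not the
actual GRAM client API.

import random
import time

class SubmitTimeout(Exception):
    """Raised by the (hypothetical) client call when the service times out."""

def submit_with_backoff(submit, job, max_retries=6,
                        base_delay=5.0, max_delay=300.0):
    """Submit a job, treating a timeout as the signal to back off.

    `submit` stands in for whatever client call actually sends the job
    to the GRAM service; it is assumed to raise SubmitTimeout on a timeout.
    """
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return submit(job)
        except SubmitTimeout:
            # A timeout means "the service is busy": wait before retrying,
            # doubling the delay each time and adding jitter so that many
            # clients do not all retry in lock-step.
            time.sleep(min(delay, max_delay) * (0.5 + random.random()))
            delay *= 2
    raise RuntimeError("giving up after %d timed-out submissions" % max_retries)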
-----
OSG VO testing: We worked with Terrence (CMS) recently, and here are
his test results:
http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
GRAM2 handled this load (2000 jobs x 2 condor-g clients against the
same GRAM service) better than GRAM4. But again, this is with the
condor-g tricks; without the tricks, GRAM2 will handle the load better.
OSG VTB testing: These tests used globusrun-ws and also condor-g.
https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
Clients in these tests got a variety of errors depending on the jobs
run: timeouts, GridFTP authentication errors, client-side OOM, etc.
GRAM4 functioned pretty well, but it was not able to handle Terrence's
scenario. It did handle 1000 jobs x 1 condor-g client just fine.
-----
It would be very interesting to see how Swift does with GRAM4. This
would make for a nice comparison to condor-g.
As far as having functioning GRAM4 services on TG goes, things have
improved. LEAD is using GRAM4 exclusively, and we've been working with
them to make sure the GRAM4 services are up and functioning. INCA has
been updated to more effectively test and monitor the GRAM4 and
GridFTP services that LEAD is targeting. This could be extended to any
hosts that Swift would like to test against. Here are some interesting
charts from INCA: http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
-Stu
On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
>
> On Jan 30, 2008, at 09:48 AM, Ben Clifford wrote:
>
> [snip]
>
>> No. The default behaviour when working with a user who is "just
>> trying to get their stuff to run" is "screw this, use GRAM2
>> because it works".
>>
>> It's a self-reinforcing feedback loop that will be broken at the
>> point where it becomes easier for people to stick with GRAM4 than
>> to default back to GRAM2. I guess we need to keep trying every now
>> and then and hope that one time it sticks ;-)
>>
>> --
>
> Well, this works to a point, but if falling back to a technology
> that is known not to be scalable at your sizes results in killing
> a machine, I, as a site admin, will eventually either (a) deny you
> service, (b) shut down the poorly performing service, or (c) all
> of the above. So it's in your best interest to find and use the
> technologies that are best suited to the task at hand, so the
> users of your software don't get nailed by (a).
>
> In this case it seems to me that using WS-GRAM, extending WS-GRAM
> and/or MDS to report site statistics, and/or modifying WS-GRAM to
> throttle itself (think of how Apache reports "Server busy. Try
> again later") is the best path forward. For the short term, it
> seems that the Swift developers should manually determine those
> limits for the sites their users use regularly, *and* educate
> their users on how to identify when they could be adversely
> affecting a resource and throttle themselves until the ideal,
> automated method is a usable reality.
>
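To make Ti's short-term suggestion concrete: client-side throttling
against manually determined per-site limits could be as simple as the
sketch below. The site names, the limits, and the submit() call are
hypothetical placeholders, not the actual Swift or WS-GRAM interfaces.

import threading

# Hypothetical, manually determined per-site concurrency limits.
SITE_LIMITS = {
    "site-a.example.org": 50,
    "site-b.example.org": 100,
}

# One semaphore per site caps how many jobs are in flight there at once.
_site_slots = {site: threading.BoundedSemaphore(limit)
               for site, limit in SITE_LIMITS.items()}

def run_throttled(site, submit, job):
    """Block until `site` is below its limit, then run the job there.

    `submit` stands in for whatever call actually hands the job to the
    site's GRAM service; the slot is held until that call returns.
    """
    with _site_slots[site]:
        return submit(job)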