[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Wed Jan 30 11:21:35 CST 2008

All,

I wanted to chime in with a number of things being discussed here.

There is a GRAM RFT Core reliability group focused on ensuring the  
GRAM service stays up and functional in spite of an onslaught from a  
client.  http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team

The ultimate goal here is that, at worst, a client gets a timeout, and  
that would be its signal to back off some.
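
To make the "timeout means back off" idea concrete, here is a rough  
client-side sketch in Python.  It is not how GRAM, condor-g or swift  
actually behave.  `submit` is a hypothetical callable standing in for  
whatever really submits the job, and I'm assuming it raises  
TimeoutError when the service is too busy to answer.  The delay  
numbers are made up.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5, rng=random.random):
    """Yield capped exponential delays (seconds) with random jitter.

    All parameter values are illustrative assumptions, not tuned limits.
    """
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        # Jitter in [0.5*delay, delay) so many clients don't retry in lockstep.
        yield delay * (0.5 + 0.5 * rng())


def submit_with_backoff(submit, job, sleep=time.sleep, **backoff_args):
    """Call the hypothetical submit(job); on TimeoutError, wait and retry."""
    for delay in backoff_delays(**backoff_args):
        try:
            return submit(job)
        except TimeoutError:
            # The timeout is the signal to back off before trying again.
            sleep(delay)
    # Final attempt: if it still times out, let the error reach the caller.
    return submit(job)
```

The point is only that the client slows itself down when the service  
pushes back, rather than resubmitting right away.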

-----

OSG - VO testing: We worked with Terrence (CMS) recently and here are  
his test results.
	http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests

GRAM2 handled this load (2000 jobs x 2 condor-g clients, all to the  
same GRAM service) better than GRAM4.  But again, this is with the  
condor-g tricks.  Without the tricks, GRAM4 will handle the load better.

OSG VTB testing: These were using globusrun-ws and also condor-g.
	https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation

Clients in these tests got a variety of errors depending on the jobs  
run: timeouts, GridFTP authentication errors, client-side OOM, ...   
GRAM4 functions pretty well.  It was not able to handle Terrence's  
scenario, but it handled 1000 jobs x 1 condor-g client just fine.

-----

It would be very interesting to see how swift does with GRAM4.  This  
would make for a nice comparison to condor-g.

As far as having functioning GRAM4 services on TG, things have  
improved.  LEAD is using GRAM4 exclusively and we've been working with  
them to make sure the GRAM4 services are up and functioning.  INCA has  
been updated to more effectively test and monitor GRAM4 and GridFTP  
services that LEAD is targeting.  This could be extended for any hosts  
that swift would like to test against.  Here are some interesting  
charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi

-Stu

On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:

>
> On Jan 30, 2008, at 09:48 AM, Ben Clifford wrote:
>
> [snip]
>
>> No. The default behaviour when working with a user who is "just  
>> trying to
>> get their stuff to run" is "screw this, use GRAM2 because it works".
>>
>> It's a self-reinforcing feedback loop, that will be broken at the  
>> point
>> that it becomes easier for people to stick with GRAM4 than default  
>> back to
>> GRAM2. I guess we need to keep trying every now and then and hope  
>> that one
>> time it sticks ;-)
>>
>> -- 
>
> Well this works to a point, but if falling back to a technology that  
> is known to not be scalable for your sizes results in killing a  
> machine, I, as a site admin, will eventually either a) deny you  
> service b) shut down the poorly performing service or c) all of the  
> above. So it's in your best interest to find and use those  
> technologies that are best suited to the task at hand so the users  
> of your software don't get nailed by (a).
>
> In this case it seems to me that using WS-GRAM, extending WS-GRAM  
> and/or MDS to report site statistics, and/or modifying WS-GRAM to  
> throttle itself (think of how apache reports "Server busy. Try again  
> later") is the best path forward. For the short term, it seems that  
> the Swift developers should manually find those limits for sites  
> that the users use regularly for them to use, *and* educate their  
> users on how to identify that they could be adversely affecting a  
> resource and throttle themselves till the ideal, automated method is  
> a usable reality.
>