[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Stuart Martin smartin at mcs.anl.gov
Wed Jan 30 13:19:47 CST 2008


I'm saying run swift tests using GRAM4 and see what you get.  Run a  
similar job scenario like 2000 jobs to the same GRAM4 service.  I will  
be interested to see how swift does for performance, scalability,  
errors...
It's possible that condor-g is not optimal, so seeing how another
GRAM4 client doing similar job submission scenarios fares would make
for an interesting comparison.
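
For such a test run, the "timeout as a signal to back off" idea from the
quoted message below could look roughly like this on the client side.
This is only a minimal Python sketch under stated assumptions, not Swift
or condor-g code: submit_job, SubmitTimeout, and all the delay
parameters are hypothetical names chosen for illustration.

```python
import random
import time


class SubmitTimeout(Exception):
    """Hypothetical error raised when a job submission times out."""


def submit_with_backoff(submit_job, job, max_attempts=5, base_delay=1.0,
                        max_delay=60.0, sleep=time.sleep):
    """Submit one job, backing off exponentially after each timeout.

    submit_job is a caller-supplied function (an assumption here) that
    either returns a result or raises SubmitTimeout.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_job(job)
        except SubmitTimeout:
            if attempt == max_attempts:
                # Out of retries: let the caller see the timeout.
                raise
            # Wait a random time up to the current delay so that many
            # clients do not all retry at the same moment.
            sleep(random.uniform(0, delay))
            # Double the delay ceiling for the next timeout, up to a cap.
            delay = min(delay * 2, max_delay)
```

The sleep function is passed in as a parameter only so that the retry
schedule can be exercised without actually waiting.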

-Stu

On Jan 30, 2008, at 12:37 PM, Mihael Hategan wrote:

> I'm confused. Why would you want to test GRAM scalability while
> introducing additional biasing elements, such as Condor-G?
>
> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
>> All,
>>
>> I wanted to chime in with a number of things being discussed here.
>>
>> There is a GRAM RFT Core reliability group focused on ensuring the
>> GRAM service stays up and functional in spite of an onslaught from a
>> client.  http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
>>
>> The ultimate goal here is that a client may get a timeout and that
>> would be the signal to back off some.
>>
>> -----
>>
>> OSG - VO testing: We worked with Terrence (CMS) recently and here are
>> his test results.
>> 	http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
>>
>> GRAM2 handled this scenario (2000 jobs x 2 condor-g clients to the
>> same GRAM service) better than GRAM4.  But again, this is with the
>> condor-g tricks.  Without the tricks, GRAM2 will handle the load better.
>>
>> OSG VTB testing: These were using globusrun-ws and also condor-g.
>> 	https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
>>
>> Clients in these tests got a variety of errors depending on the jobs
>> run: timeouts, GridFTP authentication errors, client-side OOM, ...
>> GRAM4 functions pretty well, but it was not able to handle Terrence's
>> scenario.  But it handled 1000 jobs x 1 condor-g client just fine.
>>
>> -----
>>
>> It would be very interesting to see how swift does with GRAM4.  This
>> would make for a nice comparison to condor-g.
>>
>> As far as having functioning GRAM4 services on TG, things have
>> improved.  LEAD is using GRAM4 exclusively and we've been working
>> with them to make sure the GRAM4 services are up and functioning.
>> INCA has been updated to more effectively test and monitor GRAM4 and
>> GridFTP services that LEAD is targeting.  This could be extended for
>> any hosts that swift would like to test against.  Here are some
>> interesting charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
>>
>> -Stu
>>
>> On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
>>
>>>
>>> On Jan 30, 2008, at 09:48 AM, Ben Clifford wrote:
>>>
>>> [snip]
>>>
>>>> No. The default behaviour when working with a user who is "just
>>>> trying to get their stuff to run" is "screw this, use GRAM2
>>>> because it works".
>>>>
>>>> It's a self-reinforcing feedback loop, that will be broken at the
>>>> point that it becomes easier for people to stick with GRAM4 than
>>>> default back to GRAM2. I guess we need to keep trying every now
>>>> and then and hope that one time it sticks ;-)
>>>>
>>>> -- 
>>>
>>> Well this works to a point, but if falling back to a technology that
>>> is known to not be scalable for your sizes results in killing a
>>> machine, I, as a site admin, will eventually either a) deny you
>>> service b) shut down the poorly performing service or c) all of the
>>> above. So it's in your best interest to find and use those
>>> technologies that are best suited to the task at hand so the users
>>> of your software don't get nailed by (a).
>>>
>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM
>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to
>>> throttle itself (think of how apache reports "Server busy. Try again
>>> later") is the best path forward. For the short term, it seems that
>>> the Swift developers should manually find those limits for sites
>>> that the users use regularly for them to use, *and* educate their
>>> users on how to identify that they could be adversely affecting a
>>> resource and throttle themselves till the ideal, automated method is
>>> a usable reality.
>>>



