[Swift-devel] Support request: Swift jobs flooding uc-teragrid?
Michael Wilde
wilde at mcs.anl.gov
Wed Jan 30 21:23:56 CST 2008
I suggested we start the tests at a moderate intensity and record the
impact on CPU, memory, queue length, etc.
Then ramp up until those indicators start to suggest that the gatekeeper
is under strain.
It's not 100% foolproof, but better than blind stress testing.
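[Editorial note: the ramp-up procedure described above could be sketched roughly as below. The helper names `submit_batch` and `read_load`, and the threshold values, are hypothetical stand-ins, not part of Swift or GRAM.]

```python
# Hypothetical thresholds; real values come from watching the gatekeeper host.
CPU_LIMIT = 0.80      # fraction of CPU utilization considered "strain"
QLEN_LIMIT = 500      # pending-jobs queue length considered "strain"

def ramp_up(submit_batch, read_load, start=10, step=10, max_rate=2000):
    """Increase the submission rate until load indicators suggest strain.

    submit_batch(n) submits n jobs; read_load() returns (cpu, qlen)
    for the gatekeeper host. Both are assumed helpers, not real APIs.
    Returns the last rate that stayed under the thresholds.
    """
    safe_rate = 0
    rate = start
    while rate <= max_rate:
        submit_batch(rate)
        cpu, qlen = read_load()
        if cpu > CPU_LIMIT or qlen > QLEN_LIMIT:
            break          # gatekeeper under strain: stop ramping
        safe_rate = rate
        rate += step
    return safe_rate
```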
- mike
On 1/30/08 7:15 PM, Mihael Hategan wrote:
> Me running such tests will probably mess up the gatekeeper node again.
> How do we proceed?
>
> On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote:
>> I'm saying run swift tests using GRAM4 and see what you get. Run a
>> similar job scenario like 2000 jobs to the same GRAM4 service. I will
>> be interested to see how swift does for performance, scalability,
>> errors...
>> It's possible that condor-g is not optimal, so seeing how another
>> GRAM4 client doing similar job submission scenarios fares would make
>> for an interesting comparison.
>>
>> -Stu
>>
>> On Jan 30, 2008, at 12:37 PM, Mihael Hategan wrote:
>>
>>> I'm confused. Why would you want to test GRAM scalability while
>>> introducing additional biasing elements, such as Condor-G?
>>>
>>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
>>>> All,
>>>>
>>>> I wanted to chime in with a number of things being discussed here.
>>>>
>>>> There is a GRAM RFT Core reliability group focused on ensuring the
>>>> GRAM service stays up and functional in spite of an onslaught from a
>>>> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
>>>>
>>>> The ultimate goal here is that a client may get a timeout and that
>>>> would be the signal to backoff some.
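[Editorial note: the "timeout as a signal to back off" idea above is the classic exponential-backoff pattern. A minimal sketch, assuming the client tracks its own retry attempts (nothing here is a real GRAM API):]

```python
import random

def backoff_delays(base=1.0, cap=300.0, max_retries=8, jitter=True):
    """Exponential backoff schedule: double the wait after each timeout,
    capped at `cap` seconds, with optional jitter so many clients that
    time out together don't all retry in lockstep."""
    delays = []
    for attempt in range(max_retries):
        d = min(cap, base * (2 ** attempt))
        if jitter:
            d *= random.uniform(0.5, 1.0)
        delays.append(d)
    return delays
```

With jitter disabled, `backoff_delays(base=1.0, cap=60.0)` yields waits of 1, 2, 4, 8, 16, 32, 60, 60 seconds before giving up.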
>>>>
>>>> -----
>>>>
>>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are
>>>> his test results.
>>>> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
>>>>
>>>> GRAM2 handled these 2000 jobs x 2 condor-g clients to the same GRAM
>>>> service better than GRAM4 did. But again, this is with the condor-g
>>>> tricks. Without the tricks, GRAM2 will handle the load better.
>>>>
>>>> OSG VTB testing: These were using globusrun-ws and also condor-g.
>>>> https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
>>>>
>>>> Clients in these tests got a variety of errors depending on the jobs
>>>> run: timeouts, GridFTP authentication errors, client-side OOM, ...
>>>> GRAM4 functioned pretty well, but it was not able to handle
>>>> Terrence's scenario. It handled 1000 jobs x 1 condor-g client just fine.
>>>>
>>>> -----
>>>>
>>>> It would be very interesting to see how swift does with GRAM4. This
>>>> would make for a nice comparison to condor-g.
>>>>
>>>> As far as having functioning GRAM4 services on TG, things have
>>>> improved. LEAD is using GRAM4 exclusively and we've been working with
>>>> them to make sure the GRAM4 services are up and functioning. INCA has
>>>> been updated to more effectively test and monitor the GRAM4 and
>>>> GridFTP services that LEAD is targeting. This could be extended to
>>>> any hosts that swift would like to test against. Here are some
>>>> interesting charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
>>>>
>>>> -Stu
>>>>
>>>> On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
>>>>
>>>>> On Jan 30, 2008, at 9:48 AM, Ben Clifford wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> No. The default behaviour when working with a user who is "just
>>>>>> trying to get their stuff to run" is "screw this, use GRAM2
>>>>>> because it works".
>>>>>>
>>>>>> It's a self-reinforcing feedback loop that will be broken at the
>>>>>> point where it becomes easier for people to stick with GRAM4 than
>>>>>> to default back to GRAM2. I guess we need to keep trying every now
>>>>>> and then and hope that one time it sticks ;-)
>>>>>>
>>>>> Well this works to a point, but if falling back to a technology
>>>>> that is known not to be scalable at your sizes results in killing a
>>>>> machine, then I, as a site admin, will eventually either (a) deny
>>>>> you service, (b) shut down the poorly performing service, or (c)
>>>>> all of the above. So it's in your best interest to find and use the
>>>>> technologies that are best suited to the task at hand, so the users
>>>>> of your software don't get nailed by (a).
>>>>>
>>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM
>>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to
>>>>> throttle itself (think of how apache reports "Server busy. Try again
>>>>> later") is the best path forward. For the short term, it seems the
>>>>> Swift developers should manually find those limits for the sites
>>>>> their users use regularly, *and* educate their users on how to
>>>>> recognize that they could be adversely affecting a resource and
>>>>> throttle themselves until the ideal, automated method is a usable
>>>>> reality.
>>>>>
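[Editorial note: the client-side self-throttling suggested above could take the shape of an AIMD (additive-increase, multiplicative-decrease) submission window, the same scheme TCP uses for congestion control. A minimal sketch; the class and its parameters are hypothetical, not part of Swift:]

```python
class SubmitThrottle:
    """Cap concurrent submissions: grow the window by 1 on each success,
    halve it whenever the service signals "busy" (or times out)."""

    def __init__(self, start=4, floor=1, ceiling=256):
        self.window = start        # current max concurrent submissions
        self.floor = floor         # never throttle below this
        self.ceiling = ceiling     # never exceed this

    def on_success(self):
        # Additive increase: probe gently for more capacity.
        self.window = min(self.ceiling, self.window + 1)

    def on_busy(self):
        # Multiplicative decrease: back off quickly under strain.
        self.window = max(self.floor, self.window // 2)
```

A submitting client would consult `window` before dispatching its next batch, so a few "busy" responses quickly shrink the load on the gatekeeper while successes only grow it slowly.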
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>
>