[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Wed Jan 30 21:35:53 CST 2008


Sure, I'd do that anyway to test the testing script(s)/process. I mean
if I do mess it up, I want to make sure I only need to do it once.

But I'm thinking it's better to agree on a time than for Joe or Ti or
JP to randomly wonder what's going on.

On the other hand, seeing many processes in my name will probably
eliminate the confusion :)

On Wed, 2008-01-30 at 21:23 -0600, Michael Wilde wrote:
> I suggested we start the tests at a moderate intensity, and record the
> impact on CPU, memory, queue length, etc.
> 
> Then ramp up until those indicators start to suggest that the gatekeeper
> is under strain.
> 
> It's not 100% foolproof, but better than blind stress testing.
> 
> - mike
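
A minimal sketch of the ramp-up monitoring Mike describes above: sample
load, free memory, and queue length on the gatekeeper while the test
intensity grows. It assumes a Linux gatekeeper node and a PBS-style qstat
for the queue length; both are assumptions, not details from this thread.

    # ramp_monitor.py : sample gatekeeper load while a test ramps up.
    # Assumes Linux (/proc/loadavg, /proc/meminfo) and a PBS-style qstat
    # on the PATH; adjust for the actual site.
    import subprocess, time

    def sample():
        load1 = float(open("/proc/loadavg").read().split()[0])
        meminfo = dict(line.split(":") for line in open("/proc/meminfo"))
        free_kb = int(meminfo["MemFree"].strip().split()[0])
        try:
            # PBS-style qstat: subtract the two header lines (assumption).
            out = subprocess.check_output(["qstat"]).splitlines()
            qlen = max(len(out) - 2, 0)
        except (OSError, subprocess.CalledProcessError):
            qlen = -1  # no scheduler client reachable from this host
        return load1, free_kb, qlen

    if __name__ == "__main__":
        print("time load1 free_kb qlength")
        while True:
            t = int(time.time())
            load1, free_kb, qlen = sample()
            print(t, load1, free_kb, qlen)
            time.sleep(30)

Watching those three columns while the submission rate is increased is the
"record the impact, then ramp up" loop in script form: when load climbs or
free memory drops sharply, back the intensity off.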
> 
> 
> On 1/30/08 7:15 PM, Mihael Hategan wrote:
> > Me doing such tests will probably mess up the gatekeeper node again. How do
> > we proceed?
> > 
> > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote:
> >> I'm saying run Swift tests using GRAM4 and see what you get.  Run a
> >> similar job scenario, e.g. 2000 jobs to the same GRAM4 service.  I will
> >> be interested to see how Swift does for performance, scalability,
> >> errors...
> >> It's possible that condor-g is not optimal, so seeing how another
> >> GRAM4 client doing similar job submission scenarios fares would make
> >> for an interesting comparison.
> >>
> >> -Stu
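
A rough sketch of the comparison run Stu suggests: N independent submissions
to a single GRAM4 factory, recording each submission's exit status and
latency. The globusrun-ws flags shown (-submit, -factory, -c) are assumed
from typical GT4 usage and the factory host is a placeholder; check both
against the local installation.

    # gram4_flood.py : submit N trivial jobs to one GRAM4 service and
    # record how each submission fares.
    import subprocess, time

    FACTORY = "gram4.example.teragrid.org"  # placeholder, not from the thread
    N_JOBS = 100                            # ramp toward the 2000-job scenario

    def submit_one(i):
        # Flags assumed from GT4 globusrun-ws usage; verify locally.
        cmd = ["globusrun-ws", "-submit", "-factory", FACTORY, "-c", "/bin/true"]
        t0 = time.time()
        rc = subprocess.call(cmd)
        return i, rc, time.time() - t0

    if __name__ == "__main__":
        results = [submit_one(i) for i in range(N_JOBS)]
        failures = [r for r in results if r[1] != 0]
        print("submitted:", len(results), "failed:", len(failures))
        for i, rc, dt in failures:
            print("job %d exited %d after %.1fs" % (i, rc, dt))

A real comparison would drive the submissions through Swift itself, and in
parallel; the point here is only recording per-submission outcomes so that
performance, scalability, and error patterns can be compared with condor-g.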
> >>
> >> On Jan 30, 2008, at 12:37 PM, Mihael Hategan wrote:
> >>
> >>> I'm confused. Why would you want to test GRAM scalability while
> >>> introducing additional biasing elements, such as Condor-G?
> >>>
> >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
> >>>> All,
> >>>>
> >>>> I wanted to chime in with a number of things being discussed here.
> >>>>
> >>>> There is a GRAM-RFT-Core reliability tiger team focused on ensuring the
> >>>> GRAM service stays up and functional in spite of an onslaught from a
> >>>> client.  http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
> >>>>
> >>>> The ultimate goal here is that a client may get a timeout and that
> >>>> would be the signal to back off some.
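
A minimal sketch of that back-off behaviour from the client side, in generic
Python rather than Swift's actual submission code: retry on timeout, waiting
longer after each failure so a busy service gets room to recover.

    # backoff.py : treat a timeout as the signal to back off and retry later.
    import random, time

    def submit_with_backoff(submit, max_tries=5, base_delay=10.0):
        """submit() is any callable that raises TimeoutError on a service timeout."""
        for attempt in range(max_tries):
            try:
                return submit()
            except TimeoutError:
                # exponential backoff with jitter: roughly 10s, 20s, 40s, ...
                delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
                time.sleep(delay)
        raise RuntimeError("service still timing out after %d tries" % max_tries)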
> >>>>
> >>>> -----
> >>>>
> >>>> OSG VO testing: We worked with Terrence (CMS) recently, and here are
> >>>> his test results.
> >>>> 	http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
> >>>>
> >>>> GRAM2 handled this scenario (2000 jobs x 2 condor-g clients to the same
> >>>> GRAM service) better than GRAM4.  But again, this is with the condor-g
> >>>> tricks.  Without the tricks, GRAM2 will handle the load better.
> >>>>
> >>>> OSG VTB testing: These tests used globusrun-ws and also condor-g.
> >>>> 	https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
> >>>>
> >>>> Clients in these tests got a variety of errors depending on the jobs
> >>>> run: timeouts, GridFTP authentication errors, client-side OOM, ...
> >>>> GRAM4 functions pretty well, but it was not able to handle Terrence's
> >>>> scenario.  It did handle 1000 jobs x 1 condor-g client just fine.
> >>>>
> >>>> -----
> >>>>
> >>>> It would be very interesting to see how Swift does with GRAM4.  This
> >>>> would make for a nice comparison to condor-g.
> >>>>
> >>>> As far as having functioning GRAM4 services on TG, things have
> >>>> improved.  LEAD is using GRAM4 exclusively and we've been working with
> >>>> them to make sure the GRAM4 services are up and functioning.  INCA has
> >>>> been updated to more effectively test and monitor the GRAM4 and GridFTP
> >>>> services that LEAD is targeting.  This could be extended for any hosts
> >>>> that Swift would like to test against.  Here are some interesting
> >>>> charts from INCA: http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
> >>>>
> >>>> -Stu
> >>>>
> >>>> On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
> >>>>
> >>>>> On Jan 30, 2008, at 09:48 AM, Ben Clifford wrote:
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>> No. The default behaviour when working with a user who is "just
> >>>>>> trying to
> >>>>>> get their stuff to run" is "screw this, use GRAM2 because it  
> >>>>>> works".
> >>>>>>
> >>>>>> It's a self-reinforcing feedback loop that will be broken at the
> >>>>>> point where it becomes easier for people to stick with GRAM4 than to
> >>>>>> fall back to GRAM2. I guess we need to keep trying every now and
> >>>>>> then and hope that one time it sticks ;-)
> >>>>>>
> >>>>>>
> >>>>>> -- 
> >>>>> Well, this works to a point, but if falling back to a technology that
> >>>>> is known not to be scalable at your scale results in killing a
> >>>>> machine, I, as a site admin, will eventually either a) deny you
> >>>>> service, b) shut down the poorly performing service, or c) all of the
> >>>>> above. So it's in your best interest to find and use the
> >>>>> technologies that are best suited to the task at hand, so the users
> >>>>> of your software don't get nailed by (a).
> >>>>>
> >>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM
> >>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to
> >>>>> throttle itself (think of how Apache reports "Server busy. Try again
> >>>>> later") is the best path forward. For the short term, it seems that
> >>>>> the Swift developers should manually find those limits for the sites
> >>>>> their users use regularly, *and* educate their users on how to
> >>>>> recognize that they could be adversely affecting a resource and
> >>>>> throttle themselves until the ideal, automated method is a usable
> >>>>> reality.
> >>>>>
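
A sketch of the client-side throttling Ti describes: cap in-flight
submissions at a hand-tuned per-site limit until a service-reported,
automated limit exists. The limit of 64 and the submit command are purely
illustrative.

    # throttle.py : cap concurrent submissions at a per-site limit.
    from concurrent.futures import ThreadPoolExecutor
    import subprocess

    MAX_IN_FLIGHT = 64   # illustrative limit, found by testing against the site

    def submit(job_cmd):
        # job_cmd is whatever submission command the site expects (assumption)
        return subprocess.call(job_cmd)

    def run_all(jobs):
        # The pool size is the throttle: no more than MAX_IN_FLIGHT
        # submissions are outstanding at any time.
        with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
            return list(pool.map(submit, jobs))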
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> > 
> 



