[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Wed Jan 30 19:15:09 CST 2008


My running such tests will probably mess up the gatekeeper node again. How
do we proceed?
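For reference, the client-side behavior the tiger team page describes (treat a timeout as the signal to back off) could be sketched roughly as below. This is a hypothetical illustration, not actual Swift or GRAM code; `submit_job`, `Timeout`, and all parameters are stand-ins:

```python
import random
import time

class Timeout(Exception):
    """Stand-in for a submission timeout from the service."""
    pass

def submit_with_backoff(submit_job, max_retries=5, base_delay=1.0, cap=60.0):
    """Retry submit_job on Timeout, sleeping an exponentially growing,
    jittered interval between attempts so many clients don't retry in
    lockstep and flood the gatekeeper."""
    for attempt in range(max_retries):
        try:
            return submit_job()
        except Timeout:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter matters as much as the exponential growth: 2000 clients that all retry exactly 2s after a timeout just recreate the original burst.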

On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote:
> I'm saying: run swift tests using GRAM4 and see what you get. Run a
> similar job scenario, e.g. 2000 jobs to the same GRAM4 service. I will
> be interested to see how swift does for performance, scalability,
> errors...
> It's possible that condor-g is not optimal, so seeing how another
> GRAM4 client doing similar job submission scenarios fares would make
> for an interesting comparison.
> 
> -Stu
> 
> On Jan 30, 2008, at 12:37 PM, Mihael Hategan wrote:
> 
> > I'm confused. Why would you want to test GRAM scalability while
> > introducing additional biasing elements, such as Condor-G?
> >
> > On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
> >> All,
> >>
> >> I wanted to chime in on a number of things being discussed here.
> >>
> >> There is a GRAM-RFT-Core reliability group focused on ensuring that
> >> the GRAM service stays up and functional in spite of an onslaught
> >> from a client.  http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
> >>
> >> The ultimate goal here is that a client may get a timeout, and that
> >> would be the signal to back off some.
> >>
> >> -----
> >>
> >> OSG - VO testing: We worked with Terrence (CMS) recently and here are
> >> his test results.
> >> 	http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
> >>
> >> GRAM2 handled this scenario (2000 jobs x 2 condor-g clients to the
> >> same GRAM service) better than GRAM4.  But again, this is with the
> >> condor-g tricks.  Without the tricks, GRAM2 will handle the load better.
> >>
> >> OSG VTB testing: These tests used globusrun-ws and also condor-g.
> >> 	https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
> >>
> >> Clients in these tests got a variety of errors depending on the jobs
> >> run: timeouts, GridFTP authentication errors, client-side OOMs, ...
> >> GRAM4 functioned pretty well, but it was not able to handle Terrence's
> >> scenario.  It did handle 1000 jobs x 1 condor-g client just fine.
> >>
> >> -----
> >>
> >> It would be very interesting to see how swift does with GRAM4.  This
> >> would make for a nice comparison to condor-g.
> >>
> >> As far as having functioning GRAM4 services on TG, things have
> >> improved.  LEAD is using GRAM4 exclusively and we've been working with
> >> them to make sure the GRAM4 services are up and functioning.  INCA has
> >> been updated to more effectively test and monitor the GRAM4 and GridFTP
> >> services that LEAD is targeting.  This could be extended for any hosts
> >> that swift would like to test against.  Here are some interesting
> >> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
> >>
> >> -Stu
> >>
> >> On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
> >>
> >>>
> >>> On Jan 30, 2008, at 9:48 AM, Ben Clifford wrote:
> >>>
> >>> [snip]
> >>>
> >>>> No. The default behaviour when working with a user who is "just
> >>>> trying to get their stuff to run" is "screw this, use GRAM2 because
> >>>> it works".
> >>>>
> >>>> It's a self-reinforcing feedback loop that will be broken at the
> >>>> point where it becomes easier for people to stick with GRAM4 than
> >>>> to default back to GRAM2. I guess we need to keep trying every now
> >>>> and then and hope that one time it sticks ;-)
> >>>>
> >>>> -- 
> >>>
> >>> Well, this works to a point, but if falling back to a technology
> >>> that is known not to be scalable at your sizes results in killing a
> >>> machine, I, as a site admin, will eventually either (a) deny you
> >>> service, (b) shut down the poorly performing service, or (c) all of
> >>> the above. So it's in your best interest to find and use the
> >>> technologies best suited to the task at hand, so the users of your
> >>> software don't get nailed by (a).
> >>>
> >>> In this case it seems to me that the best path forward is using
> >>> WS-GRAM, extending WS-GRAM and/or MDS to report site statistics,
> >>> and/or modifying WS-GRAM to throttle itself (think of how apache
> >>> reports "Server busy. Try again later"). For the short term, the
> >>> Swift developers should manually find those limits for the sites
> >>> their users use regularly, *and* educate their users on how to
> >>> identify when they could be adversely affecting a resource and
> >>> throttle themselves until the ideal, automated method is a usable
> >>> reality.
> >>>
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
> >
> 
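The service-side throttling Ti suggests (an apache-style "Server busy. Try again later" response) could look roughly like the sketch below. This is purely illustrative, with assumed names throughout; it is not GRAM code, and the in-flight limit is an arbitrary example:

```python
import threading

class ThrottledService:
    """Sketch of a submission service that rejects new work once a
    configured number of jobs are in flight, rather than accepting
    submissions until the node falls over."""

    def __init__(self, max_in_flight=100):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def submit(self, job):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                # Tell the client to back off and retry later.
                return ("BUSY", None)
            self.in_flight += 1
        return ("ACCEPTED", job)

    def complete(self, job):
        # Called when a job finishes, freeing a slot.
        with self.lock:
            self.in_flight -= 1
```

An explicit BUSY reply is cheap for the service and unambiguous for the client, which pairs naturally with the timeout-as-backoff-signal idea discussed earlier in the thread.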



