[Swift-devel] Support request: Swift jobs flooding uc-teragrid?

Mihael Hategan hategan at mcs.anl.gov
Fri Feb 1 19:07:02 CST 2008


Nice. I can see other people's jobs:

hategan@tg-grid1:~> cat /soft/prews-gram-4.0.1-r3/tmp/gram_job_state/job.tg-grid1.uc.teragrid.org.9324.1184364728
https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/
  12
 128
   0
1460443.tg-master.uc.teragrid.org
&(rsl_substitution=(GRIDMANAGER_GASS_URL
https://sidgrid.ci.uchicago.edu:60651))(executable='/home/skenny/vds_32/bin/kickstart')(directory='/home/skenny/sidgrid_out/skenny/skenny/wf_test/run0001')(arguments=-n upload::uploader -N sidgrid::UploadClient -R ANLUCTERAGRID32 /home/skenny/sidgrid/soft/upload/uploader skenny wf_test graspB.lh.forperm.txt_260.output graspB.lh.forperm.txt_261.output graspB.lh.forperm.txt_262.output graspB.lh.forperm.txt_263.output graspB.lh.forperm.txt_264.output graspB.lh.forperm.txt_265.output)(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/ci/sidgrid.ci.uchicago.edu/htdocs/sidgrid/sidgrid_test_server/sidgrid/transformations/sidgridUsers/skenny/wf_test/run0001/uploader_ID000007.err'))(environment=(app '/app/osg_app')(data '/home/skenny/data')(tmp '/tmp')(wntmp '/tmp'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))(jobtype=single)(maxwalltime=2400)
https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/
...
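
To get a sense of how much is exposed, here is a minimal sketch that
tallies the executables named in every visible job state file (the
directory and the RSL attribute name are taken from the dump above;
everything else is hypothetical):

    # Sketch: tally executables visible in other users' GRAM2 job state
    # files, assuming the world-readable layout shown above.
    import glob
    import re
    from collections import defaultdict

    STATE_DIR = "/soft/prews-gram-4.0.1-r3/tmp/gram_job_state"

    counts = defaultdict(int)
    for path in glob.glob(STATE_DIR + "/job.*"):
        try:
            with open(path) as f:
                rsl = f.read()
        except IOError:
            continue  # not readable after all; skip it
        m = re.search(r"\(executable=([^)]*)\)", rsl)
        if m:
            counts[m.group(1).strip("'")] += 1

    for exe, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%5d  %s" % (n, exe))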


On Wed, 2008-01-30 at 21:35 -0600, Mihael Hategan wrote:
> Sure, I'd do that anyway to test the testing script(s)/process. I mean,
> if I do mess it up, I want to make sure I only need to do it once.
> 
> But I'm thinking it's better to agree on a time than for Joe or Ti or
> JP to suddenly wonder what's going on.
> 
> On the other hand, seeing many processes in my name will probably
> eliminate the confusion :)
> 
> On Wed, 2008-01-30 at 21:23 -0600, Michael Wilde wrote:
> > I suggested we start the tests at a moderate intensity, and record the 
> > impact on CPU, mem, qlength, etc.
> > 
> > Then ramp up until those indicators start to suggest that the 
> > gatekeeper is under strain.
> > 
> > It's not 100% foolproof, but it's better than blind stress testing.
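> > 
> > Roughly, here is a minimal sketch of the ramp-up loop I have in mind
> > (submit_jobs() and LOAD_LIMIT are hypothetical placeholders; load
> > average stands in for the CPU/mem/qlength indicators we'd record):
> > 
> >     # Sketch: double the batch each round, record the impact, and stop
> >     # once the gatekeeper looks strained. A real run would also record
> >     # memory and queue length, not just load average.
> >     import os
> >     import time
> > 
> >     LOAD_LIMIT = 8.0  # assumed strain threshold for this host
> > 
> >     def submit_jobs(n):
> >         pass  # placeholder: submit n jobs through the client under test
> > 
> >     batch = 10
> >     while True:
> >         submit_jobs(batch)
> >         time.sleep(60)  # let the load settle before sampling
> >         load1, load5, _ = os.getloadavg()
> >         print("batch=%d load1=%.1f load5=%.1f" % (batch, load1, load5))
> >         if load1 > LOAD_LIMIT:
> >             print("under strain; stopping at batch=%d" % batch)
> >             break
> >         batch *= 2  # ramp up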
> > 
> > - mike
> > 
> > 
> > On 1/30/08 7:15 PM, Mihael Hategan wrote:
> > > Me doing such tests will probably mess up the gatekeeper node again. How
> > > do we proceed?
> > > 
> > > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote:
> > >> I'm saying: run Swift tests using GRAM4 and see what you get.  Run a
> > >> similar job scenario, e.g. 2000 jobs to the same GRAM4 service.  I will
> > >> be interested to see how Swift does in terms of performance,
> > >> scalability, errors...
> > >> It's possible that condor-g is not optimal, so seeing how another
> > >> GRAM4 client doing similar job submission scenarios fares would make
> > >> for an interesting comparison.
> > >>
> > >> -Stu
> > >>
> > >> On Jan 30, 2008, at 12:37 PM, Mihael Hategan wrote:
> > >>
> > >>> I'm confused. Why would you want to test GRAM scalability while
> > >>> introducing additional biasing elements, such as Condor-G?
> > >>>
> > >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
> > >>>> All,
> > >>>>
> > >>>> I wanted to chime in on a number of things being discussed here.
> > >>>>
> > >>>> There is a GRAM-RFT-Core reliability tiger team focused on ensuring
> > >>>> that the GRAM service stays up and functional in spite of an onslaught
> > >>>> from a client.  http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
> > >>>>
> > >>>> The ultimate goal here is that a client may get a timeout, and that
> > >>>> would be the signal to back off some.
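> > >>>>
> > >>>> As a rough sketch, that client-side behavior could look like this
> > >>>> (submit_job() is a hypothetical stand-in for the real client call,
> > >>>> not an actual GRAM4 API):
> > >>>>
> > >>>>     # Sketch: treat a service timeout as the signal to back off,
> > >>>>     # with exponential delay plus jitter.
> > >>>>     import random
> > >>>>     import time
> > >>>>
> > >>>>     def submit_job(job):
> > >>>>         raise NotImplementedError("stand-in for the real client call")
> > >>>>
> > >>>>     def submit_with_backoff(job, max_tries=6, base_delay=2.0):
> > >>>>         delay = base_delay
> > >>>>         for attempt in range(max_tries):
> > >>>>             try:
> > >>>>                 return submit_job(job)
> > >>>>             except TimeoutError:  # the "please back off" signal
> > >>>>                 time.sleep(delay + random.uniform(0, delay))
> > >>>>                 delay *= 2  # exponential backoff
> > >>>>         raise RuntimeError("service still overloaded; giving up")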
> > >>>>
> > >>>> -----
> > >>>>
> > >>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are
> > >>>> his test results.
> > >>>> 	http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
> > >>>>
> > >>>> GRAM2 handled this scenario (2000 jobs x 2 condor-g clients to the
> > >>>> same GRAM service) better than GRAM4.  But again, this is with the
> > >>>> condor-g tricks.  Without the tricks, GRAM4 would handle the load better.
> > >>>>
> > >>>> OSG VTB testing: these tests used globusrun-ws and also condor-g.
> > >>>> 	https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
> > >>>>
> > >>>> Clients in these tests got a variety of errors depending on the jobs
> > >>>> run: timeouts, GridFTP authentication errors, client-side OOM, ...
> > >>>> GRAM4 functioned pretty well, but it was not able to handle Terrence's
> > >>>> scenario.  It handled 1000 jobs x 1 condor-g client just fine, though.
> > >>>>
> > >>>> -----
> > >>>>
> > >>>> It would be very interesting to see how Swift does with GRAM4.  This
> > >>>> would make for a nice comparison to condor-g.
> > >>>>
> > >>>> As far as having functioning GRAM4 services on TG, things have
> > >>>> improved.  LEAD is using GRAM4 exclusively, and we've been working
> > >>>> with them to make sure the GRAM4 services are up and functioning.
> > >>>> INCA has been updated to more effectively test and monitor the GRAM4
> > >>>> and GridFTP services that LEAD is targeting.  This could be extended
> > >>>> to any hosts that Swift would like to test against.  Here are some
> > >>>> interesting charts from INCA: http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
> > >>>>
> > >>>> -Stu
> > >>>>
> > >>>> On Jan 30, 2008, at 10:00 AM, Ti Leggett wrote:
> > >>>>
> > >>>>> On Jan 30, 2008, at 09:48 AM, Ben Clifford wrote:
> > >>>>>
> > >>>>> [snip]
> > >>>>>
> > >>>>>> No. The default behaviour when working with a user who is "just
> > >>>>>> trying to get their stuff to run" is "screw this, use GRAM2
> > >>>>>> because it works".
> > >>>>>>
> > >>>>>> It's a self-reinforcing feedback loop that will be broken at the
> > >>>>>> point where it becomes easier for people to stick with GRAM4 than
> > >>>>>> to default back to GRAM2. I guess we need to keep trying every now
> > >>>>>> and then and hope that one time it sticks ;-)
> > >>>>>>
> > >>>>>> -- 
> > >>>>> Well, this works to a point, but if falling back to a technology
> > >>>>> that is known not to be scalable at your scale results in killing a
> > >>>>> machine, I, as a site admin, will eventually either a) deny you
> > >>>>> service, b) shut down the poorly performing service, or c) all of
> > >>>>> the above. So it's in your best interest to find and use the
> > >>>>> technologies that are best suited to the task at hand, so the users
> > >>>>> of your software don't get nailed by (a).
> > >>>>>
> > >>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM
> > >>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to
> > >>>>> throttle itself (think of how apache reports "Server busy. Try again
> > >>>>> later") is the best path forward. For the short term, it seems that
> > >>>>> the Swift developers should manually find those limits for the sites
> > >>>>> their users hit regularly, *and* educate their users on how to
> > >>>>> recognize that they could be adversely affecting a resource and
> > >>>>> throttle themselves until the ideal, automated method is a usable
> > >>>>> reality.
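> > >>>>>
> > >>>>> To make that concrete, a minimal sketch of the kind of client-side
> > >>>>> throttle I mean (MAX_IN_FLIGHT and run_job() are hypothetical; the
> > >>>>> real cap would be whatever per-site limit gets measured):
> > >>>>>
> > >>>>>     # Sketch: cap concurrent submissions to a site with a semaphore.
> > >>>>>     import threading
> > >>>>>
> > >>>>>     MAX_IN_FLIGHT = 32  # assumed per-site submission cap
> > >>>>>     slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
> > >>>>>
> > >>>>>     def run_job(job):
> > >>>>>         pass  # placeholder for the actual submit-and-wait logic
> > >>>>>
> > >>>>>     def throttled(job):
> > >>>>>         with slots:  # blocks while the site is at its cap
> > >>>>>             run_job(job)
> > >>>>>
> > >>>>>     threads = [threading.Thread(target=throttled, args=(j,))
> > >>>>>                for j in range(100)]
> > >>>>>     for t in threads:
> > >>>>>         t.start()
> > >>>>>     for t in threads:
> > >>>>>         t.join()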
> > >>>>>
> > 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 



