[Swift-devel] ws-gram tests

Mihael Hategan hategan at mcs.anl.gov
Fri Feb 8 09:46:42 CST 2008


On Fri, 2008-02-08 at 09:33 -0600, Stuart Martin wrote:
> Mihael,
> 
> Glad to hear things are improved with GRAM4. Let's keep going to get
> Swift using GRAM4 routinely.

You're being a bit assertive there.

> 
> Below is a recent thread that looked at this exact issue with
> Condor-G. But it is entirely relevant to your use of GRAM4. The two
> issues to look for are:
> 
> 1) your use of notifications
> 
> >> I ran one more test with the improved callback code. This time, I
> >> stopped storing the notification producer EPRs associated with the
> >> GRAM job resources. Memory usage went down markedly.
> 
> 2) you could avoid notifications and instead do client-side polling
> for job state. This has been shown to be more reliable than
> notifications under heavy loads, e.g. Condor-G processing thousands of
> jobs.

These are both hacks. I'm not sure I want to go there. 300K per job is a
bit too much considering that swift (which has to consider many more
things) has less than 10K overhead per job.
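
(For illustration only: a minimal sketch of the client-side polling approach
Stu suggests above. The WsGramClient interface below is a hypothetical
placeholder, not the actual GT4 client or Condor-G GAHP API.)

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical client interface; the real GT4 Java client classes differ.
    interface WsGramClient {
        String getJobState(String jobHandle) throws Exception; // one remote status query
    }

    /** Polls tracked jobs for state changes instead of subscribing to notifications. */
    class JobStatePoller {
        private final WsGramClient client;
        private final Map<String, String> lastState = new HashMap<String, String>();

        JobStatePoller(WsGramClient client) { this.client = client; }

        /** One polling pass over all tracked jobs; returns handles whose state changed.
            A job is also reported the first time it is seen. */
        List<String> pollOnce(Collection<String> jobHandles) throws Exception {
            List<String> changed = new ArrayList<String>();
            for (String handle : jobHandles) {
                String state = client.getJobState(handle);
                if (!state.equals(lastState.put(handle, state))) {
                    changed.add(handle);
                }
            }
            return changed;
        }
    }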

> 
> The core team will be looking at improving notifications once their  
> other 4.2 deliverables are done.
> 
> -Stu
> 
> Begin forwarded message:
> 
> > From: feller at mcs.anl.gov
> > Date: February 1, 2008 9:41:05 AM CST
> > To: "Jaime Frey" <jfrey at cs.wisc.edu>
> > Cc: "Stuart Martin" <smartin at mcs.anl.gov>, "Terrence Martin" <tmartin at physics.ucsd.edu>,
> > "Martin Feller" <feller at mcs.anl.gov>, "charles bacon" <bacon at mcs.anl.gov>,
> > "Suchandra Thapa" <sthapa at ci.uchicago.edu>, "Rob Gardner" <rwg at hep.uchicago.edu>,
> > "Jeff Porter" <rjporter at lbl.gov>, "Alain Roy" <roy at cs.wisc.edu>,
> > "Todd Tannenbaum" <tannenba at cs.wisc.edu>, "Miron Livny" <miron at cs.wisc.edu>
> > Subject: Re: Condor-G WS GRAM memory usage
> >
> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
> >>
> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
> >>>
> >>>> On Jan 30, 2008, at 11:46 AM, Jaime Frey wrote:
> >>>>
> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM
> >>>>> raised some concerns about memory usage on the client side. I did
> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared
> >>>>> to be the primary memory consumer. The GAHP server is a wrapper
> >>>>> around the java client libraries for WS GRAM.
> >>>>>
> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a
> >>>>> time. The jobs were 2-minute sleep jobs with minimal data
> >>>>> transfer. All of the jobs overlapped in submission and execution.
> >>>>> Here is what I've discovered so far.
> >>>>>
> >>>>> Aside from the heap available to the java code, the jvm used 117
> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G
> >>>>> creates one GAHP server for each (local uid, X509 DN) pair.
> >>>>>
> >>>>> The maximum jvm heap usage (as reported by the garbage collector)
> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was
> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete),
> >>>>> heap usage was about 5 megs plus 0.6 megs per job.
> >>>>>
> >>>>> The only long-term memory per job that I know of in the GAHP is
> >>>>> for the notification sink for job status callbacks. 600kb seems a
> >>>>> little high for that. Stu, could someone on Globus help us
> >>>>> determine if we're using the notification sinks inefficiently?
> >>>>
> >>>> Martin just looked, and for the most part there is nothing wrong
> >>>> with how Condor-G manages the callback sink.
> >>>> However, one improvement that would reduce the memory used per job
> >>>> would be to not have a notification consumer per job; instead, use
> >>>> one for all jobs.
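
(Side note, purely illustrative: the "one consumer for all jobs" idea above
could look roughly like the sketch below. JobStateListener and
SharedNotificationSink are made-up names, not the GT4 or Condor-G GAHP classes.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical callback interface; the real GT4 notification consumer API differs.
    interface JobStateListener {
        void stateChanged(String jobId, String newState);
    }

    /** One shared notification sink that dispatches to per-job listeners,
        instead of creating a separate consumer (and its resources) per job. */
    class SharedNotificationSink {
        private final Map<String, JobStateListener> listeners =
                new ConcurrentHashMap<String, JobStateListener>();

        void register(String jobId, JobStateListener listener) { listeners.put(jobId, listener); }

        void unregister(String jobId) { listeners.remove(jobId); }

        /** Called once per incoming notification, whichever job it belongs to. */
        void onNotification(String jobId, String newState) {
            JobStateListener listener = listeners.get(jobId);
            if (listener != null) {
                listener.stateChanged(jobId, newState);
            }
        }
    }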
> >>>>
> >>>> Also, Martin recently did some analysis of Condor-G stress tests
> >>>> and found that notifications are building up in the GRAM4 service
> >>>> container, and that the resulting delays seem to be causing
> >>>> multiple problems.  We're looking at this in a separate effort with
> >>>> the GT Core team.  But after this became clear, Martin re-ran the
> >>>> Condor-G test and relied on polling between Condor-G and the GRAM4
> >>>> service instead of notifications.  Jaime, could you repeat the
> >>>> no-notification test and see the difference in memory? The changes
> >>>> would be to increase the polling frequency in Condor-G and comment
> >>>> out the subscription for notifications.  You could also comment out
> >>>> the notification listener call(s).
> >>>
> >>>
> >>> I did two new sets of tests today. The first used more efficient
> >>> callback code in the GAHP (one notification consumer rather than one
> >>> per job). The second disabled notifications and relied on polling
> >>> for job status changes.
> >>>
> >>> The more efficient callback code did not produce a noticeable
> >>> reduction in memory usage.
> >>>
> >>> Disabling notifications did reduce memory usage. The maximum jvm
> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum
> >>> heap usage after job submission and before job completion was about
> >>> 4 megs + 0.1 megs per job.
> >>
> >>
> >> I ran one more test with the improved callback code. This time, I
> >> stopped storing the notification producer EPRs associated with the
> >> GRAM job resources. Memory usage went down markedly.
> >>
> >> I was told the client had to explicitly destroy these server-side
> >> notification producer resources when it destroys the job, otherwise
> >> they hang around bogging down the server. Is this still the case? The
> >> server can't destroy notification producers when their sources of
> >> information are destroyed?
> >>
> >
> > This reminds me of the odd fact that I suddenly had to grant much more
> > memory to Condor-G as soon as it started storing the EPRs of
> > subscription resources in order to be able to destroy them eventually.
> > Those EPRs may not be as tiny as they look.
> >
> > For 4.0: yes, currently you'll have to store and eventually destroy
> > subscription resources manually to avoid piling up persistence data
> > on the server side.
> > For 4.2: no, you won't have to store them. A job resource will
> > destroy all of its subscription resources when it's destroyed.
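
(Illustrative sketch only: one way to implement the "store the subscription
EPRs and destroy them when the job is destroyed" pattern described above for
4.0. WsrfResource and its destroy() are hypothetical stand-ins for the real
WS-ResourceLifetime client calls.)

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a WSRF resource stub supporting
    // WS-ResourceLifetime Destroy; the real GT4 generated stubs differ.
    interface WsrfResource {
        void destroy() throws Exception;
    }

    /** Tracks a job plus the subscription resources created for it (GT 4.0 style). */
    class TrackedJob {
        private final WsrfResource jobResource;
        private final List<WsrfResource> subscriptions = new ArrayList<WsrfResource>();

        TrackedJob(WsrfResource jobResource) { this.jobResource = jobResource; }

        /** Remember each subscription so it can be cleaned up later. */
        void addSubscription(WsrfResource subscription) { subscriptions.add(subscription); }

        /** 4.0: destroy the subscriptions explicitly, then the job itself.
            (In 4.2, per the mail above, destroying the job is said to be enough.) */
        void cleanup() throws Exception {
            for (WsrfResource s : subscriptions) {
                s.destroy();
            }
            subscriptions.clear();
            jobResource.destroy();
        }
    }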
> >
> > Overall I suggest concentrating on 4.2 GRAM, since the "container
> > hangs in job destruction" problem won't exist anymore.
> >
> > Sorry, Jaime, I still can't provide you with a 100% reliable list of
> > the GRAM changes in 4.2. I'll do so as soon as I can. I wonder if it
> > makes sense for us to do the 4.2-related changes in the Gahp and hand
> > it to you for fine-tuning then?
> >
> > Martin
> 
> 
> 
> 
> On Feb 8, 2008, at 9:19 AM, Ian Foster wrote:
> 
> > Mihael:
> >
> > That's great, thanks!
> >
> > Ian.
> >
> > Mihael Hategan wrote:
> >> I did a 1024 job run today with ws-gram.
> >> I painted the results here:
> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html
> >>
> >> Seems like client memory per job is about 370k, which is quite a lot.
> >> What kinda worries me is that it doesn't seem to go down after the
> >> jobs are done, so maybe there's a memory leak, or maybe the garbage
> >> collector doesn't do any major collections. I'll need to profile this
> >> to see exactly what we're talking about.
> >>
> >> The container memory is measured by looking at the process in /proc.
> >> It's the total memory, including shared libraries and the like. But
> >> libraries take a fixed amount of space, so a fuzzy correlation can
> >> probably be made. It looks quite similar to the amount of memory
> >> eaten on the client side (per job).
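
(Rough sketch, for reference: the kind of /proc-based reading described above,
taking total, resident, and shared sizes from /proc/<pid>/statm on Linux. The
class name is made up; field meanings are per proc(5), and a 4 KB page size is
assumed.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    /** Rough per-process memory figures from /proc/<pid>/statm (Linux only). */
    class ProcMem {
        /** Returns {totalKb, residentKb, sharedKb} for the given pid. */
        static long[] readKb(int pid) throws IOException {
            BufferedReader r = new BufferedReader(new FileReader("/proc/" + pid + "/statm"));
            try {
                // statm fields are in pages: size resident shared text lib data dt
                String[] f = r.readLine().trim().split("\\s+");
                long pageKb = 4; // assumption; strictly this should come from the system page size
                return new long[] {
                    Long.parseLong(f[0]) * pageKb,  // total virtual size
                    Long.parseLong(f[1]) * pageKb,  // resident set
                    Long.parseLong(f[2]) * pageKb   // shared pages
                };
            } finally {
                r.close();
            }
        }
    }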
> >>
> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time  
> >> the
> >> jobs are submitted, but the machine itself seems responsive. I have  
> >> yet
> >> to plot the exact submission time for each job.
> >>
> >> So at this point I would recommend trying ws-gram as long as there
> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and  
> >> while
> >> making sure the jvm has enough heap. More than that seems like a  
> >> gamble.
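
(Back-of-the-envelope check, using figures from this thread: at roughly 370 KB
of client heap per job, 4000 concurrent jobs would need on the order of
370 KB × 4000, i.e. about 1.4 GB of heap, on top of the JVM's own ~120 MB
footprint reported earlier, which is why the heap-size caveat matters.)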
> >>
> >> Mihael
> >>



