[Swift-devel] ws-gram tests

Fri Feb 8 11:16:30 CST 2008

> Shoot! I didn't know that and thought there would be a container per
> GramJob in that case.

Yep. There was even a bug, not sure if it was fixed, that would mess up
the port for that container on subsequent requests (basically a second
sequential job would start the container on 8443 instead of whatever was
in the port range).

> That's one of the core mysteries with notifications.
> Anyway: I did a quick check some days ago and found that GramJob is
> surprisingly greedy regarding memory as you said. I'll have to further
> check what it is, but will probably not do that before 4.2 is out.

I'll try to profile it today. You should get a license for YJP so that
you can look at the snapshots I might come up with. It's free for OSS
projects (just point them to the globus page that has your name).

> 
> 
> > Due to the above, a reference to the GramJob is kept anyway, regardless
> > of whether that reference is in client code or the local container.
> >
> > I'll try to profile a run and see if I can spot where the problems are.
> >
> >>
> >> Martin
> >>
> >> >>
> >> >> The core team will be looking at improving notifications once their
> >> >> other 4.2 deliverables are done.
> >> >>
> >> >> -Stu
> >> >>
> >> >> Begin forwarded message:
> >> >>
> >> >> > From: feller at mcs.anl.gov
> >> >> > Date: February 1, 2008 9:41:05 AM CST
> >> >> > To: "Jaime Frey" <jfrey at cs.wisc.edu>
> >> >> > Cc: "Stuart Martin" <smartin at mcs.anl.gov>,
> >> >> >     "Terrence Martin" <tmartin at physics.ucsd.edu>,
> >> >> >     "Martin Feller" <feller at mcs.anl.gov>,
> >> >> >     "charles bacon" <bacon at mcs.anl.gov>,
> >> >> >     "Suchandra Thapa" <sthapa at ci.uchicago.edu>,
> >> >> >     "Rob Gardner" <rwg at hep.uchicago.edu>,
> >> >> >     "Jeff Porter" <rjporter at lbl.gov>,
> >> >> >     "Alain Roy" <roy at cs.wisc.edu>,
> >> >> >     "Todd Tannenbaum" <tannenba at cs.wisc.edu>,
> >> >> >     "Miron Livny" <miron at cs.wisc.edu>
> >> >> > Subject: Re: Condor-G WS GRAM memory usage
> >> >> >
> >> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
> >> >> >>
> >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
> >> >> >>>
> >> >> >>>> On Jan 30, 2008, at 11:46 AM, Jaime Frey wrote:
> >> >> >>>>
> >> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM
> >> >> >>>>> raised some concerns about memory usage on the client side. I did
> >> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared
> >> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper
> >> >> >>>>> around the java client libraries for WS GRAM.
> >> >> >>>>>
> >> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a
> >> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data
> >> >> >>>>> transfer. All of the jobs overlapped in submission and execution.
> >> >> >>>>> Here is what I've discovered so far.
> >> >> >>>>>
> >> >> >>>>> Aside from the heap available to the java code, the jvm used 117
> >> >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G
> >> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair.
> >> >> >>>>>
> >> >> >>>>> The maximum jvm heap usage (as reported by the garbage collector)
> >> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was
> >> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete),
> >> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job.
> >> >> >>>>>
> >> >> >>>>> The only long-term memory per job that I know of in the GAHP is
> >> >> >>>>> for the notification sink for job status callbacks. 600kb seems a
> >> >> >>>>> little high for that. Stu, could someone on Globus help us
> >> >> >>>>> determine if we're using the notification sinks inefficiently?
> >> >> >>>>
> >> >> >>>> Martin just looked and for the most part, there is nothing wrong
> >> >> >>>> with how condor-g manages the callback sink.
> >> >> >>>> However, one improvement that would reduce the memory used per job
> >> >> >>>> would be to not have a notification consumer per job.  Instead use
> >> >> >>>> one for all jobs.
> >> >> >>>>
> >> >> >>>> Also, Martin recently did some analysis on condor-g stress tests
> >> >> >>>> and found that notifications are building up in the GRAM4
> >> >> >>>> service container and that is causing delays which seem to be
> >> >> >>>> causing multiple problems.  We're looking at this in a separate
> >> >> >>>> effort with the GT Core team.  But, after this was clear, Martin
> >> >> >>>> re-ran the condor-g test and relied on polling between condor-g and
> >> >> >>>> the GRAM4 service instead of notifications.  Jaime, could you
> >> >> >>>> repeat the no-notification test and see the difference in memory?
> >> >> >>>> The changes would be to increase the polling frequency in condor-g
> >> >> >>>> and comment out the subscribe for notification.  You could also
> >> >> >>>> comment out the notification listener call(s).
> >> >> >>>
> >> >> >>>
> >> >> >>> I did two new sets of tests today. The first used more efficient
> >> >> >>> callback code in the GAHP (one notification consumer rather than one
> >> >> >>> per job). The second disabled notifications and relied on polling
> >> >> >>> for job status changes.
> >> >> >>>
> >> >> >>> The more efficient callback code did not produce a noticeable
> >> >> >>> reduction in memory usage.
> >> >> >>>
> >> >> >>> Disabling notifications did reduce memory usage. The maximum jvm
> >> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum
> >> >> >>> heap usage after job submission and before job completion was about
> >> >> >>> 4 megs + 0.1 megs per job.
> >> >> >>
> >> >> >>
> >> >> >> I ran one more test with the improved callback code. This time, I
> >> >> >> stopped storing the notification producer EPRs associated with the
> >> >> >> GRAM job resources. Memory usage went down markedly.
> >> >> >>
> >> >> >> I was told the client had to explicitly destroy these server-side
> >> >> >> notification producer resources when it destroys the job, otherwise
> >> >> >> they hang around bogging down the server. Is this still the case? The
> >> >> >> server can't destroy notification producers when their sources of
> >> >> >> information are destroyed?
> >> >> >>
> >> >> >
> >> >> > This reminds me of the odd fact that I suddenly had to grant much more
> >> >> > memory to Condor-G as soon as Condor-G started storing EPRs of
> >> >> > subscription resources to be able to destroy them eventually.
> >> >> > Those EPRs are maybe not as tiny as they look.
> >> >> >
> >> >> > For 4.0: yes, currently you'll have to store and eventually destroy
> >> >> > subscription resources manually to avoid heaping up persistence data
> >> >> > on the server-side.
> >> >> > For 4.2: no, you won't have to store them. A job resource will
> >> >> > destroy all subscription resources when it's destroyed.
> >> >> >
> >> >> > Overall I suggest concentrating on 4.2 GRAM, since the "container
> >> >> > hangs in job destruction" problem won't exist anymore.
> >> >> >
> >> >> > Sorry, Jaime, I still can't provide you with 100% reliable 4.2 changes
> >> >> > in Gram. I'll do so as soon as I can. I wonder if it makes sense
> >> >> > for us to do the 4.2-related changes in Gahp and hand it to you for
> >> >> > fine-tuning then?
> >> >> >
> >> >> > Martin
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Feb 8, 2008, at 9:19 AM, Ian Foster wrote:
> >> >>
> >> >> > Mihael:
> >> >> >
> >> >> > That's great, thanks!
> >> >> >
> >> >> > Ian.
> >> >> >
> >> >> > Mihael Hategan wrote:
> >> >> >> I did a 1024 job run today with ws-gram.
> >> >> >> I painted the results here:
> >> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html
> >> >> >>
> >> >> >> Seems like client memory per job is about 370k. Which is quite a lot.
> >> >> >> What kinda worries me is that it doesn't seem to go down after the
> >> >> >> jobs are done, so maybe there's a memory leak, or maybe the garbage
> >> >> >> collector doesn't do any major collections. I'll need to profile this
> >> >> >> to see exactly what we're talking about.
> >> >> >>
> >> >> >> The container memory is figured by looking at the process in /proc.
> >> >> >> It's total memory including shared libraries and things. But libraries
> >> >> >> take a fixed amount of space, so a fuzzy correlation can probably be
> >> >> >> made. It looks quite similar to the amount of memory eaten on the
> >> >> >> client side (per job).
> >> >> >>
> >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time
> >> >> >> the jobs are submitted, but the machine itself seems responsive. I
> >> >> >> have yet to plot the exact submission time for each job.
> >> >> >>
> >> >> >> So at this point I would recommend trying ws-gram as long as there
> >> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and
> >> >> >> while making sure the jvm has enough heap. More than that seems like
> >> >> >> a gamble.
> >> >> >>
> >> >> >> Mihael
> >> >> >>
> >> >> >> _______________________________________________
> >> >> >> Swift-devel mailing list
> >> >> >> Swift-devel at ci.uchicago.edu
> >> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >>
> >
> >
> 
>
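
For a rough sense of scale, the per-job figures quoted in this thread can
be turned into a back-of-the-envelope estimate. The sketch below is only an
illustration in Python, not anything taken from the GAHP or Swift code: the
names HeapModel, MODELS and client_memory_mb are made up for this example.
The "base plus per-job" fits come from tests with at most 30 jobs, so using
them beyond that range is an assumption, and "370k" is read here as 370 KiB
per job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapModel:
    """Linear heap model: base_mb + per_job_mb * jobs (hypothetical helper)."""
    label: str
    base_mb: float
    per_job_mb: float

    def estimate_mb(self, jobs: int) -> float:
        # Linear extrapolation; only measured for up to 30 jobs in the thread.
        return self.base_mb + self.per_job_mb * jobs


# Figures as reported for the Condor-G WS GRAM GAHP server tests.
MODELS = [
    HeapModel("notifications, peak", 9.0, 0.9),
    HeapModel("notifications, quiescent", 5.0, 0.6),
    HeapModel("polling only, peak", 8.0, 0.5),
    HeapModel("polling only, quiescent", 4.0, 0.1),
]


def client_memory_mb(jobs: int, per_job_kb: float = 370.0) -> float:
    """Client memory for `jobs` jobs, assuming ~370 KiB per job (1024-job run)."""
    return jobs * per_job_kb / 1024.0


if __name__ == "__main__":
    for model in MODELS:
        print(f"{model.label:26s} {model.estimate_mb(30):7.1f} MB at 30 jobs")
    for jobs in (1024, 4000):
        print(f"{jobs:5d} jobs at 370 KiB each: {client_memory_mb(jobs):8.1f} MB")
```

Under these assumptions, 4000 jobs at 370 KiB each comes to roughly 1.4 GB
of client memory, which is why the jvm heap setting matters at that scale.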