[Swift-devel] ws-gram tests

Fri Feb 8 13:46:00 CST 2008

> Thanks. I'll give it a try as people head home for the weekend and the
> heat in the queues is allowed to dissipate.
>
> My profiler says that some hefty amount of heap is used by a relatively
> low number of EndpointReferenceType objects. Btw, where do I get the
> sources for addressing?

It's included as a jar in wsrf, but you can also see the sources by
extracting java/lib-src/ws-addressing/ws-addressing.tar.gz of the
wsrf package.

so:
cvs co -r globus_4_0_6 wsrf
cd wsrf/java/lib-src/ws-addressing/
...

And yes, it seems to be the objects of type EndpointReferenceType.
Those seem to be beasts. Rachana once mentioned that they should be
trimmed when you get them from the stubs because they contain "SOAP crap".

GramJob used to store the job EPR and the subscription EPR exactly as
they came back from the call to the factory stub.

In the new jar, GramJob objects store trimmed EPRs (produced by
ObjectSerializer.clone(eprObject)) instead of the raw ones.

Martin
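
As a rough cross-check of the figures quoted in this thread, the per-job
arithmetic can be sketched in Python. This is only a back-of-the-envelope
estimate: the helper names are made up for illustration, the inputs are the
approximate numbers reported in the messages below, and treating the 7MB
baseline as a fixed cost is an assumption based on the guess that it belongs
to the NotificationConsumerManager.

```python
# Back-of-the-envelope check of the memory figures quoted in this thread.
# All inputs are approximate numbers taken from the messages; the helper
# names are illustrative and not part of any Globus or Swift API.

KB = 1024
MB = 1024 * KB


def per_job_bytes(total_bytes, fixed_bytes, jobs):
    """Average heap per job after subtracting an assumed fixed overhead."""
    return (total_bytes - fixed_bytes) / jobs


def projected_heap_mb(per_job, jobs, fixed_bytes=0):
    """Projected heap in MB for a given number of concurrent jobs."""
    return (per_job * jobs + fixed_bytes) / MB


# Trimmed-EPR jar: about 2MB of growth per 100 GramJob objects.
trimmed = per_job_bytes(2 * MB, 0, 100)

# Earlier test with 100 jobs: 23MB before the fix, 8MB after the fix,
# with about 7MB assumed to be a fixed cost paid from the first job.
before_fix = per_job_bytes(23 * MB, 7 * MB, 100)
after_fix = per_job_bytes(8 * MB, 7 * MB, 100)

# The 1024-job run reported about 370k of client memory per job.
unfixed_4000 = projected_heap_mb(370 * KB, 4000)
trimmed_4000 = projected_heap_mb(trimmed, 4000)

print(f"trimmed jar:      {trimmed / KB:.1f} KB per job")
print(f"before the fix:   {before_fix / KB:.1f} KB per job")
print(f"after the fix:    {after_fix / KB:.1f} KB per job")
print(f"4000 jobs at 370k: {unfixed_4000:.0f} MB")
print(f"4000 jobs trimmed: {trimmed_4000:.0f} MB")
```

Under these assumptions, 4000 parallel jobs at 370k each would need well
over 1GB of client heap, while the trimmed-EPR figure would keep the same
load under 100MB, not counting any fixed overhead.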

> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote:
>> Try the attached 4.0-compliant jar in your tests by dropping
>> it in your 4.0.x $GLOBUS_LOCATION/lib.
>> My tests showed about 2MB memory increase per 100 GramJob
>> objects, which sounds to me like a reasonable number (about 20k
>> per GramJob object, ignoring the notification consumer manager
>> in one job, if my calculations are right).
>>
>> Martin
>>
>> >
>> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote:
>> >> Mihael,
>> >>
>> >> I think I found the memory hole in GramJob.
>> >> 100 jobs in a test of mine consumed about 23MB (constantly
>> >> growing) before the fix and 8MB (very slowly growing) after
>> >> the fix. The big part of that (7MB) is used right from the
>> >> first job, which may be the NotificationConsumerManager.
>> >> I will commit that change to the 4.0 branch soon and you may
>> >> try it then.
>> >> Are you using 4.0.x in your tests?
>> >
>> > Yes. If there are no API changes, you can send me the jar file. I
>> > don't have enough knowledge to selectively build WS-GRAM, nor enough
>> > disk space to build the whole GT.
>> >
>> >>
>> >> Martin
>> >>
>> >> >>> >
>> >> >>> > These are both hacks. I'm not sure I want to go there. 300K per
>> >> >>> > job is a bit too much considering that swift (which has to
>> >> >>> > consider many more things) has less than 10K overhead per job.
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> For my better understanding:
>> >> >>> Do you start up your own notification consumer manager that
>> >> >>> listens for notifications of all jobs, or do you let each GramJob
>> >> >>> instance listen for notifications itself?
>> >> >>> In case you listen for notifications yourself: do you store
>> >> >>> GramJob objects, or just the EPRs of jobs, creating GramJob
>> >> >>> objects if needed?
>> >> >>
>> >> >> Excellent points. I let each GramJob instance listen for
>> >> >> notifications itself. What I observed is that it uses only one
>> >> >> container for that.
>> >> >>
>> >> >
>> >> > Shoot! I didn't know that and thought there would be a container
>> >> > per GramJob in that case. That's one of the core mysteries of
>> >> > notifications.
>> >> > Anyway: I did a quick check some days ago and found that GramJob is
>> >> > surprisingly greedy regarding memory, as you said. I'll have to
>> >> > check further what it is, but will probably not do that before 4.2
>> >> > is out.
>> >> >
>> >> >
>> >> >> Due to the above, a reference to the GramJob is kept anyway,
>> >> >> regardless of whether that reference is in client code or the
>> >> >> local container.
>> >> >>
>> >> >> I'll try to profile a run and see if I can spot where the problems
>> >> >> are.
>> >> >>
>> >> >>>
>> >> >>> Martin
>> >> >>>
>> >> >>> >>
>> >> >>> >> The core team will be looking at improving notifications once
>> >> >>> >> their other 4.2 deliverables are done.
>> >> >>> >>
>> >> >>> >> -Stu
>> >> >>> >>
>> >> >>> >> Begin forwarded message:
>> >> >>> >>
>> >> >>> >> > From: feller at mcs.anl.gov
>> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST
>> >> >>> >> > To: "Jaime Frey" <jfrey at cs.wisc.edu>
>> >> >>> >> > Cc: "Stuart Martin" <smartin at mcs.anl.gov>,
>> >> >>> >> >     "Terrence Martin" <tmartin at physics.ucsd.edu>,
>> >> >>> >> >     "Martin Feller" <feller at mcs.anl.gov>,
>> >> >>> >> >     "charles bacon" <bacon at mcs.anl.gov>,
>> >> >>> >> >     "Suchandra Thapa" <sthapa at ci.uchicago.edu>,
>> >> >>> >> >     "Rob Gardner" <rwg at hep.uchicago.edu>,
>> >> >>> >> >     "Jeff Porter" <rjporter at lbl.gov>,
>> >> >>> >> >     "Alain Roy" <roy at cs.wisc.edu>,
>> >> >>> >> >     "Todd Tannenbaum" <tannenba at cs.wisc.edu>,
>> >> >>> >> >     "Miron Livny" <miron at cs.wisc.edu>
>> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage
>> >> >>> >> >
>> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
>> >> >>> >> >>
>> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
>> >> >>> >> >>>
>> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote:
>> >> >>> >> >>>>
>> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS
>> >> >>> >> >>>>> GRAM raised some concerns about memory usage on the client
>> >> >>> >> >>>>> side. I did some profiling of Condor-G's WS GRAM GAHP
>> >> >>> >> >>>>> server, which appeared to be the primary memory consumer.
>> >> >>> >> >>>>> The GAHP server is a wrapper around the java client
>> >> >>> >> >>>>> libraries for WS GRAM.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs, up to 30
>> >> >>> >> >>>>> at a time. The jobs were 2-minute sleep jobs with minimal
>> >> >>> >> >>>>> data transfer. All of the jobs overlapped in submission and
>> >> >>> >> >>>>> execution. Here is what I've discovered so far.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm
>> >> >>> >> >>>>> used 117 megs of non-shared memory and 74 megs of shared
>> >> >>> >> >>>>> memory. Condor-G creates one GAHP server for each (local
>> >> >>> >> >>>>> uid, X509 DN) pair.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage
>> >> >>> >> >>>>> collector) was about 9 megs plus 0.9 megs per job. When the
>> >> >>> >> >>>>> GAHP was quiescent (jobs executing, Condor-G waiting for
>> >> >>> >> >>>>> them to complete), heap usage was about 5 megs plus 0.6
>> >> >>> >> >>>>> megs per job.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> The only long-term memory per job that I know of in the
>> >> >>> >> >>>>> GAHP is for the notification sink for job status callbacks.
>> >> >>> >> >>>>> 600kb seems a little high for that. Stu, could someone on
>> >> >>> >> >>>>> Globus help us determine if we're using the notification
>> >> >>> >> >>>>> sinks inefficiently?
>> >> >>> >> >>>>
>> >> >>> >> >>>> Martin just looked and for the most part, there is nothing
>> >> >>> >> >>>> wrong with how condor-g manages the callback sink.
>> >> >>> >> >>>> However, one improvement that would reduce the memory used
>> >> >>> >> >>>> per job would be to not have a notification consumer per
>> >> >>> >> >>>> job. Instead use one for all jobs.
>> >> >>> >> >>>>
>> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress
>> >> >>> >> >>>> tests and found that notifications are building up in the
>> >> >>> >> >>>> GRAM4 service container, and that is causing delays which
>> >> >>> >> >>>> seem to be causing multiple problems. We're looking at this
>> >> >>> >> >>>> in a separate effort with the GT Core team. But, after this
>> >> >>> >> >>>> was clear, Martin re-ran the condor-g test and relied on
>> >> >>> >> >>>> polling between condor-g and the GRAM4 service instead of
>> >> >>> >> >>>> notifications. Jaime, could you repeat the no-notification
>> >> >>> >> >>>> test and see the difference in memory? The changes would be
>> >> >>> >> >>>> to increase the polling frequency in condor-g and comment
>> >> >>> >> >>>> out the subscribe for notification. You could also comment
>> >> >>> >> >>>> out the notification listener call(s).
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >>> I did two new sets of tests today. The first used more
>> >> >>> >> >>> efficient callback code in the GAHP (one notification
>> >> >>> >> >>> consumer rather than one per job). The second disabled
>> >> >>> >> >>> notifications and relied on polling for job status changes.
>> >> >>> >> >>>
>> >> >>> >> >>> The more efficient callback code did not produce a
>> >> >>> >> >>> noticeable reduction in memory usage.
>> >> >>> >> >>>
>> >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum
>> >> >>> >> >>> jvm heap usage was roughly 8 megs plus 0.5 megs per job. The
>> >> >>> >> >>> minimum heap usage after job submission and before job
>> >> >>> >> >>> completion was about 4 megs + 0.1 megs per job.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> I ran one more test with the improved callback code. This
>> >> >>> >> >> time, I stopped storing the notification producer EPRs
>> >> >>> >> >> associated with the GRAM job resources. Memory usage went
>> >> >>> >> >> down markedly.
>> >> >>> >> >>
>> >> >>> >> >> I was told the client had to explicitly destroy these
>> >> >>> >> >> server-side notification producer resources when it destroys
>> >> >>> >> >> the job, otherwise they hang around bogging down the server.
>> >> >>> >> >> Is this still the case? The server can't destroy notification
>> >> >>> >> >> producers when their sources of information are destroyed?
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >> > This reminds me of the odd fact that I suddenly had to grant
>> >> >>> >> > much more memory to Condor-G as soon as Condor-G started
>> >> >>> >> > storing EPRs of subscription resources to be able to destroy
>> >> >>> >> > them eventually.
>> >> >>> >> > Those EPRs may not be as tiny as they look.
>> >> >>> >> >
>> >> >>> >> > For 4.0: yes, currently you'll have to store and eventually
>> >> >>> >> > destroy subscription resources manually to avoid heaping up
>> >> >>> >> > persistence data on the server side.
>> >> >>> >> > For 4.2: no, you won't have to store them. A job resource will
>> >> >>> >> > destroy all subscription resources when it's destroyed.
>> >> >>> >> >
>> >> >>> >> > Overall I suggest concentrating on 4.2 GRAM, since the
>> >> >>> >> > "container hangs in job destruction" problem won't exist
>> >> >>> >> > anymore.
>> >> >>> >> >
>> >> >>> >> > Sorry, Jaime, I still can't provide you with 100% reliable
>> >> >>> >> > changes in GRAM in 4.2. I'll do so as soon as I can. I wonder
>> >> >>> >> > if it makes sense for us to do the 4.2-related changes in GAHP
>> >> >>> >> > and hand it to you for fine-tuning then?
>> >> >>> >> >
>> >> >>> >> > Martin
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote:
>> >> >>> >>
>> >> >>> >> > Mihael:
>> >> >>> >> >
>> >> >>> >> > That's great, thanks!
>> >> >>> >> >
>> >> >>> >> > Ian.
>> >> >>> >> >
>> >> >>> >> > Mihael Hategan wrote:
>> >> >>> >> >> I did a 1024 job run today with ws-gram.
>> >> >>> >> >> I painted the results here:
>> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html
>> >> >>> >> >>
>> >> >>> >> >> Seems like client memory per job is about 370k. Which is
>> >> >>> >> >> quite a lot. What kinda worries me is that it doesn't seem to
>> >> >>> >> >> go down after the jobs are done, so maybe there's a memory
>> >> >>> >> >> leak, or maybe the garbage collector doesn't do any major
>> >> >>> >> >> collections. I'll need to profile this to see exactly what
>> >> >>> >> >> we're talking about.
>> >> >>> >> >>
>> >> >>> >> >> The container memory is figured by looking at the process in
>> >> >>> >> >> /proc. It's total memory including shared libraries and
>> >> >>> >> >> things. But libraries take a fixed amount of space, so a
>> >> >>> >> >> fuzzy correlation can probably be made. It looks quite similar
>> >> >>> >> >> to the amount of memory eaten on the client side (per job).
>> >> >>> >> >>
>> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the
>> >> >>> >> >> time the jobs are submitted, but the machine itself seems
>> >> >>> >> >> responsive. I have yet to plot the exact submission time for
>> >> >>> >> >> each job.
>> >> >>> >> >>
>> >> >>> >> >> So at this point I would recommend trying ws-gram as long as
>> >> >>> >> >> there aren't too many jobs involved (i.e. under 4000 parallel
>> >> >>> >> >> jobs), and while making sure the jvm has enough heap. More
>> >> >>> >> >> than that seems like a gamble.
>> >> >>> >> >>
>> >> >>> >> >> Mihael
>> >> >>> >> >>
>> >> >>> >> >> _______________________________________________
>> >> >>> >> >> Swift-devel mailing list
>> >> >>> >> >> Swift-devel at ci.uchicago.edu
>> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> >> >>> >> >>
>> >> >>> >> >>