[Swift-devel] ws-gram tests

feller at mcs.anl.gov
Fri Feb 8 14:15:09 CST 2008


OK, replace all GRAM jars with the attached ones.

> Won't fly:
>
> java.lang.NoClassDefFoundError: org/globus/exec/utils/audit/AuditUtil
>         at org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:952)
>         at org.globus.exec.client.GramJob.submit(GramJob.java:447)
>
> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote:
>> Try the attached 4.0-compliant jar in your tests by dropping
>> it into your 4.0.x $GLOBUS_LOCATION/lib.
>> My tests showed a memory increase of about 2MB per 100 GramJob
>> objects, which sounds like a reasonable number to me (about 20KB
>> per GramJob object, ignoring the notification consumer manager
>> in one job, if my calculations are right).
>>
>> Martin
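
For context, a measurement like the one described above can be
approximated with nothing more than the Runtime API. This is a rough
sketch, not Martin's actual test; it assumes the GT4 client jars are on
the classpath and that GramJob's no-argument constructor is usable on
its own (worth checking against the 4.0 javadoc):

    // Rough sketch of a bytes-per-GramJob estimate.
    import org.globus.exec.client.GramJob;

    public class GramJobMemoryCheck {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            // Keep strong references so nothing is collected mid-test.
            GramJob[] jobs = new GramJob[100];
            for (int i = 0; i < jobs.length; i++) {
                jobs[i] = new GramJob();
            }

            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.println("Approx. bytes per GramJob: "
                    + (after - before) / jobs.length);
        }
    }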
>>
>> >
>> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote:
>> >> Mihael,
>> >>
>> >> I think I found the memory leak in GramJob.
>> >> In one of my tests, 100 jobs consumed about 23MB (constantly
>> >> growing) before the fix and 8MB (very slowly growing) after
>> >> it. The biggest part of that (7MB) is used right from the
>> >> first job, which may be the NotificationConsumerManager.
>> >> I will commit that change to the 4.0 branch soon, and you can
>> >> try it then.
>> >> Are you using 4.0.x in your tests?
>> >
>> > Yes. If there are no API changes, you can send me the jar file.
>> > I don't have enough knowledge to selectively build WS-GRAM, nor
>> > enough disk space to build the whole GT.
>> >
>> >>
>> >> Martin
>> >>
>> >> >>> >
>> >> >>> > These are both hacks. I'm not sure I want to go there.
>> >> >>> > 300K per job is a bit too much considering that Swift
>> >> >>> > (which has to consider many more things) has less than 10K
>> >> >>> > overhead per job.
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>> For my better understanding:
>> >> >>> Do you start up your own notification consumer manager that
>> >> >>> listens for notifications of all jobs, or do you let each
>> >> >>> GramJob instance listen for notifications itself?
>> >> >>> In case you listen for notifications yourself: do you store
>> >> >>> GramJob objects, or just the EPRs of jobs, creating GramJob
>> >> >>> objects when needed?
>> >> >>
>> >> >> Excellent points. I let each GramJob instance listen for
>> >> >> notifications itself. What I observed is that it uses only
>> >> >> one container for that.
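
To make the store-only-EPRs alternative from Martin's question
concrete: a minimal sketch of caching job endpoints (EPRs) and
re-creating GramJob objects on demand. This is an illustration, not
code from the thread; getEndpoint()/setEndpoint() are assumed
accessors and should be checked against the 4.0 GramJob javadoc:

    // Hypothetical sketch: cache small EPRs instead of live GramJobs.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.exec.client.GramJob;

    public class JobEndpointCache {
        // An EPR is far smaller than a GramJob with its listeners.
        private final Map<String, EndpointReferenceType> endpoints =
                new HashMap<String, EndpointReferenceType>();

        public void remember(String id, GramJob job) {
            endpoints.put(id, job.getEndpoint()); // assumed accessor
            // The caller can now drop its GramJob reference.
        }

        public GramJob reattach(String id) throws Exception {
            GramJob job = new GramJob();
            job.setEndpoint(endpoints.get(id)); // assumed accessor
            return job;
        }
    }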
>> >> >>
>> >> >
>> >> > Shoot! I didn't know that; I thought there would be a
>> >> > container per GramJob in that case. That's one of the core
>> >> > mysteries of notifications.
>> >> > Anyway: I did a quick check some days ago and found that
>> >> > GramJob is surprisingly greedy regarding memory, as you said.
>> >> > I'll have to check further what it is, but will probably not
>> >> > do that before 4.2 is out.
>> >> >
>> >> >
>> >> >> Due to the above, a reference to the GramJob is kept anyway,
>> >> >> regardless of whether that reference is in client code or in
>> >> >> the local container.
>> >> >>
>> >> >> I'll try to profile a run and see if I can spot where the
>> >> >> problems are.
>> >> >>
>> >> >>>
>> >> >>> Martin
>> >> >>>
>> >> >>> >>
>> >> >>> >> The core team will be looking at improving notifications
>> >> >>> >> once their other 4.2 deliverables are done.
>> >> >>> >>
>> >> >>> >> -Stu
>> >> >>> >>
>> >> >>> >> Begin forwarded message:
>> >> >>> >>
>> >> >>> >> > From: feller at mcs.anl.gov
>> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST
>> >> >>> >> > To: "Jaime Frey" <jfrey at cs.wisc.edu>
>> >> >>> >> > Cc: "Stuart Martin" <smartin at mcs.anl.gov>,
>> >> >>> >> >     "Terrence Martin" <tmartin at physics.ucsd.edu>,
>> >> >>> >> >     "Martin Feller" <feller at mcs.anl.gov>,
>> >> >>> >> >     "Charles Bacon" <bacon at mcs.anl.gov>,
>> >> >>> >> >     "Suchandra Thapa" <sthapa at ci.uchicago.edu>,
>> >> >>> >> >     "Rob Gardner" <rwg at hep.uchicago.edu>,
>> >> >>> >> >     "Jeff Porter" <rjporter at lbl.gov>,
>> >> >>> >> >     "Alain Roy" <roy at cs.wisc.edu>,
>> >> >>> >> >     "Todd Tannenbaum" <tannenba at cs.wisc.edu>,
>> >> >>> >> >     "Miron Livny" <miron at cs.wisc.edu>
>> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage
>> >> >>> >> >
>> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
>> >> >>> >> >>
>> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
>> >> >>> >> >>>
>> >> >>> >> >>>> On Jan 30, 2008, at 11:46 AM, Jaime Frey wrote:
>> >> >>> >> >>>>
>> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G
>> >> >>> >> >>>>> with WS GRAM raised some concerns about memory usage
>> >> >>> >> >>>>> on the client side. I did some profiling of
>> >> >>> >> >>>>> Condor-G's WS GRAM GAHP server, which appeared to be
>> >> >>> >> >>>>> the primary memory consumer. The GAHP server is a
>> >> >>> >> >>>>> wrapper around the Java client libraries for WS GRAM.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs,
>> >> >>> >> >>>>> up to 30 at a time. The jobs were 2-minute sleep
>> >> >>> >> >>>>> jobs with minimal data transfer. All of the jobs
>> >> >>> >> >>>>> overlapped in submission and execution. Here is what
>> >> >>> >> >>>>> I've discovered so far.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> Aside from the heap available to the Java code, the
>> >> >>> >> >>>>> JVM used 117 megs of non-shared memory and 74 megs
>> >> >>> >> >>>>> of shared memory. Condor-G creates one GAHP server
>> >> >>> >> >>>>> for each (local uid, X509 DN) pair.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> The maximum JVM heap usage (as reported by the
>> >> >>> >> >>>>> garbage collector) was about 9 megs plus 0.9 megs
>> >> >>> >> >>>>> per job. When the GAHP was quiescent (jobs
>> >> >>> >> >>>>> executing, Condor-G waiting for them to complete),
>> >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job.
>> >> >>> >> >>>>>
>> >> >>> >> >>>>> The only long-term memory per job that I know of in
>> >> >>> >> >>>>> the GAHP is for the notification sink for job status
>> >> >>> >> >>>>> callbacks. 600KB seems a little high for that. Stu,
>> >> >>> >> >>>>> could someone on Globus help us determine if we're
>> >> >>> >> >>>>> using the notification sinks inefficiently?
>> >> >>> >> >>>>
>> >> >>> >> >>>> Martin just looked, and for the most part there is
>> >> >>> >> >>>> nothing wrong with how Condor-G manages the callback
>> >> >>> >> >>>> sink. However, one improvement that would reduce the
>> >> >>> >> >>>> memory used per job would be to not have a
>> >> >>> >> >>>> notification consumer per job, but instead to use one
>> >> >>> >> >>>> for all jobs.
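
To make that suggestion concrete, here is a hedged sketch of the
one-consumer-for-all-jobs pattern using the GT4 Java WS Core
notification API; the class and method names are from memory and
should be verified against the 4.0 javadoc:

    // Sketch only: one consumer endpoint shared by all jobs,
    // instead of one notification consumer per GramJob.
    import java.util.List;
    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.wsrf.NotificationConsumerManager;
    import org.globus.wsrf.NotifyCallback;

    public class SharedJobConsumer implements NotifyCallback {
        private NotificationConsumerManager manager;
        private EndpointReferenceType consumerEPR;

        public void start() throws Exception {
            manager = NotificationConsumerManager.getInstance();
            manager.startListening();
            // One endpoint; every job's subscribe request points here.
            consumerEPR = manager.createNotificationConsumer(this);
        }

        public EndpointReferenceType getConsumerEPR() {
            return consumerEPR;
        }

        // Called once per notification; the producer EPR identifies
        // which job resource the state change belongs to.
        public void deliver(List topicPath,
                            EndpointReferenceType producer,
                            Object message) {
            // Demultiplex on 'producer' and update that job's record.
        }
    }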
>> >> >>> >> >>>>
>> >> >>> >> >>>> Also, Martin recently did some analysis of Condor-G
>> >> >>> >> >>>> stress tests and found that notifications are
>> >> >>> >> >>>> building up in the GRAM4 service container, and that
>> >> >>> >> >>>> is causing delays that seem to lead to multiple
>> >> >>> >> >>>> problems. We're looking at this in a separate effort
>> >> >>> >> >>>> with the GT Core team. But after this became clear,
>> >> >>> >> >>>> Martin re-ran the Condor-G test and relied on polling
>> >> >>> >> >>>> between Condor-G and the GRAM4 service instead of
>> >> >>> >> >>>> notifications. Jaime, could you repeat the
>> >> >>> >> >>>> no-notification test and see the difference in
>> >> >>> >> >>>> memory? The changes would be to increase the polling
>> >> >>> >> >>>> frequency in Condor-G and comment out the subscribe
>> >> >>> >> >>>> for notifications. You could also comment out the
>> >> >>> >> >>>> notification listener call(s).
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >>> I did two new sets of tests today. The first used
>> >> >>> >> >>> more efficient callback code in the GAHP (one
>> >> >>> >> >>> notification consumer rather than one per job). The
>> >> >>> >> >>> second disabled notifications and relied on polling
>> >> >>> >> >>> for job status changes.
>> >> >>> >> >>>
>> >> >>> >> >>> The more efficient callback code did not produce a
>> >> >>> >> >>> noticeable reduction in memory usage.
>> >> >>> >> >>>
>> >> >>> >> >>> Disabling notifications did reduce memory usage. The
>> >> >>> >> >>> maximum JVM heap usage was roughly 8 megs plus 0.5
>> >> >>> >> >>> megs per job. The minimum heap usage after job
>> >> >>> >> >>> submission and before job completion was about 4 megs
>> >> >>> >> >>> plus 0.1 megs per job.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> I ran one more test with the improved callback code.
>> >> >>> >> >> This time, I stopped storing the notification producer
>> >> >>> >> >> EPRs associated with the GRAM job resources. Memory
>> >> >>> >> >> usage went down markedly.
>> >> >>> >> >>
>> >> >>> >> >> I was told the client had to explicitly destroy these
>> >> >>> >> >> server-side notification producer resources when it
>> >> >>> >> >> destroys the job, otherwise they hang around bogging
>> >> >>> >> >> down the server. Is this still the case? Can't the
>> >> >>> >> >> server destroy notification producers when their
>> >> >>> >> >> sources of information are destroyed?
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >> > This reminds me of the odd fact that I suddenly had to
>> >> >>> >> > grant much more memory to Condor-G as soon as it started
>> >> >>> >> > storing EPRs of subscription resources in order to be
>> >> >>> >> > able to destroy them eventually. Those EPRs may not be
>> >> >>> >> > as tiny as they look.
>> >> >>> >> >
>> >> >>> >> > For 4.0: yes, currently you'll have to store and
>> >> >>> >> > eventually destroy subscription resources manually to
>> >> >>> >> > avoid piling up persistence data on the server side.
>> >> >>> >> > For 4.2: no, you won't have to store them. A job
>> >> >>> >> > resource will destroy all of its subscription resources
>> >> >>> >> > when it is destroyed.
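
For the 4.0 case, the manual cleanup described above would look roughly
like the following, using the stock WSRF resource-lifetime stubs; the
locator and port names are assumptions to verify against the GT 4.0
javadoc (security settings on the stub are omitted):

    // Sketch: explicitly destroy a stored subscription resource so
    // its persistence data doesn't pile up on the server side.
    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.oasis.wsrf.lifetime.Destroy;
    import org.oasis.wsrf.lifetime.ImmediateResourceTermination;
    import org.oasis.wsrf.lifetime.service.WSResourceLifetimeServiceAddressingLocator;

    public class SubscriptionCleanup {
        public static void destroy(EndpointReferenceType subscriptionEPR)
                throws Exception {
            WSResourceLifetimeServiceAddressingLocator locator =
                    new WSResourceLifetimeServiceAddressingLocator();
            ImmediateResourceTermination port =
                    locator.getImmediateResourceTerminationPort(subscriptionEPR);
            port.destroy(new Destroy()); // removes the server-side resource
        }
    }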
>> >> >>> >> >
>> >> >>> >> > Overall I suggest concentrating on 4.2 GRAM, since the
>> >> >>> >> > "container hangs in job destruction" problem won't exist
>> >> >>> >> > anymore.
>> >> >>> >> >
>> >> >>> >> > Sorry, Jaime, I still can't provide you with 100%
>> >> >>> >> > reliable 4.2 changes in GRAM. I'll do so as soon as I
>> >> >>> >> > can. I wonder if it makes sense for us to do the
>> >> >>> >> > 4.2-related changes in the GAHP and hand it to you for
>> >> >>> >> > fine-tuning?
>> >> >>> >> >
>> >> >>> >> > Martin
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Feb 8, 2008, at 9:19 AM, Ian Foster wrote:
>> >> >>> >>
>> >> >>> >> > Mihael:
>> >> >>> >> >
>> >> >>> >> > That's great, thanks!
>> >> >>> >> >
>> >> >>> >> > Ian.
>> >> >>> >> >
>> >> >>> >> > Mihael Hategan wrote:
>> >> >>> >> >> I did a 1024-job run today with WS-GRAM.
>> >> >>> >> >> I plotted the results here:
>> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html
>> >> >>> >> >>
>> >> >>> >> >> It seems like client memory per job is about 370k,
>> >> >>> >> >> which is quite a lot. What kinda worries me is that it
>> >> >>> >> >> doesn't seem to go down after the jobs are done, so
>> >> >>> >> >> maybe there's a memory leak, or maybe the garbage
>> >> >>> >> >> collector doesn't do any major collections. I'll need
>> >> >>> >> >> to profile this to see exactly what we're talking
>> >> >>> >> >> about.
>> >> >>> >> >>
>> >> >>> >> >> The container memory is figured out by looking at the
>> >> >>> >> >> process in /proc. It's total memory, including shared
>> >> >>> >> >> libraries and things. But libraries take a fixed
>> >> >>> >> >> amount of space, so a fuzzy correlation can probably
>> >> >>> >> >> be made. It looks quite similar to the amount of
>> >> >>> >> >> memory eaten on the client side (per job).
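
For reference, the /proc numbers mentioned above can be pulled with a
few lines; this sketch just reads the VmSize/VmRSS fields from
/proc/<pid>/status on Linux:

    // Sketch: dump the container process's memory counters from /proc.
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ProcMem {
        public static void main(String[] args) throws Exception {
            String pid = args[0]; // the container's process id
            BufferedReader r = new BufferedReader(
                    new FileReader("/proc/" + pid + "/status"));
            String line;
            while ((line = r.readLine()) != null) {
                // VmSize: total incl. shared libs; VmRSS: resident set
                if (line.startsWith("VmSize") || line.startsWith("VmRSS")) {
                    System.out.println(line);
                }
            }
            r.close();
        }
    }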
>> >> >>> >> >>
>> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work
>> >> >>> >> >> during the time the jobs are submitted, but the
>> >> >>> >> >> machine itself seems responsive. I have yet to plot
>> >> >>> >> >> the exact submission time for each job.
>> >> >>> >> >>
>> >> >>> >> >> So at this point I would recommend trying WS-GRAM as
>> >> >>> >> >> long as there aren't too many jobs involved (i.e.,
>> >> >>> >> >> under 4000 parallel jobs), while making sure the JVM
>> >> >>> >> >> has enough heap. More than that seems like a gamble.
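
(For scale: at ~370k of client memory per job, 4000 parallel jobs work
out to roughly 4000 x 370KB, or about 1.4GB, of job state alone, so
the JVM's maximum heap, set with the standard -Xmx flag, would need to
sit comfortably above that.)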
>> >> >>> >> >>
>> >> >>> >> >> Mihael
>> >> >>> >> >>
>> >> >>> >> >> _______________________________________________
>> >> >>> >> >> Swift-devel mailing list
>> >> >>> >> >> Swift-devel at ci.uchicago.edu
>> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gramjars.tar.gz
Type: application/x-gzip
Size: 531778 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20080208/308a0954/attachment.bin>

