[Swift-devel] ws-gram tests

feller at mcs.anl.gov
Fri Feb 8 16:46:06 CST 2008


>
> On Fri, 2008-02-08 at 16:32 -0600, feller at mcs.anl.gov wrote:
>> I can't see any stability issues here. The only thing I changed
>> is using
>>
>> EndpointReferenceType jobEPR = (EndpointReferenceType)
>>     ObjectSerializer.clone(response.getManagedJobEndpoint());
>>
>> instead of
>>
>> EndpointReferenceType jobEPR = response.getManagedJobEndpoint();
>>
>> at 2 or 3 locations in the code.
>>
>> Rachana uses cloning in core too, so it should be stable.
>>
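Roughly, the change amounts to the sketch below; class and package names follow the snippet above (the Axis addressing EPR, the WSRF ObjectSerializer, and the generated createManagedJob response type) and may need adjusting to the actual client code:

  import org.apache.axis.message.addressing.EndpointReferenceType;
  import org.globus.exec.generated.CreateManagedJobOutputType;
  import org.globus.wsrf.encoding.ObjectSerializer;

  class EprUtil {
      // Deep-copy the job EPR so the full createManagedJob response graph
      // does not stay reachable just because the EPR is kept around.
      static EndpointReferenceType cloneJobEPR(CreateManagedJobOutputType response)
              throws Exception {
          return (EndpointReferenceType)
              ObjectSerializer.clone(response.getManagedJobEndpoint());
      }
  }
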
>> A question though: Do you see a speedup in submission?
>
> I wasn't looking for that. Anything I should be aware of?
>

Well, I can see quite a big speedup and can't really explain it.
The only thing I did was that cloning. But I'm working on trunk, and
I changed some things there that make job creation faster.
In 4.0 you might only see the speedup in jobs without delegation.
It would be interesting for me to know whether you see a higher
submission rate for jobs that don't have any links to delegated
credentials in the job description (so no jobCredentialEndpoint, no
stagingCredentialEndpoint, no transferCredentialEndpoints).
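To be concrete, a delegation-free submission would use a job description with none of those credential elements set. A minimal sketch, assuming the generated JobDescriptionType stubs (the setter names here are guesses based on the element names above):

  import org.globus.exec.generated.JobDescriptionType;

  class DelegationFreeJob {
      static JobDescriptionType sleepJob() {
          // No jobCredentialEndpoint, stagingCredentialEndpoint or
          // transferCredentialEndpoints set, so no delegated credentials
          // are resolved at submission time.
          JobDescriptionType desc = new JobDescriptionType();
          desc.setExecutable("/bin/sleep");
          desc.setArgument(new String[] { "120" });
          return desc;
      }
  }
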

Martin

>>
>> Martin
>>
>>
>> > Yep. Looks much better. How stable is this otherwise?
>> >
>> > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote:
>> >> At first look, it does indeed seem like the GC is more successful at
>> >> cleaning stuff up.
>> >>
>> >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote:
>> >> > Try the attached 4.0-compliant jar in your tests by dropping
>> >> > it into your 4.0.x $GLOBUS_LOCATION/lib.
>> >> > My tests showed about a 2 MB memory increase per 100 GramJob
>> >> > objects, which sounds like a reasonable number to me (about 20 KB
>> >> > per GramJob object, ignoring the notification consumer manager in
>> >> > one job, if my calculations are right).
>> >> >
>> >> > Martin
>> >> >
>> >> > >
>> >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote:
>> >> > >> Mihael,
>> >> > >>
>> >> > >> I think I found the memory leak in GramJob.
>> >> > >> 100 jobs in a test of mine consumed about 23 MB (constantly
>> >> > >> growing) before the fix and 8 MB (very slowly growing) after
>> >> > >> the fix. The big part of that (7 MB) is used right from the
>> >> > >> first job, which may be the NotificationConsumerManager.
>> >> > >> I will commit that change to the 4.0 branch soon, and you can
>> >> > >> try it then.
>> >> > >> Are you using 4.0.x in your tests?
>> >> > >
>> >> > > Yes. If there are no API changes, you can send me the jar file. I
>> >> > > don't have enough knowledge to selectively build WS-GRAM, nor enough
>> >> > > disk space to build the whole GT.
>> >> > >
>> >> > >>
>> >> > >> Martin
>> >> > >>
>> >> > >> >>> >
>> >> > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per
>> >> > >> >>> > job is a bit too much considering that swift (which has to consider
>> >> > >> >>> > many more things) has less than 10K overhead per job.
>> >> > >> >>> >
>> >> > >> >>>
>> >> > >> >>>
>> >> > >> >>> For my better understanding:
>> >> > >> >>> Do you start up your own notification consumer manager that listens
>> >> > >> >>> for notifications of all jobs, or do you let each GramJob instance
>> >> > >> >>> listen for notifications itself?
>> >> > >> >>> In case you listen for notifications yourself: do you store
>> >> > >> >>> GramJob objects or just EPRs of jobs, creating GramJob objects when
>> >> > >> >>> needed?
>> >> > >> >>
>> >> > >> >> Excellent points. I let each GramJob instance listen for
>> >> > >> >> notifications itself. What I observed is that it uses only one
>> >> > >> >> container for that.
>> >> > >> >>
>> >> > >> >
>> >> > >> > Shoot! I didn't know that and thought there would be a container
>> >> > >> > per GramJob in that case. That's one of the core mysteries with
>> >> > >> > notifications.
>> >> > >> > Anyway: I did a quick check some days ago and found that GramJob is
>> >> > >> > surprisingly greedy regarding memory, as you said. I'll have to
>> >> > >> > further check what it is, but will probably not do that before 4.2
>> >> > >> > is out.
>> >> > >> >
>> >> > >> >
>> >> > >> >> Due to the above, a reference to the GramJob is kept anyway,
>> >> > >> >> regardless of whether that reference is in client code or the local
>> >> > >> >> container.
>> >> > >> >>
>> >> > >> >> I'll try to profile a run and see if I can spot where the problems
>> >> > >> >> are.
>> >> > >> >>
>> >> > >> >>>
>> >> > >> >>> Martin
>> >> > >> >>>
>> >> > >> >>> >>
>> >> > >> >>> >> The core team will be looking at improving notifications once
>> >> > >> >>> >> their other 4.2 deliverables are done.
>> >> > >> >>> >>
>> >> > >> >>> >> -Stu
>> >> > >> >>> >>
>> >> > >> >>> >> Begin forwarded message:
>> >> > >> >>> >>
>> >> > >> >>> >> > From: feller at mcs.anl.gov
>> >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST
>> >> > >> >>> >> > To: "Jaime Frey" <jfrey at cs.wisc.edu>
>> >> > >> >>> >> > Cc: "Stuart Martin" <smartin at mcs.anl.gov>, "Terrence Martin"
>> >> > >> >>> >> >   <tmartin at physics.ucsd.edu>, "Martin Feller" <feller at mcs.anl.gov>,
>> >> > >> >>> >> >   "charles bacon" <bacon at mcs.anl.gov>, "Suchandra Thapa"
>> >> > >> >>> >> >   <sthapa at ci.uchicago.edu>, "Rob Gardner" <rwg at hep.uchicago.edu>,
>> >> > >> >>> >> >   "Jeff Porter" <rjporter at lbl.gov>, "Alain Roy" <roy at cs.wisc.edu>,
>> >> > >> >>> >> >   "Todd Tannenbaum" <tannenba at cs.wisc.edu>, "Miron Livny" <miron at cs.wisc.edu>
>> >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage
>> >> > >> >>> >> >
>> >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
>> >> > >> >>> >> >>
>> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
>> >> > >> >>> >> >>>
>> >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote:
>> >> > >> >>> >> >>>>
>> >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM
>> >> > >> >>> >> >>>>> raised some concerns about memory usage on the client side. I did
>> >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared
>> >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper
>> >> > >> >>> >> >>>>> around the java client libraries for WS GRAM.
>> >> > >> >>> >> >>>>>
>> >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs, up to 30 at a
>> >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data
>> >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and execution.
>> >> > >> >>> >> >>>>> Here is what I've discovered so far.
>> >> > >> >>> >> >>>>>
>> >> > >> >>> >> >>>>> Aside from the heap available to the Java code, the JVM used 117
>> >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G
>> >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair.
>> >> > >> >>> >> >>>>>
>> >> > >> >>> >> >>>>> The maximum JVM heap usage (as reported by the garbage collector)
>> >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was
>> >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete),
>> >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job.
>> >> > >> >>> >> >>>>>
>> >> > >> >>> >> >>>>> The only long-term memory per job that I know of in the GAHP is
>> >> > >> >>> >> >>>>> for the notification sink for job status callbacks. 600 KB seems a
>> >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us
>> >> > >> >>> >> >>>>> determine if we're using the notification sinks inefficiently?
>> >> > >> >>> >> >>>>
>> >> > >> >>> >> >>>> Martin just looked and, for the most part, there is nothing wrong
>> >> > >> >>> >> >>>> with how Condor-G manages the callback sink.
>> >> > >> >>> >> >>>> However, one improvement that would reduce the memory used per job
>> >> > >> >>> >> >>>> would be to not have a notification consumer per job, but instead
>> >> > >> >>> >> >>>> use one for all jobs.
>> >> > >> >>> >> >>>>
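For comparison, a rough sketch of the "one consumer for all jobs" pattern, assuming the 4.0 NotificationConsumerManager API (getInstance / startListening / createNotificationConsumer); the exact overloads and exceptions may differ:

  import java.util.List;
  import org.apache.axis.message.addressing.EndpointReferenceType;
  import org.globus.wsrf.NotificationConsumerManager;
  import org.globus.wsrf.NotifyCallback;

  class SharedConsumer {
      // One embedded consumer (one listening container) shared by all jobs,
      // instead of one consumer per GramJob.
      static EndpointReferenceType start() throws Exception {
          NotificationConsumerManager consumer =
              NotificationConsumerManager.getInstance();
          consumer.startListening();
          NotifyCallback callback = new NotifyCallback() {
              public void deliver(List topicPath,
                                  EndpointReferenceType producer,
                                  Object message) {
                  // Dispatch the state-change notification to the right
                  // job, e.g. keyed by the producer EPR.
              }
          };
          // Every job's subscribe request reuses this single consumer EPR.
          return consumer.createNotificationConsumer(callback);
      }
  }
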
>> >> > >> >>> >> >>>> Also, Martin recently did some analysis on Condor-G stress tests
>> >> > >> >>> >> >>>> and found that notifications are building up in the GRAM4 service
>> >> > >> >>> >> >>>> container, and that is causing delays which seem to be causing
>> >> > >> >>> >> >>>> multiple problems.  We're looking at this in a separate effort with
>> >> > >> >>> >> >>>> the GT Core team.  But, after this was clear, Martin re-ran the
>> >> > >> >>> >> >>>> Condor-G test and relied on polling between Condor-G and the GRAM4
>> >> > >> >>> >> >>>> service instead of notifications.  Jaime, could you repeat the
>> >> > >> >>> >> >>>> no-notification test and see the difference in memory?  The changes
>> >> > >> >>> >> >>>> would be to increase the polling frequency in Condor-G and comment
>> >> > >> >>> >> >>>> out the subscribe for notification.  You could also comment out the
>> >> > >> >>> >> >>>> notification listener call(s).
>> >> > >> >>> >> >>>
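A rough sketch of what the polling variant looks like on the client side, assuming the 4.0 GramJob client exposes refreshStatus() and getState() (substitute the equivalent calls if the names differ):

  import org.globus.exec.client.GramJob;
  import org.globus.exec.generated.StateEnumeration;

  class PollingWait {
      // Poll the job until it reaches a terminal state; no notification
      // consumer and no subscription resource on the service side.
      static void waitForJob(GramJob job, long pollMillis) throws Exception {
          while (true) {
              job.refreshStatus();          // ask the service for the current state
              StateEnumeration state = job.getState();
              if (StateEnumeration.Done.equals(state)
                      || StateEnumeration.Failed.equals(state)) {
                  return;
              }
              Thread.sleep(pollMillis);
          }
      }
  }
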
>> >> > >> >>> >> >>>
>> >> > >> >>> >> >>> I did two new sets of tests today. The first used more efficient
>> >> > >> >>> >> >>> callback code in the GAHP (one notification consumer rather than
>> >> > >> >>> >> >>> one per job). The second disabled notifications and relied on
>> >> > >> >>> >> >>> polling for job status changes.
>> >> > >> >>> >> >>>
>> >> > >> >>> >> >>> The more efficient callback code did not produce a noticeable
>> >> > >> >>> >> >>> reduction in memory usage.
>> >> > >> >>> >> >>>
>> >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum JVM
>> >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum
>> >> > >> >>> >> >>> heap usage after job submission and before job completion was about
>> >> > >> >>> >> >>> 4 megs plus 0.1 megs per job.
>> >> > >> >>> >> >>
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> I ran one more test with the improved callback code. This time, I
>> >> > >> >>> >> >> stopped storing the notification producer EPRs associated with the
>> >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly.
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> I was told the client had to explicitly destroy these server-side
>> >> > >> >>> >> >> notification producer resources when it destroys the job, otherwise
>> >> > >> >>> >> >> they hang around bogging down the server. Is this still the case?
>> >> > >> >>> >> >> The server can't destroy notification producers when their sources
>> >> > >> >>> >> >> of information are destroyed?
>> >> > >> >>> >> >>
>> >> > >> >>> >> >
>> >> > >> >>> >> > This reminds me of the odd fact that I suddenly had to grant much
>> >> > >> >>> >> > more memory to Condor-G as soon as it started storing EPRs of
>> >> > >> >>> >> > subscription resources in order to destroy them eventually.
>> >> > >> >>> >> > Those EPRs may not be as tiny as they look.
>> >> > >> >>> >> >
>> >> > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually destroy
>> >> > >> >>> >> > subscription resources manually to avoid piling up persistence data
>> >> > >> >>> >> > on the server side.
>> >> > >> >>> >> > For 4.2: no, you won't have to store them. A job resource will
>> >> > >> >>> >> > destroy all subscription resources when it's destroyed.
>> >> > >> >>> >> >
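On 4.0, that bookkeeping amounts to something like the sketch below; destroySubscription() stands in for whatever WS-ResourceLifetime destroy call the client uses and is only a placeholder here:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.axis.message.addressing.EndpointReferenceType;

  class SubscriptionTracker {
      // 4.0: remember every subscription resource the client creates and
      // destroy it when the job is destroyed, or persistence data piles up
      // on the server side. (4.2 does this automatically.)
      private final List subscriptionEPRs = new ArrayList();

      void remember(EndpointReferenceType subscriptionEPR) {
          subscriptionEPRs.add(subscriptionEPR);
      }

      void destroyAll() {
          for (int i = 0; i < subscriptionEPRs.size(); i++) {
              EndpointReferenceType epr =
                  (EndpointReferenceType) subscriptionEPRs.get(i);
              destroySubscription(epr);
          }
          subscriptionEPRs.clear();
      }

      private void destroySubscription(EndpointReferenceType epr) {
          // Placeholder: invoke the destroy operation of the subscription
          // resource behind this EPR.
      }
  }
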
>> >> > >> >>> >> > Overall, I suggest concentrating on 4.2 GRAM, since the "container
>> >> > >> >>> >> > hangs in job destruction" problem won't exist anymore.
>> >> > >> >>> >> >
>> >> > >> >>> >> > Sorry, Jaime, I still can't provide you with 100% reliable 4.2
>> >> > >> >>> >> > changes in GRAM. I'll do so as soon as I can. I wonder if it makes
>> >> > >> >>> >> > sense for us to do the 4.2-related changes in the GAHP and hand it
>> >> > >> >>> >> > to you for fine-tuning then?
>> >> > >> >>> >> >
>> >> > >> >>> >> > Martin
>> >> > >> >>> >>
>> >> > >> >>> >>
>> >> > >> >>> >>
>> >> > >> >>> >>
>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote:
>> >> > >> >>> >>
>> >> > >> >>> >> > Mihael:
>> >> > >> >>> >> >
>> >> > >> >>> >> > That's great, thanks!
>> >> > >> >>> >> >
>> >> > >> >>> >> > Ian.
>> >> > >> >>> >> >
>> >> > >> >>> >> > Mihael Hategan wrote:
>> >> > >> >>> >> >> I did a 1024-job run today with WS-GRAM.
>> >> > >> >>> >> >> I plotted the results here:
>> >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> It seems like client memory per job is about 370 KB, which is quite
>> >> > >> >>> >> >> a lot. What worries me a bit is that it doesn't seem to go down
>> >> > >> >>> >> >> after the jobs are done, so maybe there's a memory leak, or maybe
>> >> > >> >>> >> >> the garbage collector doesn't do any major collections. I'll need to
>> >> > >> >>> >> >> profile this to see exactly what we're talking about.
>> >> > >> >>> >> >>
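A quick way to tell a leak from a lazy collector is to force a full collection after the jobs finish and re-measure the heap; a minimal sketch using only standard JVM calls:

  class HeapCheck {
      // Used heap after an explicit full GC. If this stays proportional to
      // the number of completed jobs, references are still being held (a
      // leak) rather than just waiting for a major collection.
      static long usedHeapAfterGC() throws InterruptedException {
          System.gc();
          Thread.sleep(500);   // give the collector a moment to finish
          Runtime rt = Runtime.getRuntime();
          return rt.totalMemory() - rt.freeMemory();
      }
  }
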
>> >> > >> >>> >> >> The container memory is figured by looking at the process in /proc.
>> >> > >> >>> >> >> It's total memory, including shared libraries and things. But
>> >> > >> >>> >> >> libraries take a fixed amount of space, so a fuzzy correlation can
>> >> > >> >>> >> >> probably be made. It looks quite similar to the amount of memory
>> >> > >> >>> >> >> eaten on the client side (per job).
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time
>> >> > >> >>> >> >> the jobs are submitted, but the machine itself seems responsive. I
>> >> > >> >>> >> >> have yet to plot the exact submission time for each job.
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> So at this point I would recommend trying WS-GRAM as long as there
>> >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and
>> >> > >> >>> >> >> while making sure the JVM has enough heap. More than that seems like
>> >> > >> >>> >> >> a gamble.
>> >> > >> >>> >> >>
>> >> > >> >>> >> >> Mihael
>> >> > >> >>> >> >>




