[Swift-devel] ws-gram tests

Stuart Martin smartin at mcs.anl.gov
Fri Feb 8 09:33:53 CST 2008


Mihael,

Glad to hear things are improved with GRAM4.  Lets keep going to have  
swift using GRAM4 routinely.

Below is a recent thread that looked at this exact issue with condor- 
g.  But it is entirely relevant to your use of GRAM4. the 2 issues to  
look for are

1) your use of notifications

>> I ran one more test with the improved callback code. This time, I
>> stopped storing the notification producer EPRs associated with the
>> GRAM job resources. Memory usage went down markedly.

2) you could avoid notifications and instead do client-side polling  
for job state.  This has shown to be more reliable than notifications  
under heavy loads, condor-g processing 1000s of jobs.

The core team will be looking at improving notifications once their  
other 4.2 deliverables are done.

-Stu

Begin forwarded message:

> From: feller at mcs.anl.gov
> Date: February 1, 2008 9:41:05 AM CST
> To: "Jaime Frey" <jfrey at cs.wisc.edu>
> Cc: "Stuart Martin" <smartin at mcs.anl.gov>, "Terrence Martin" <tmartin at physics.ucsd.edu 
> >, "Martin Feller" <feller at mcs.anl.gov>, "charles bacon" <bacon at mcs.anl.gov 
> >, "Suchandra Thapa" <sthapa at ci.uchicago.edu>, "Rob Gardner" <rwg at hep.uchicago.edu 
> >, "Jeff Porter" <rjporter at lbl.gov>, "Alain Roy" <roy at cs.wisc.edu>,  
> "Todd Tannenbaum" <tannenba at cs.wisc.edu>, "Miron Livny" <miron at cs.wisc.edu 
> >
> Subject: Re: Condor-G WS GRAM memory usage
>
>> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote:
>>
>>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote:
>>>
>>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote:
>>>>
>>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM
>>>>> raised some concerns about memory usage on the client side. I did
>>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared
>>>>> to be the primary memory consumer. The GAHP server is a wrapper
>>>>> around the java client libraries for WS GRAM.
>>>>>
>>>>> In my tests, I submitted variable numbers of jobs up to 30 at a
>>>>> time. The jobs were 2-minute sleep jobs with minimal data
>>>>> transfer. All of the jobs overlapped in submission and execution.
>>>>> Here is what I've discovered so far.
>>>>>
>>>>> Aside from the heap available to the java code, the jvm used 117
>>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G
>>>>> creates one GAHP server for each (local uid, X509 DN) pair.
>>>>>
>>>>> The maximum jvm heap usage (as reported by the garbage collector)
>>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was
>>>>> quiescent (jobs executing, Condor-G waiting for them to complete),
>>>>> heap usage was about 5 megs plus 0.6 megs per job.
>>>>>
>>>>> The only long-term memory per job that I know of in the GAHP is
>>>>> for the notification sink for job status callbacks. 600kb seems a
>>>>> little high for that. Stu, could someone on Globus help us
>>>>> determine if we're using the notification sinks inefficiently?
>>>>
>>>> Martin just looked and for the most part, there is nothing wrong
>>>> with how condor-g manages the callback sink.
>>>> However, one improvement that would reduce the memory used per job
>>>> would be to not have a notification consumer per job.  Instead use
>>>> one for all jobs.
>>>>
>>>> Also, Martin recently did some analysis on condor-g stress tests
>>>> and found that notifications are building up on the in the GRAM4
>>>> service container and that is causing delays which seem to be
>>>> causing multiple problems.  We're looking at this in a separate
>>>> effort with the GT Core team.  But, after this was clear, Martin  
>>>> re-
>>>> ran the condor-g test and relied on polling between condor-g and
>>>> the GRAM4 service instead of notifications.  Jaime, could you
>>>> repeat the no-notification test and see the difference in memory?
>>>> The changes would be to increase the polling frequency in condor-g
>>>> and comment out the subscribe for notification.  You could also
>>>> comment out the notification listener call(s) too.
>>>
>>>
>>> I did two new sets of tests today. The first used more efficient
>>> callback code in the GAHP (one notification consumer rather than one
>>> per job). The second disabled notifications and relied on polling
>>> for job status changes.
>>>
>>> The more efficient callback code did not produce a noticeable
>>> reduction in memory usage.
>>>
>>> Disabling notifications did reduce memory usage. The maximum jvm
>>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum
>>> heap usage after job submission and before job completion was about
>>> 4 megs + 0.1 megs per job.
>>
>>
>> I ran one more test with the improved callback code. This time, I
>> stopped storing the notification producer EPRs associated with the
>> GRAM job resources. Memory usage went down markedly.
>>
>> I was told the client had to explicitly destroy these serve-side
>> notification producer resources when it destroys the job, otherwise
>> they hang around bogging down the server. Is this still the case? The
>> server can't destroy notification producers when their sources of
>> information are destroyed?
>>
>
> This reminds me of the odd fact that i had to suddenly grant much more
> memory to Condor-g as soon as condor-g started storing EPRs of
> subscription resources to be able to destroy them eventually.
> Those EPR's are maybe not so tiny as they look like.
>
> For 4.0: yes, currently you'll have to store and eventually destroy
> subscription resources manually to avoid heaping up persistence data
> on the server-side.
> For 4.2: no, you won't have to store them. A job resource will
> destroy all subscription resources when it's destroyed.
>
> Overall i suggest to concentrate on 4.2 gram since the "container
> hangs in job destruction" problem won't exist anymore.
>
> Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes
> in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes  
> sense
> for us to do the 4.2-related changes in Gahp and hand it to you for
> fine-tuning then?
>
> Martin




On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote:

> Mihael:
>
> That's great, thanks!
>
> Ian.
>
> Mihael Hategan wrote:
>> I did a 1024 job run today with ws-gram.
>> I painted the results here:
>> http://www-unix.mcs.anl.gov/~hategan/s/g.html
>>
>> Seems like client memory per job is about 370k. Which is quite a lot.
>> What kinda worries me is that it doesn't seem to go down after the  
>> jobs
>> are done, so maybe there's a memory leak, or maybe the garbage  
>> collector
>> doesn't do any major collections. I'll need to profile this to see
>> exactly what we're talking about.
>>
>> The container memory is figured by looking at the process in /proc.  
>> It's
>> total memory including shared libraries and things. But libraries  
>> take a
>> fixed amount of space, so a fuzzy correlation can probably be made.  
>> It
>> looks quite similar to the amount of memory eaten on the client side
>> (per job).
>>
>> CPU-load-wise, WS-GRAM behaves. There is some work during the time  
>> the
>> jobs are submitted, but the machine itself seems responsive. I have  
>> yet
>> to plot the exact submission time for each job.
>>
>> So at this point I would recommend trying ws-gram as long as there
>> aren't too many jobs involved (i.e. under 4000 parallel jobs), and  
>> while
>> making sure the jvm has enough heap. More than that seems like a  
>> gamble.
>>
>> Mihael
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>>
>




More information about the Swift-devel mailing list