Out of memory problem

Rob Latham robl at mcs.anl.gov
Tue Sep 8 10:13:48 CDT 2009


On Tue, Sep 08, 2009 at 04:24:01PM +0200, Julien Bodart wrote:
> Hi Rob,
> 
> You are probably right about the netcdf mailing list. I am replying to you only.
> Thanks for your answer. I am actually working in France and the BG/P is
> located in Paris.

Oh, I'm terribly sorry!  I meant that I personally was going off on a
very low-level tangent.  Your discussion is absolutely appropriate for
the mailing list, and so I have returned the discussion there.

> The run I am talking about is on 8K cores. I already knew that
> allocation/deallocation should be avoided when working on BG.
> That's why I removed all unnecessary allocation from my code at
> runtime. I still have a few strings and small arrays, but I don't
> think they could be the problem. So, of course I can't be 100% sure,
> but I don't think there are any big memory leaks in my code any more.

8k cores is not an unusually huge run, so we should be able to make
this work for you.

> I suspect the memory leaks happen during the data-saving process.
> That's the reason I was asking on the par-netcdf mailing list.
> If I reduce the saving frequency, the problem does not happen
> any more. I actually did not think about the MPI implementation; I
> am currently using the default BG/P library.

If you are using the pnetcdf *_all routines, then you are going
through MPI-IO's collective code path, which exercises a different
set of functions than independent access does.  Do you ever switch to
independent access?
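
For reference, here's a minimal sketch of the two modes (the
variable names are made up, error checking is omitted, and 'start'
and 'count' are MPI_Offset arrays):

    /* Collective: every process in the communicator must call the
     * _all variant; this uses MPI-IO's collective code path. */
    ncmpi_put_vara_double_all(ncid, varid, start, count, buf);

    /* Independent: bracketed by begin/end_indep_data; this uses
     * MPI-IO's independent code path instead. */
    ncmpi_begin_indep_data(ncid);
    ncmpi_put_vara_double(ncid, varid, start, count, buf);
    ncmpi_end_indep_data(ncid);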

On our BG/P we have a tool called 'efix_version.pl', which says we've
installed several IBM EFIX updates to the MPI library: 7, 13, and 36,
but even so I know there are at least two resource leaks in the system
MPI library.  

The pending "v1r4" update should fix a lot of these resource leaks in
the MPI library, but I think Argonne is going to be one of the first
sites to test that driver, and we haven't installed it yet.

> Thanks for the link anyway, I did not know how to get the
> available/consumed memory. I will make some test using it
> before/after data saving, and try to confirm my guess.

That would be great.  We could be leaking memory in pnetcdf, but I did
go clean up a lot of the worst leaks for the 1.0.3 release.
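
If it helps, here's a rough sketch of one way to take those
before/after measurements, using plain glibc mallinfo().  (The BG/P
compute-node kernel also has its own memory-query SPI, but I don't
have those header names in front of me, so check the system docs for
that route.)

    #include <malloc.h>   /* mallinfo() -- glibc */
    #include <stdio.h>

    /* Print heap usage before and after a save; call it on a couple
     * of ranks only, or you'll get 8k lines of output per save. */
    void report_heap(const char *label, int rank)
    {
        struct mallinfo mi = mallinfo();
        printf("%d: %s: heap in use: %d bytes\n",
               rank, label, mi.uordblks);
    }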

> Unfortunately I can't post the backtrace as I have stupidly lost it.

That's ok, but if you re-run your program and get another core file,
do send it.

In the meantime, can you tell us more about the netcdf dataset you are
creating and how you are creating it?  If you have a successful run
(maybe because you saved less frequently), what does the output of
'ncdump -h' or 'ncmpidump -h' look like?
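
For example, on a login node (substituting your actual file name):

    ncmpidump -h output.nc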

Do you create the dataset and then write the same variables over and
over, or are you creating one dataset per save?
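
If it's the former, a sketch of the usual record-variable pattern
looks something like this (the dimension names, sizes, and variable
layout here are all made up; error checking omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    int ncid, dim[3], varid;
    MPI_Offset start[3], count[3], rec = 0;
    MPI_Offset global_nx = 512, global_ny = 512;  /* made-up sizes  */
    MPI_Offset my_xoff, my_yoff, my_nx, my_ny;    /* this rank's slab */
    double *u;                                    /* this rank's data */

    /* Define the dataset once: "time" is the unlimited (record)
     * dimension, so each save appends one new record. */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dim[0]);
    ncmpi_def_dim(ncid, "x", global_nx, &dim[1]);
    ncmpi_def_dim(ncid, "y", global_ny, &dim[2]);
    ncmpi_def_var(ncid, "u", NC_DOUBLE, 3, dim, &varid);
    ncmpi_enddef(ncid);

    /* At each save, every rank writes its own slab of record 'rec': */
    start[0] = rec;      count[0] = 1;
    start[1] = my_xoff;  count[1] = my_nx;
    start[2] = my_yoff;  count[2] = my_ny;
    ncmpi_put_vara_double_all(ncid, varid, start, count, u);
    rec++;

The one-file pattern only opens and closes the file once, so it goes
through the open/close path (a common place for resource leaks) far
less often than creating one dataset per save.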

I think we should be able to get your app working.  This sort of
discussion is perfect for the mailing list -- we'll find and fix a bug
somewhere and the archives will let the next person who sees this sort
of thing know what to try.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

