Out of memory problem

Rob Latham robl at mcs.anl.gov
Tue Sep 8 08:43:56 CDT 2009


On Sat, Sep 05, 2009 at 01:44:55PM +0200, Julien Bodart wrote:
> Hi everybody,
> 
> I am using parallel-netcdf on a Blue Gene/P computer and I run into
> trouble when running large cases.

How large (# of cores)?

> The error message looks like:
> 
> "abort(1) on node 2820 (rank 2820 in comm 1140850688): application called
> MPI_Abort(MPI_COMM_WORLD, 1) - process 2820
> Out of memory in file
> /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c,
> line 1238"
> 
> It looks like a memory leak, as the program is able to perform many
> intermediate saves during the run and then suddenly stops with the
> previous error message.

If I'm looking at the right source code, this assertion happens in the
function ADIOI_W_Exchange_data_alltoallv().  Now, there's no reason to
really care about that on the parallel-netcdf mailing list, but I did
just last month spend a ton of time tracking down a problem in this
very routine (different cause, but similar result).

> The core file says it happens while saving independent data. It is
> not the first time I have had such a problem when reading/saving
> independent data. I usually work around it by using collective
> communications before saving/reading data, but this is not possible
> this time.

Can you post the backtrace
(https://wiki.alcf.anl.gov/index.php/Debugging#Decoding_core_files but
I think you already know these steps)?

> So my question is: do you think it comes from my code, which might
> have some memory leaks that cause trouble when pnetcdf is
> allocating temporary arrays, or might it come from incorrect use of
> pnetcdf with independent data?

There are three places that might have a problem: your code,
parallel-netcdf, and also the MPI library.
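
To confirm that memory really does grow between saves (whichever of
the three is responsible), one rough approach is to log each process's
peak memory around every intermediate save. A minimal C sketch,
assuming getrusage() is available and reports ru_maxrss meaningfully on
the compute nodes (not verified for BG/P); peak_rss and report_growth
are hypothetical helper names, not part of parallel-netcdf or MPI:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of this process, as reported by getrusage().
 * Units are platform-dependent (kilobytes on Linux, bytes on macOS),
 * so compare values only against earlier values from the same run. */
static long peak_rss(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);   /* assumption: the call is supported here */
    (void)rc;          /* silence unused warning when NDEBUG is set */
    return ru.ru_maxrss;
}

/* Hypothetical helper: print the peak and how much it grew since the
 * value the caller recorded earlier, then return the new peak. */
static long report_growth(int step, long previous)
{
    long now = peak_rss();
    printf("step %d: peak RSS %ld (grew by %ld)\n",
           step, now, now - previous);
    return now;
}
```

Calling report_growth() right before and right after each save would
suggest whether the growth happens inside the I/O calls or elsewhere in
the application. In an MPI program it is probably enough to print from
a handful of ranks rather than all of them.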

Are you on Argonne's BGP?  If so, I've got a custom-built MPI library
that has addressed several resource leaks in the past.  It's
installed under ~robl/soft/dcmf-2009078 (i.e., use
~robl/soft/dcmf-2009078/bin/mpif90 to build your program).

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

