Out of memory problem
Rob Latham
robl at mcs.anl.gov
Tue Sep 8 08:43:56 CDT 2009
On Sat, Sep 05, 2009 at 01:44:55PM +0200, Julien Bodart wrote:
> Hi everybody,
>
> I am using parallel-netcdf on a Blue Gene/P computer and I am having
> trouble when running large cases.
How large (# of cores)?
> The error message looks like:
>
> "abort(1) on node 2820 (rank 2820 in comm 1140850688): application called
> MPI_Abort(MPI_COMM_WORLD, 1) - process 2820
> Out of memory in file
> /bglhome/usr6/bgbuild/V1R3M0_460_2008-081112P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c,
> line 1238"
>
> It looks like a memory leak, as the program is able to perform many
> intermediate saves during the run and then suddenly stops with the
> previous error message.
If I'm looking at the right source code, this assertion happens in the
function ADIOI_W_Exchange_data_alltoallv(). Now, there's no reason to
really care about that on the parallel-netcdf mailing list, but I did
just last month spend a ton of time tracking down a problem in this
very routine (different cause, but similar result).
> The core file says it happens while saving independent data. It is
> not the first time I have had such a problem when reading/saving
> independent data. I usually work around it by using collective
> communications before saving/reading data, but this is not possible
> this time.
Can you post the backtrace
(https://wiki.alcf.anl.gov/index.php/Debugging#Decoding_core_files but
I think you already know these steps)?
> So my question is: do you think it comes from my code, which might
> have some memory leaks that cause trouble when pnetcdf is
> allocating temporary arrays, or might it come from incorrect use of
> pnetcdf with independent data?
The three places that might have a problem are your code,
parallel-netcdf, and the MPI library.
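To help decide between those, one rough check is to count the live heap
allocations in your own code and see whether the count grows from one
save to the next. Below is a minimal sketch in C, assuming your buffers
go through malloc/free. The names tracked_malloc, tracked_free,
tracked_live_blocks, and fake_save_step are hypothetical, made up for
illustration, and are not part of parallel-netcdf or MPI. A Fortran
code would need an equivalent counter around ALLOCATE/DEALLOCATE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helpers (not part of parallel-netcdf or MPI): count the
 * allocations that are still live, so that growth between saves shows up. */
static long live_blocks = 0;

/* Allocate nbytes and record the block as live if the allocation succeeded. */
void *tracked_malloc(size_t nbytes)
{
    void *p = malloc(nbytes);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Release a block obtained from tracked_malloc and update the counter. */
void tracked_free(void *p)
{
    if (p == NULL)
        return;
    assert(live_blocks > 0);   /* catches a free with no matching allocation */
    live_blocks--;
    free(p);
}

/* Number of tracked blocks that have been allocated but not yet freed. */
long tracked_live_blocks(void)
{
    return live_blocks;
}

/* Stand-in for one intermediate save (an assumption, not real I/O):
 * allocates a temporary buffer and, when leak is nonzero, deliberately
 * forgets to free it in order to simulate a leak. */
void fake_save_step(size_t nbytes, int leak)
{
    void *buf = tracked_malloc(nbytes);
    if (!leak)
        tracked_free(buf);
}
```

If tracked_live_blocks() reports the same value after every save, a leak
in the tracked allocations of your own code becomes less likely, which
would point toward the library layers instead.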
Are you on Argonne's BGP? If so, I've got a custom-built MPI library
that has addressed several resource leaks in the past. It's
installed under ~robl/soft/dcmf-2009078 (i.e. use
~robl/soft/dcmf-2009078/bin/mpif90 to build your program).
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA