[mpich-discuss] Fatal error in PMPI_Alltoall: Other MPI error, error stack

Dave Goodell goodell at mcs.anl.gov
Fri Oct 19 14:26:30 CDT 2012


The error means that the "handle allocator" under the hood for datatype objects was unable to allocate more memory via "calloc".  In some cases this can actually be a fair amount of memory (maybe 10s-100s of MiB?), but I can't tell you precisely how much off the top of my head.
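
One common (though certainly not the only) way applications run that allocator out of memory is by creating derived datatypes inside the time-stepping loop without ever freeing them; an ordinary heap leak elsewhere in the code can produce the same failure, since the allocator ultimately just calls calloc.  Purely as an illustration (none of this is taken from your code, and the counts are made up to echo your error stack), the leaky pattern and its fix look like this in C:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int nprocs, it;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* 22 doubles per destination rank, as in the error stack below */
        double *sbuf = malloc(22 * nprocs * sizeof(double));
        double *rbuf = malloc(22 * nprocs * sizeof(double));

        for (it = 0; it < 2000; it++) {
            MPI_Datatype blk;

            /* a committed derived type built every iteration */
            MPI_Type_contiguous(22, MPI_DOUBLE, &blk);
            MPI_Type_commit(&blk);

            MPI_Alltoall(sbuf, 1, blk, rbuf, 1, blk, MPI_COMM_WORLD);

            /* without this call, each iteration leaks one datatype handle
               plus its backing storage, and the handle allocator eventually
               fails with an "Out of memory" stack much like yours */
            MPI_Type_free(&blk);
        }

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }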

You almost certainly have some sort of memory leak or general memory consumption problem.  You might try using Valgrind's leak checking facility (on a smaller problem, since running under valgrind consumes extra memory).  You might also try Valgrind's "massif" heap profiling tool.

http://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks
http://valgrind.org/docs/manual/ms-manual.html
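
In case it helps, a typical invocation looks something like this (the executable name is just a placeholder, and you should use a small test case because valgrind adds a lot of time and memory overhead):

    # leak check, one log file per MPI rank
    mpiexec -n 4 valgrind --leak-check=full --log-file=vg.%p.log ./your_solver

    # heap profiling with massif; inspect the per-rank output afterwards
    mpiexec -n 4 valgrind --tool=massif ./your_solver
    ms_print massif.out.<pid>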

A lighter-weight memory profiling option is available in GNU libc, but you'll need to put your thinking cap on a bit tighter in order to get it running and decipher its output:

http://www.gnu.org/software/libc/manual/html_node/Allocation-Debugging.html
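
Roughly speaking, you call mtrace() near the start of the program (from a small C shim or a C routine called from your Fortran code), point the MALLOC_TRACE environment variable at a log file before running, and afterwards feed the log to the `mtrace' perl script that ships with glibc.  A minimal sketch of the mechanism in C (the deliberate leak is only there to give the tool something to report):

    #include <mcheck.h>
    #include <stdlib.h>

    int main(void)
    {
        mtrace();                  /* start logging malloc/free to $MALLOC_TRACE */

        void *p = malloc(1024);    /* deliberately leaked for demonstration */
        (void) p;

        muntrace();                /* stop tracing; the leak appears in the log */
        return 0;
    }

Then something like:

    MALLOC_TRACE=mtrace.log ./a.out
    mtrace ./a.out mtrace.log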

-Dave

On Oct 18, 2012, at 9:37 PM CDT, Ryan Crocker wrote:

> I use my activity monitor to see whether the amount of virtual and physical memory per processor increases as I run a simulation, and it does not, so there is no blatant memory leak.  I've also done a core dump, and I am below 2GB of allocated memory per node on both the cluster and my workstation.  I really need to know what specifically that error means.  Is the node out of memory, are the send and receive buffers different sizes, or something else?  I only copied one error here, but when I run in verbose mode every processor spits out that error, so that has me stumped.
> 
> On Oct 18, 2012, at 6:16 PM, Jeff Hammond wrote:
> 
>> What do you mean "I've also watched the memory usage..."?  Are you
>> using a memory profiler, such as the one in TAU?  Calling "free" from
>> the command line while a simulation is running is not a reliable way
>> to determine if the application is allocating memory.
>> 
>> Jeff
>> 
>> On Thu, Oct 18, 2012 at 5:24 PM, Ryan Crocker <rcrocker at uvm.edu> wrote:
>>> I'm using MPICH2 1.3.1, compiled with gcc and gfortran in 64-bit mode, on a Linux cluster with 144 processors and 2GB per node.  I'm running an in-house flow solver written in Fortran, and for some reason I get this error:
>>> 
>>> MXMPI:FATAL-ERROR:0:Fatal error in PMPI_Alltoall: Other MPI error, error stack:
>>> PMPI_Alltoall(773).....................: MPI_Alltoall(sbuf=0x273436e0, scount=22, MPI_DOUBLE_PRECISION, rbuf=0x1cd98fb0, rcount=22, MPI_DOUBLE_PRECISION, comm=0x84000001) failed
>>> MPIR_Alltoall_impl(651)................:
>>> MPIR_Alltoall(619).....................:
>>> MPIR_Alltoall_intra(206)...............:
>>> MPIR_Type_create_indexed_block_impl(48):
>>> MPID_Type_vector(57)...................: Out of memory
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>> 
>>> It happens about 2000 iterations into my run.  I've run the exact same simulation on a Mac workstation and I do not get this error.  I've also watched the memory usage there; it does not increase during the run on the workstation.
>>> 
>>> So far I've tried adding an MPI_BARRIER in front of my alltoall calls, but that does not seem to help.  I've also tried a locally built MPICH2 1.5, and though it is slower, I run into the same problem at the same iteration number.
>>> 
>>> Ryan Crocker
>>> University of Vermont, School of Engineering
>>> Mechanical Engineering Department
>>> rcrocker at uvm.edu
>>> 315-212-7331
>>> 
>> 
>> 
>> 
>> -- 
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> 
> Ryan Crocker
> University of Vermont, School of Engineering
> Mechanical Engineering Department
> rcrocker at uvm.edu
> 315-212-7331
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


