[mpich-discuss] Failures with Inter-communicator Collectives

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Tue Oct 12 12:04:01 CDT 2010


Hi MPICH2 Developers,

We found that when MPICH2-1.3rc2 is configured with
--enable-nemesis-shm-collectives, some of the inter-communicator
collective tests in the MPICH2 test suite, such as icbcast and icreduce,
fail. The failure appears to be some form of memory corruption, judging by
the error messages that we are seeing:

*** glibc detected *** ./icreduce: double free or corruption (!prev): 0x000000000413b720 ***
======= Backtrace: =========
/lib64/libc.so.6[0x35a787230f]
/lib64/libc.so.6(cfree+0x4b)[0x35a787276b]
./icreduce[0x438ae1]
./icreduce[0x41bd57]
./icreduce[0x41c085]
./icreduce[0x41bf86]
./icreduce[0x41c085]
./icreduce[0x418ca7]
./icreduce[0x404bf6]
./icreduce[0x40228a]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x35a781d994]
./icreduce[0x401f69]
======= Memory map: ========
00400000-004b6000 r-xp 00000000 00:17 2693971 /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
006b6000-006b8000 rw-p 000b6000 00:17 2693971 /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
006b8000-006df000 rw-p 006b8000 00:00 0
04136000-0415f000 rw-p 04136000 00:00 0 [heap]
35a7400000-35a741c000 r-xp 00000000 fd:00 4391224 /lib64/ld-2.5.so
35a761b000-35a761c000 r--p 0001b000 fd:00 4391224 /lib64/ld-2.5.so
35a761c000-35a761d000 rw-p 0001c000 fd:00 4391224 /lib64/ld-2.5.so
35a7800000-35a794e000 r-xp 00000000 fd:00 4391225 /lib64/libc-2.5.so
35a794e000-35a7b4d000 ---p 0014e000 fd:00 4391225 /lib64/libc-2.5.so
35a7b4d000-35a7b51000 r--p 0014d000 fd:00 4391225 /lib64/libc-2.5.so
35a7b51000-35a7b52000 rw-p 00151000 fd:00 4391225 /lib64/libc-2.5.so
35a7b52000-35a7b57000 rw-p 35a7b52000 00:00 0
35a8400000-35a8416000 r-xp 00000000 fd:00 4391230 /lib64/libpthread-2.5.so
35a8416000-35a8615000 ---p 00016000 fd:00 4391230 /lib64/libpthread-2.5.so
35a8615000-35a8616000 r--p 00015000 fd:00 4391230 /lib64/libpthread-2.5.so
35a8616000-35a8617000 rw-p 00016000 fd:00 4391230 /lib64/libpthread-2.5.so
35a8617000-35a861b000 rw-p 35a8617000 00:00 0
35aa000000-35aa00d000 r-xp 00000000 fd:00 4391237 /lib64/libgcc_s-4.1.2-20080825.so.1
35aa00d000-35aa20d000 ---p 0000d000 fd:00 4391237 /lib64/libgcc_s-4.1.2-20080825.so.1
35aa20d000-35aa20e000 rw-p 0000d000 fd:00 4391237 /lib64/libgcc_s-4.1.2-20080825.s

PMPI_Comm_split(400)..............:
MPIR_Comm_split_impl(88)..........:
MPIR_Allgather_impl(744)..........:
MPIR_Allgather(705)...............:
MPIR_Allgather_intra(177).........:
MPIC_Sendrecv(189)................:
MPIC_Wait(528)....................:
MPIDI_CH3I_Progress(334)..........:
MPID_nem_mpich2_blocking_recv(906):
MPID_nem_tcp_connpoll(1875).......:
state_commrdy_handler(1703).......:
MPID_nem_tcp_recv_handler(1682)...: Communication error with rank 4
MPID_nem_tcp_recv_handler(1582)...: socket closed

Please let us know if you need any further clarification regarding
this error.

Thanks,
Krishna