[mpich-discuss] Failures with Inter-communicator Collectives

Darius Buntinas buntinas at mcs.anl.gov
Tue Oct 12 12:09:43 CDT 2010


Can you guys run it through valgrind?
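
A minimal sketch of such a run, assuming a 4-process job launched with mpiexec; the process count and the log-file name are assumptions, not details from the report. It builds the command, prints it for inspection, and leaves the actual run commented out.

```shell
# Sketch: run every MPI rank of the failing test under valgrind.
# NP=4 and the log-file name are assumptions, not taken from the report.
NP=4
# --track-origins=yes reports where uninitialised values came from;
# %p in --log-file expands to the process ID, giving one log per rank.
VG="valgrind --track-origins=yes --log-file=vg.%p.log"
CMD="mpiexec -n $NP $VG ./icreduce"
echo "$CMD"     # inspect the command first
# eval "$CMD"   # uncomment to actually launch the job
```

Each rank then writes its own vg.<pid>.log, so an invalid free can be traced to a single process.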

Thanks,
-d

On Oct 12, 2010, at 12:04 PM, Krishna Chaitanya Kandalla wrote:

> Hi MPICH2 Developers,
>       We found that when MPICH2-1.3rc2 is configured with --enable-nemesis-shm-collectives, some of the inter-communicator collective tests in the MPICH2 test suite, such as icbcast and icreduce, fail. The error messages we are seeing suggest some form of memory corruption:
> 
> *** glibc detected *** ./icreduce: double free or corruption (!prev): 0x000000000413b720 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x35a787230f]
> /lib64/libc.so.6(cfree+0x4b)[0x35a787276b]
> ./icreduce[0x438ae1]
> ./icreduce[0x41bd57]
> ./icreduce[0x41c085]
> ./icreduce[0x41bf86]
> ./icreduce[0x41c085]
> ./icreduce[0x418ca7]
> ./icreduce[0x404bf6]
> ./icreduce[0x40228a]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x35a781d994]
> ./icreduce[0x401f69]
> ======= Memory map: ========
> 00400000-004b6000 r-xp 00000000 00:17 2693971                            /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> 006b6000-006b8000 rw-p 000b6000 00:17 2693971                            /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> 006b8000-006df000 rw-p 006b8000 00:00 0 
> 04136000-0415f000 rw-p 04136000 00:00 0                                  [heap]
> 35a7400000-35a741c000 r-xp 00000000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a761b000-35a761c000 r--p 0001b000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a761c000-35a761d000 rw-p 0001c000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a7800000-35a794e000 r-xp 00000000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a794e000-35a7b4d000 ---p 0014e000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b4d000-35a7b51000 r--p 0014d000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b51000-35a7b52000 rw-p 00151000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b52000-35a7b57000 rw-p 35a7b52000 00:00 0 
> 35a8400000-35a8416000 r-xp 00000000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8416000-35a8615000 ---p 00016000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8615000-35a8616000 r--p 00015000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8616000-35a8617000 rw-p 00016000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8617000-35a861b000 rw-p 35a8617000 00:00 0 
> 35aa000000-35aa00d000 r-xp 00000000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.so.1
> 35aa00d000-35aa20d000 ---p 0000d000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.so.1
> 35aa20d000-35aa20e000 rw-p 0000d000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.s
> PMPI_Comm_split(400)..............: 
> MPIR_Comm_split_impl(88)..........: 
> MPIR_Allgather_impl(744)..........: 
> MPIR_Allgather(705)...............: 
> MPIR_Allgather_intra(177).........: 
> MPIC_Sendrecv(189)................: 
> MPIC_Wait(528)....................: 
> MPIDI_CH3I_Progress(334)..........: 
> MPID_nem_mpich2_blocking_recv(906): 
> MPID_nem_tcp_connpoll(1875).......: 
> state_commrdy_handler(1703).......: 
> MPID_nem_tcp_recv_handler(1682)...: Communication error with rank 4
> MPID_nem_tcp_recv_handler(1582)...: socket closed
> 
>               Please let us know if you need any further clarification regarding this error.
> 
> Thanks,
> Krishna
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


