[mpich-discuss] Failures with Inter-communicator Collectives

Darius Buntinas buntinas at mcs.anl.gov
Tue Oct 12 15:37:16 CDT 2010


It looks like we're not calling the comm create hook everywhere we should, so we end up calling the comm destroy hook on something we didn't create.  I've created a ticket for this.

https://trac.mcs.anl.gov/projects/mpich2/ticket/1118

-d

On Oct 12, 2010, at 1:11 PM, Krishna Chaitanya Kandalla wrote:

> Hi Pavan, Darius, 
>              Yes. The tests that I used were icbcast and icreduce, which are part of the MPICH2 test suite. I have placed a copy of the valgrind output for the icreduce test at: http://www.cse.ohio-state.edu/~kandalla/tmp1/icreduce_valgrind
>              It's quite a big file, and I see a number of "invalid free()" reports in it. I hope this gives you the information that you are looking for. 
> 
> Thanks,
> Krishna
> 
> On Tue, Oct 12, 2010 at 1:41 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> 
> He did: icbcast and icreduce in the MPICH2 test suite :-).
> 
>  -- Pavan
> 
> 
> On 10/12/2010 12:20 PM, Darius Buntinas wrote:
> Also, can you send us a test program?
> 
> Thanks,
> -d
> 
> On Oct 12, 2010, at 12:09 PM, Darius Buntinas wrote:
> 
> Can you guys run it through valgrind?
> 
> Thanks,
> -d
> 
> On Oct 12, 2010, at 12:04 PM, Krishna Chaitanya Kandalla wrote:
> 
> Hi MPICH2 Developers,
>      We found that if MPICH2-1.3rc2 is configured with --enable-nemesis-shm-collectives, then some of the inter-communicator collective tests in the MPICH2 test suite, such as icbcast and icreduce, fail. It appears to be some form of memory corruption, given the nature of the error messages that we are seeing:
> 
> *** glibc detected *** ./icreduce: double free or corruption (!prev): 0x000000000413b720 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x35a787230f]
> /lib64/libc.so.6(cfree+0x4b)[0x35a787276b]
> ./icreduce[0x438ae1]
> ./icreduce[0x41bd57]
> ./icreduce[0x41c085]
> ./icreduce[0x41bf86]
> ./icreduce[0x41c085]
> ./icreduce[0x418ca7]
> ./icreduce[0x404bf6]
> ./icreduce[0x40228a]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x35a781d994]
> ./icreduce[0x401f69]
> ======= Memory map: ========
> 00400000-004b6000 r-xp 00000000 00:17 2693971                            /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> 006b6000-006b8000 rw-p 000b6000 00:17 2693971                            /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> 006b8000-006df000 rw-p 006b8000 00:00 0
> 04136000-0415f000 rw-p 04136000 00:00 0                                  [heap]
> 35a7400000-35a741c000 r-xp 00000000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a761b000-35a761c000 r--p 0001b000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a761c000-35a761d000 rw-p 0001c000 fd:00 4391224                        /lib64/ld-2.5.so
> 35a7800000-35a794e000 r-xp 00000000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a794e000-35a7b4d000 ---p 0014e000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b4d000-35a7b51000 r--p 0014d000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b51000-35a7b52000 rw-p 00151000 fd:00 4391225                        /lib64/libc-2.5.so
> 35a7b52000-35a7b57000 rw-p 35a7b52000 00:00 0
> 35a8400000-35a8416000 r-xp 00000000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8416000-35a8615000 ---p 00016000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8615000-35a8616000 r--p 00015000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8616000-35a8617000 rw-p 00016000 fd:00 4391230                        /lib64/libpthread-2.5.so
> 35a8617000-35a861b000 rw-p 35a8617000 00:00 0
> 35aa000000-35aa00d000 r-xp 00000000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.so.1
> 35aa00d000-35aa20d000 ---p 0000d000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.so.1
> 35aa20d000-35aa20e000 rw-p 0000d000 fd:00 4391237                        /lib64/libgcc_s-4.1.2-20080825.s
> PMPI_Comm_split(400)..............:
> MPIR_Comm_split_impl(88)..........:
> MPIR_Allgather_impl(744)..........:
> MPIR_Allgather(705)...............:
> MPIR_Allgather_intra(177).........:
> MPIC_Sendrecv(189)................:
> MPIC_Wait(528)....................:
> MPIDI_CH3I_Progress(334)..........:
> MPID_nem_mpich2_blocking_recv(906):
> MPID_nem_tcp_connpoll(1875).......:
> state_commrdy_handler(1703).......:
> MPID_nem_tcp_recv_handler(1682)...: Communication error with rank 4
> MPID_nem_tcp_recv_handler(1582)...: socket closed
> 
>              Please let us know if you need any further clarification regarding this error.
> 
> Thanks,
> Krishna
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


