[mpich-discuss] Failures with Inter-communicator Collectives

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Tue Oct 12 15:57:46 CDT 2010


Darius,
           Thanks for the update.

Regards,
Krishna

On Tue, Oct 12, 2010 at 4:37 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

>
> It looks like we're not calling the comm create hook everywhere we should,
> so we end up calling the comm destroy hook on something we didn't create.
>  I've created a ticket for this.
>
> https://trac.mcs.anl.gov/projects/mpich2/ticket/1118
>
> -d
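A minimal sketch of the failure mode Darius describes, using hypothetical
names rather than the actual MPICH2 internals: if the create hook is
skipped for some communicators (say, inter-communicators) while the
destroy hook runs unconditionally, free() is handed a pointer that was
never allocated, which is consistent with the "invalid free()" reports
valgrind produces below.

    #include <stdlib.h>

    /* Hypothetical per-communicator state; field names are
     * illustrative only, not MPICH2's. */
    struct comm {
        int   is_intercomm;
        void *shm_coll_state;   /* shared-memory collectives state */
    };

    static void comm_create_hook(struct comm *c)
    {
        if (c->is_intercomm)
            return;                     /* hook skipped: nothing allocated */
        c->shm_coll_state = malloc(128);
    }

    static void comm_destroy_hook(struct comm *c)
    {
        /* Bug: for an inter-communicator this frees an uninitialized
         * pointer, and glibc aborts with "double free or corruption".
         * The fix is to make create and destroy symmetric, e.g. call
         * the create hook everywhere or guard the free() here. */
        free(c->shm_coll_state);
        c->shm_coll_state = NULL;
    }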
>
> On Oct 12, 2010, at 1:11 PM, Krishna Chaitanya Kandalla wrote:
>
> > Hi Pavan, Darius,
> >              Yes. The tests that I used were icbcast and icreduce, which
> are a part of the MPICH2 test suite. I have placed a copy of the valgrind
> output for the icreduce test at:
> http://www.cse.ohio-state.edu/~kandalla/tmp1/icreduce_valgrind
> >              It's quite a big file and I see a bunch of "invalid free()"
> reports there. Hope this gives you the information that you are looking
> for.
> >
> > Thanks,
> > Krishna
> >
> > On Tue, Oct 12, 2010 at 1:41 PM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> >
> > He did: icbcast and icreduce in the MPICH2 test suite :-).
> >
> >  -- Pavan
> >
> >
> > On 10/12/2010 12:20 PM, Darius Buntinas wrote:
> > Also, can you send us a test program?
> >
> > Thanks,
> > -d
> >
> > On Oct 12, 2010, at 12:09 PM, Darius Buntinas wrote:
> >
> > Can you guys run it through valgrind?
> >
> > Thanks,
> > -d
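For reference, MPICH2 test binaries run under mpiexec, so one reasonable
way to get a per-process valgrind log (the process count and flags are
just an example) is:

    mpiexec -n 4 valgrind --leak-check=full --log-file=vg.%p ./icreduce

The %p in --log-file expands to each process's PID, which keeps the
ranks' reports from interleaving.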
> >
> > On Oct 12, 2010, at 12:04 PM, Krishna Chaitanya Kandalla wrote:
> >
> > Hi MPICH2 Developers,
> >      We found that if MPICH2-1.3rc2 is configured with
> --enable-nemesis-shm-collectives, then some of the inter-communicator
> collective tests in the MPICH2 test suite, such as icbcast and icreduce,
> fail. It appears to be some form of memory corruption, given the nature of
> the error messages that we are seeing:
> >
> > *** glibc detected *** ./icreduce: double free or corruption (!prev):
> 0x000000000413b720 ***
> > ======= Backtrace: =========
> > /lib64/libc.so.6[0x35a787230f]
> > /lib64/libc.so.6(cfree+0x4b)[0x35a787276b]
> > ./icreduce[0x438ae1]
> > ./icreduce[0x41bd57]
> > ./icreduce[0x41c085]
> > ./icreduce[0x41bf86]
> > ./icreduce[0x41c085]
> > ./icreduce[0x418ca7]
> > ./icreduce[0x404bf6]
> > ./icreduce[0x40228a]
> > /lib64/libc.so.6(__libc_start_main+0xf4)[0x35a781d994]
> > ./icreduce[0x401f69]
> > ======= Memory map: ========
> > 00400000-004b6000 r-xp 00000000 00:17 2693971
>  /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> > 006b6000-006b8000 rw-p 000b6000 00:17 2693971
>  /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
> > 006b8000-006df000 rw-p 006b8000 00:00 0
> > 04136000-0415f000 rw-p 04136000 00:00 0
>  [heap]
> > 35a7400000-35a741c000 r-xp 00000000 fd:00 4391224
>  /lib64/ld-2.5.so
> > 35a761b000-35a761c000 r--p 0001b000 fd:00 4391224
>  /lib64/ld-2.5.so
> > 35a761c000-35a761d000 rw-p 0001c000 fd:00 4391224
>  /lib64/ld-2.5.so
> > 35a7800000-35a794e000 r-xp 00000000 fd:00 4391225
>  /lib64/libc-2.5.so
> > 35a794e000-35a7b4d000 ---p 0014e000 fd:00 4391225
>  /lib64/libc-2.5.so
> > 35a7b4d000-35a7b51000 r--p 0014d000 fd:00 4391225
>  /lib64/libc-2.5.so
> > 35a7b51000-35a7b52000 rw-p 00151000 fd:00 4391225
>  /lib64/libc-2.5.so
> > 35a7b52000-35a7b57000 rw-p 35a7b52000 00:00 0
> > 35a8400000-35a8416000 r-xp 00000000 fd:00 4391230
>  /lib64/libpthread-2.5.so
> > 35a8416000-35a8615000 ---p 00016000 fd:00 4391230
>  /lib64/libpthread-2.5.so
> > 35a8615000-35a8616000 r--p 00015000 fd:00 4391230
>  /lib64/libpthread-2.5.so
> > 35a8616000-35a8617000 rw-p 00016000 fd:00 4391230
>  /lib64/libpthread-2.5.so
> > 35a8617000-35a861b000 rw-p 35a8617000 00:00 0
> > 35aa000000-35aa00d000 r-xp 00000000 fd:00 4391237
>  /lib64/libgcc_s-4.1.2-20080825.so.1
> > 35aa00d000-35aa20d000 ---p 0000d000 fd:00 4391237
>  /lib64/libgcc_s-4.1.2-20080825.so.1
> > 35aa20d000-35aa20e000 rw-p 0000d000 fd:00 4391237
>  /lib64/libgcc_s-4.1.2-20080825.so.1
> > [...]
> > PMPI_Comm_split(400)..............:
> > MPIR_Comm_split_impl(88)..........:
> > MPIR_Allgather_impl(744)..........:
> > MPIR_Allgather(705)...............:
> > MPIR_Allgather_intra(177).........:
> > MPIC_Sendrecv(189)................:
> > MPIC_Wait(528)....................:
> > MPIDI_CH3I_Progress(334)..........:
> > MPID_nem_mpich2_blocking_recv(906):
> > MPID_nem_tcp_connpoll(1875).......:
> > state_commrdy_handler(1703).......:
> > MPID_nem_tcp_recv_handler(1682)...: Communication error with rank 4
> > MPID_nem_tcp_recv_handler(1582)...: socket closed
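A reproduction sketch based only on what is reported above; the
test-suite path and process count are assumptions:

    ./configure --enable-nemesis-shm-collectives
    make && make install
    cd test/mpi/coll
    mpiexec -n 4 ./icreduce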
> >
> >              Please let us know if you need any further clarification
> regarding this error.
> >
> > Thanks,
> > Krishna
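As background on what these tests exercise: an inter-communicator reduce
follows MPI's rooted-collective convention, where processes in the
root's group pass MPI_ROOT (the root itself) or MPI_PROC_NULL, and
processes in the other group pass the root's rank in the remote group.
A minimal, self-contained sketch (not the actual icreduce test code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm local, inter;
        int wrank, lrank, color, one = 1, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

        /* Split MPI_COMM_WORLD into two groups (needs >= 2 processes)
         * and join them with an inter-communicator; the remote leader
         * is the lowest world rank of the other group. */
        color = wrank % 2;
        MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &local);
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - color, 0, &inter);
        MPI_Comm_rank(local, &lrank);

        if (color == 0) {
            /* contributing group: pass the root's rank in the remote
             * group */
            MPI_Reduce(&one, NULL, 1, MPI_INT, MPI_SUM, 0, inter);
        } else if (lrank == 0) {
            MPI_Reduce(NULL, &sum, 1, MPI_INT, MPI_SUM, MPI_ROOT, inter);
            printf("sum from remote group = %d\n", sum);
        } else {
            MPI_Reduce(NULL, NULL, 1, MPI_INT, MPI_SUM, MPI_PROC_NULL, inter);
        }

        MPI_Comm_free(&inter);
        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }

With a build that has the create/destroy-hook asymmetry, the
MPI_Comm_free calls are where a mismatched destroy hook would
presumably fire.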
> >
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>