[mpich-discuss] Failures with Inter-communicator Collectives

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Tue Oct 12 13:11:30 CDT 2010


Hi Pavan, Darius,
             Yes. The tests that I used were icbcast and icreduce, which are
part of the MPICH2 test suite. I have placed a copy of the valgrind output
for the icreduce test at:
http://www.cse.ohio-state.edu/~kandalla/tmp1/icreduce_valgrind
             It's quite a big file, and I see a bunch of "invalid free()"
messages there. I hope this gives you the information that you are looking
for.

Thanks,
Krishna

On Tue, Oct 12, 2010 at 1:41 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> He did: icbcast and icreduce in the MPICH2 test suite :-).
>
>  -- Pavan
>
>
> On 10/12/2010 12:20 PM, Darius Buntinas wrote:
>
>> Also, can you send us a test program?
>>
>> Thanks,
>> -d
>>
>> On Oct 12, 2010, at 12:09 PM, Darius Buntinas wrote:
>>
>>  Can you guys run it through valgrind?
>>>
>>> Thanks,
>>> -d
>>>
>>> On Oct 12, 2010, at 12:04 PM, Krishna Chaitanya Kandalla wrote:
>>>
>>>  Hi MPICH2 Developers,
>>>>      We found that if MPICH2-1.3rc2 is configured with
>>>> --enable-nemesis-shm-collectives, then some of the inter-communicator
>>>> collective tests in the MPICH2 test suite, such as icbcast and icreduce,
>>>> fail. It appears to be some form of memory corruption, given the nature of
>>>> the error messages that we are seeing:
>>>>
>>>> *** glibc detected *** ./icreduce: double free or corruption (!prev):
>>>> 0x000000000413b720 ***
>>>> ======= Backtrace: =========
>>>> /lib64/libc.so.6[0x35a787230f]
>>>> /lib64/libc.so.6(cfree+0x4b)[0x35a787276b]
>>>> ./icreduce[0x438ae1]
>>>> ./icreduce[0x41bd57]
>>>> ./icreduce[0x41c085]
>>>> ./icreduce[0x41bf86]
>>>> ./icreduce[0x41c085]
>>>> ./icreduce[0x418ca7]
>>>> ./icreduce[0x404bf6]
>>>> ./icreduce[0x40228a]
>>>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x35a781d994]
>>>> ./icreduce[0x401f69]
>>>> ======= Memory map: ========
>>>> 00400000-004b6000 r-xp 00000000 00:17 2693971
>>>>  /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
>>>> 006b6000-006b8000 rw-p 000b6000 00:17 2693971
>>>>  /home/kandalla/mpich2-1.3rc2/icreduce (deleted)
>>>> 006b8000-006df000 rw-p 006b8000 00:00 0
>>>> 04136000-0415f000 rw-p 04136000 00:00 0
>>>>  [heap]
>>>> 35a7400000-35a741c000 r-xp 00000000 fd:00 4391224
>>>>  /lib64/ld-2.5.so
>>>> 35a761b000-35a761c000 r--p 0001b000 fd:00 4391224
>>>>  /lib64/ld-2.5.so
>>>> 35a761c000-35a761d000 rw-p 0001c000 fd:00 4391224
>>>>  /lib64/ld-2.5.so
>>>> 35a7800000-35a794e000 r-xp 00000000 fd:00 4391225
>>>>  /lib64/libc-2.5.so
>>>> 35a794e000-35a7b4d000 ---p 0014e000 fd:00 4391225
>>>>  /lib64/libc-2.5.so
>>>> 35a7b4d000-35a7b51000 r--p 0014d000 fd:00 4391225
>>>>  /lib64/libc-2.5.so
>>>> 35a7b51000-35a7b52000 rw-p 00151000 fd:00 4391225
>>>>  /lib64/libc-2.5.so
>>>> 35a7b52000-35a7b57000 rw-p 35a7b52000 00:00 0
>>>> 35a8400000-35a8416000 r-xp 00000000 fd:00 4391230
>>>>  /lib64/libpthread-2.5.so
>>>> 35a8416000-35a8615000 ---p 00016000 fd:00 4391230
>>>>  /lib64/libpthread-2.5.so
>>>> 35a8615000-35a8616000 r--p 00015000 fd:00 4391230
>>>>  /lib64/libpthread-2.5.so
>>>> 35a8616000-35a8617000 rw-p 00016000 fd:00 4391230
>>>>  /lib64/libpthread-2.5.so
>>>> 35a8617000-35a861b000 rw-p 35a8617000 00:00 0
>>>> 35aa000000-35aa00d000 r-xp 00000000 fd:00 4391237
>>>>  /lib64/libgcc_s-4.1.2-20080825.so.1
>>>> 35aa00d000-35aa20d000 ---p 0000d000 fd:00 4391237
>>>>  /lib64/libgcc_s-4.1.2-20080825.so.1
>>>> 35aa20d000-35aa20e000 rw-p 0000d000 fd:00 4391237
>>>>  /lib64/libgcc_s-4.1.2-20080825.s
>>>> PMPI_Comm_split(400)..............:
>>>> MPIR_Comm_split_impl(88)..........:
>>>> MPIR_Allgather_impl(744)..........:
>>>> MPIR_Allgather(705)...............:
>>>> MPIR_Allgather_intra(177).........:
>>>> MPIC_Sendrecv(189)................:
>>>> MPIC_Wait(528)....................:
>>>> MPIDI_CH3I_Progress(334)..........:
>>>> MPID_nem_mpich2_blocking_recv(906):
>>>> MPID_nem_tcp_connpoll(1875).......:
>>>> state_commrdy_handler(1703).......:
>>>> MPID_nem_tcp_recv_handler(1682)...: Communication error with rank 4
>>>> MPID_nem_tcp_recv_handler(1582)...: socket closed
>>>>
>>>>              Please let us know if you need any other clarification
>>>> regarding this error.
>>>>
>>>> Thanks,
>>>> Krishna
>>>>
>>>>
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>
>>>
>>>
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>