[mpich-discuss] Hang inside MPI_Waitall with x86_64

Dave Goodell goodell at mcs.anl.gov
Mon Mar 30 17:34:29 CDT 2009


MPICH1 and MPICH2 are absolutely not binary compatible.  Please also  
note that MPICH1 is not guaranteed to be binary compatible from one  
version to the next.  The same is true for MPICH2.  It is entirely  
possible for us to change something in mpi.h between releases,  
although it happens relatively infrequently.
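
As a rough illustration of why a change in mpi.h breaks binary compatibility, here is a small Python simulation (not MPICH code). The two record layouts, the field names, and both helper functions are hypothetical and invented for this sketch; they are not the real MPI_Status definitions from any MPICH release. It only shows the general mechanism: an application compiled against one header keeps reading memory with the old layout, even when the library now writes it with a new one.

```python
import struct

# Hypothetical "release 1" header: the status record is (source, tag, error).
LAYOUT_V1 = struct.Struct("<iii")
FIELDS_V1 = ("source", "tag", "error")

# Hypothetical "release 2" header: a count field was added at the front.
LAYOUT_V2 = struct.Struct("<iiii")
FIELDS_V2 = ("count", "source", "tag", "error")


def library_v2_fill_status(source, tag, error, count):
    """Stand-in for a library built against the release 2 header."""
    return LAYOUT_V2.pack(count, source, tag, error)


def app_v1_read_status(raw):
    """Stand-in for an application compiled against the release 1 header.

    It only knows the old layout, so it interprets the first 12 bytes as
    (source, tag, error), whatever the library actually stored there.
    """
    return dict(zip(FIELDS_V1, LAYOUT_V1.unpack_from(raw)))


# The library reports source=2, tag=7, but the old application misreads it.
raw = library_v2_fill_status(source=2, tag=7, error=0, count=1024)
seen = app_v1_read_status(raw)
print(seen)
```

Nothing fails loudly here: the old application simply sees the wrong source and tag. That is why relinking is not enough and the application has to be recompiled against the mpi.h of the library it will actually run with.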

-Dave

On Mar 30, 2009, at 5:19 PM, Saurabh Tendulkar wrote:

>
>
> Rajeev,
> That depends: is MPICH2 supposed to be binary compatible with
> MPICH-1? I tried googling for this but couldn't find an answer.
> Switching to MPICH2 with binary compatibility would be difficult
> enough; without it, it would be impossible.
>
> saurabh
>
> --- On Mon, 3/30/09, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
>> From: Rajeev Thakur <thakur at mcs.anl.gov>
>> Subject: RE: [mpich-discuss] Hang inside MPI_Waitall with x86_64
>> To: gillette206 at yahoo.com, mpich-discuss at mcs.anl.gov
>> Date: Monday, March 30, 2009, 4:22 PM
>> MPICH-1 is an old implementation and no longer actively
>> supported. Can you try using MPICH2 instead?
>>
>> Rajeev
>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>> Saurabh Tendulkar
>>> Sent: Monday, March 30, 2009 3:14 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [mpich-discuss] Hang inside MPI_Waitall with x86_64
>>>
>>>
>>> Hi,
>>> I have some code that often (but not always) hangs at very
>>> similar locations inside MPI_Waitall. This happens *only* on
>>> 64-bit Linux (x86_64, Red Hat EL5, gcc 4.1) and, as far as I
>>> can tell, only with optimized code (-O2 for the app; MPICH
>>> itself was built with default settings). I've tried MPICH
>>> 1.2.3 and the latest 1.2.7p1.
>>>
>>> This is a 3-process run. The stack traces of the 3 processes
>>> (A, B, C) are as follows (these are rank independent - even
>>> with the same mpirun settings).
>>>
>>> A:
>>> #0  __select_nocancel () from /lib64/libc.so.6
>>> #1  net_recv ()
>>> #2  socket_recv_on_fd ()
>>> #3  socket_recv ()
>>> #4  net_send_w ()
>>> #5  net_send ()
>>> #6  net_send2 ()
>>> #7  socket_send ()
>>> #8  send_message ()
>>> #9  MPID_CH_Rndvb_ack ()
>>> #10 MPID_CH_Check_incoming ()
>>> #11 MPID_DeviceCheck ()
>>> #12 MPID_WaitForCompleteSend ()
>>> #13 MPID_SendComplete ()
>>> #14 PMPI_Waitall ()
>>>
>>> B:
>>> #0  __select_nocancel () from /lib64/libc.so.6
>>> #1  p4_sockets_ready ()
>>> #2  net_send_w ()
>>> #3  net_send ()
>>> #4  net_send2 ()
>>> #5  socket_send ()
>>> #6  send_message ()
>>> #7  MPID_CH_Rndvb_ack ()
>>> #8  MPID_CH_Check_incoming ()
>>> #9  MPID_DeviceCheck ()
>>> #10 MPID_WaitForCompleteSend ()
>>> #11 MPID_SendComplete ()
>>> #12 PMPI_Waitall ()
>>> Note: #0 could instead be recv ()
>>>
>>> C:
>>> #0  __write_nocancel () from /lib64/libpthread.so.0
>>> #1  net_send_w ()
>>> #2  net_send ()
>>> #3  net_send2 ()
>>> #4  socket_send ()
>>> #5  send_message ()
>>> #6  MPID_CH_Rndvb_ack ()
>>> #7  MPID_CH_Check_incoming ()
>>> #8  MPID_DeviceCheck ()
>>> #9  MPID_WaitForCompleteSend ()
>>> #10 MPID_SendComplete ()
>>> #11 PMPI_Waitall ()
>>> Note: Instead of #0-#5 for C, there can be the following
>>> (#6-#11 are the same as #4-#9 here):
>>> #0  __select_nocancel () from /lib64/libc.so.6
>>> #1  socket_recv ()
>>> #2  recv_message ()
>>> #3  p4_recv ()
>>>
>>> The MPI_Waitall is after an MPI_Irecv/MPI_Isend block
>>> exchanging data between the three processes. I have verified
>>> all counts of data etc. Note that this shows up only with
>>> 64-bit Linux. It does not always happen, but when it does,
>>> it's with the stack traces as above.
>>>
>>> I am not at all familiar with MPICH internals, so I do not
>>> know what is going on here. Can anyone shed some light, and
>>> suggest what to look for in my code that might be causing
>>> these problems?
>>>
>>> Thank you.
>>> saurabh


