[mpich-discuss] Hang inside MPI_Waitall with x86_64

Rajeev Thakur thakur at mcs.anl.gov
Mon Mar 30 15:22:10 CDT 2009


MPICH-1 is an old implementation and no longer actively supported. Can you
try using MPICH2 instead?

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
> Saurabh Tendulkar
> Sent: Monday, March 30, 2009 3:14 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] Hang inside MPI_Waitall with x86_64
> 
> 
> Hi,
> I have some code that often (but not always) hangs at very 
> similar locations inside MPI_Waitall. This happens *only* on 
> 64-bit Linux (x86_64, Red Hat EL5, GCC 4.1) and, as far as I 
> can tell, only with optimized code (-O2 for the app; MPICH 
> itself was built with default settings). I've tried MPICH 
> 1.2.3 and the latest 1.2.7p1.
> 
> This is a 3-process run. The stack traces of the 3 processes 
> (A, B, C) are as follows. Which rank ends up with which trace 
> varies between runs, even with the same mpirun settings.
> 
> A: 
> #0  __select_nocancel () from /lib64/libc.so.6
> #1  net_recv ()
> #2  socket_recv_on_fd ()
> #3  socket_recv ()
> #4  net_send_w ()
> #5  net_send ()
> #6  net_send2 ()
> #7  socket_send ()
> #8  send_message ()
> #9  MPID_CH_Rndvb_ack ()
> #10 MPID_CH_Check_incoming ()
> #11 MPID_DeviceCheck ()
> #12 MPID_WaitForCompleteSend ()
> #13 MPID_SendComplete ()
> #14 PMPI_Waitall ()
> 
> B:
> #0  __select_nocancel () from /lib64/libc.so.6
> #1  p4_sockets_ready ()
> #2  net_send_w ()
> #3  net_send ()
> #4  net_send2 ()
> #5  socket_send ()
> #6  send_message ()
> #7  MPID_CH_Rndvb_ack ()
> #8  MPID_CH_Check_incoming ()
> #9  MPID_DeviceCheck ()
> #10 MPID_WaitForCompleteSend ()
> #11 MPID_SendComplete ()
> #12 PMPI_Waitall ()
> Note: #0 could instead be recv ()
> 
> C:
> #0  __write_nocancel () from /lib64/libpthread.so.0
> #1  net_send_w ()
> #2  net_send ()
> #3  net_send2 ()
> #4  socket_send ()
> #5  send_message ()
> #6  MPID_CH_Rndvb_ack ()
> #7  MPID_CH_Check_incoming ()
> #8  MPID_DeviceCheck ()
> #9  MPID_WaitForCompleteSend ()
> #10 MPID_SendComplete ()
> #11 PMPI_Waitall ()
> Note: Instead of #0-#5 for C, there can be the following 
> (#6-#11 above are then the same as #4-#9 here):
> #0  __select_nocancel () from /lib64/libc.so.6
> #1  socket_recv ()
> #2  recv_message ()
> #3  p4_recv ()
> 
> The MPI_Waitall follows an MPI_Irecv/MPI_Isend block that 
> exchanges data between the three processes. I have verified 
> all the data counts, etc. Note that this shows up only on 
> 64-bit Linux. It does not always happen, but when it does, 
> the stack traces are as above.
> 
> I am not at all familiar with MPICH internals, so I do not 
> know what is going on here. Can anyone shed some light, and 
> suggest what to look for in my code that might be causing 
> these problems?
> 
> Thank you.
> saurabh
> 
> 
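
The quoted message mentions verifying the data counts on both sides of the MPI_Irecv/MPI_Isend exchange. A minimal sketch of such a check is given below, assuming each rank's send and receive counts have already been collected into two flat n-by-n tables (for example from per-rank log output). The function name count_mismatched_pairs and the table layout are hypothetical and are not part of MPI or MPICH. Nothing in the thread establishes that a count mismatch causes this particular hang; the sketch only illustrates one thing to rule out.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of MPI or MPICH.
 *
 * send_count[i * n + j]: number of elements rank i sends to rank j
 * recv_count[j * n + i]: number of elements rank j is prepared to
 *                        receive from rank i
 * A negative entry means no message is posted for that pair.
 *
 * Returns the number of (sender, receiver) pairs that disagree:
 * either only one side posted a message, or the sender transmits
 * more elements than the receiver's buffer can hold.
 */
int count_mismatched_pairs(const int *send_count,
                           const int *recv_count, int n)
{
    int mismatches = 0;

    assert(n > 0);

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int sent = send_count[i * n + j];
            int expected = recv_count[j * n + i];
            int send_posted = (sent >= 0);
            int recv_posted = (expected >= 0);

            if (send_posted != recv_posted) {
                /* A send with no receive, or a receive with no send. */
                ++mismatches;
            } else if (send_posted && sent > expected) {
                /* The message would not fit in the posted receive. */
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

For the 3-process run described above, n would be 3 and a return value of 0 means every posted send has a receive large enough to hold it. A receive count larger than the matching send count is not flagged, because MPI treats the receive count as a buffer capacity, not an exact length.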


