[mpich-discuss] Hang inside MPI_Waitall with x86_64
Saurabh Tendulkar
gillette206 at yahoo.com
Mon Mar 30 15:13:32 CDT 2009
Hi,
I have some code that often (but not always) hangs at very similar locations inside MPI_Waitall. This happens *only* on 64-bit Linux (x86_64, Red Hat EL5, gcc 4.1) and, as far as I can tell, only with optimized code (-O2 for the application; MPICH itself was built with default settings). I've tried MPICH 1.2.3 and the latest 1.2.7p1.
This is a 3-process run. The stack traces of the three processes (A, B, C) are as follows. Which rank ends up with which trace is not fixed, even with the same mpirun settings.
A:
#0 __select_nocancel () from /lib64/libc.so.6
#1 net_recv ()
#2 socket_recv_on_fd ()
#3 socket_recv ()
#4 net_send_w ()
#5 net_send ()
#6 net_send2 ()
#7 socket_send ()
#8 send_message ()
#9 MPID_CH_Rndvb_ack ()
#10 MPID_CH_Check_incoming ()
#11 MPID_DeviceCheck ()
#12 MPID_WaitForCompleteSend ()
#13 MPID_SendComplete ()
#14 PMPI_Waitall ()
B:
#0 __select_nocancel () from /lib64/libc.so.6
#1 p4_sockets_ready ()
#2 net_send_w ()
#3 net_send ()
#4 net_send2 ()
#5 socket_send ()
#6 send_message ()
#7 MPID_CH_Rndvb_ack ()
#8 MPID_CH_Check_incoming ()
#9 MPID_DeviceCheck ()
#10 MPID_WaitForCompleteSend ()
#11 MPID_SendComplete ()
#12 PMPI_Waitall ()
Note: frame #0 is sometimes recv () instead.
C:
#0 __write_nocancel () from /lib64/libpthread.so.0
#1 net_send_w ()
#2 net_send ()
#3 net_send2 ()
#4 socket_send ()
#5 send_message ()
#6 MPID_CH_Rndvb_ack ()
#7 MPID_CH_Check_incoming ()
#8 MPID_DeviceCheck ()
#9 MPID_WaitForCompleteSend ()
#10 MPID_SendComplete ()
#11 PMPI_Waitall ()
Note: for C, frames #0-#5 are sometimes replaced by the four frames below; the remaining frames (#6-#11 above) are unchanged and become #4-#9:
#0 __select_nocancel () from /lib64/libc.so.6
#1 socket_recv ()
#2 recv_message ()
#3 p4_recv ()
The MPI_Waitall follows a block of MPI_Irecv/MPI_Isend calls that exchanges data among the three processes. I have verified all the data counts, etc. Again, this shows up only on 64-bit Linux. It does not happen on every run, but when it does, the stack traces are as above.
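For readers who want to repeat the count check, here is a minimal sketch of one way to do it. It is not the code from my application and not part of MPICH: the helper name find_count_mismatch and the two tables are hypothetical, and it assumes the per-rank counts have already been collected in one place (for example by printing them from each rank). It flags any pair where the count a rank passes to MPI_Isend differs from the count the peer passes to the matching MPI_Irecv.

```c
#include <stddef.h>

#define NPROCS 3  /* the run described here uses 3 processes */

/* Hypothetical helper, not an MPICH function.
 * send_counts[i][j]: count rank i passes to MPI_Isend for rank j.
 * recv_counts[i][j]: count rank i passes to MPI_Irecv from rank j.
 * Returns 0 if every send count equals the peer's receive count.
 * Otherwise returns 1 and stores the first mismatching sender and
 * receiver in *src and *dst.
 * Note: MPI only requires the receive count to be at least the send
 * count, so this exact-match test is deliberately stricter. */
int find_count_mismatch(int send_counts[NPROCS][NPROCS],
                        int recv_counts[NPROCS][NPROCS],
                        int *src, int *dst)
{
    for (int i = 0; i < NPROCS; i++) {
        for (int j = 0; j < NPROCS; j++) {
            if (i == j)
                continue;  /* assumption: no rank exchanges with itself */
            if (send_counts[i][j] != recv_counts[j][i]) {
                *src = i;
                *dst = j;
                return 1;
            }
        }
    }
    return 0;
}
```

A return value of 0 only says the counts agree; it says nothing about tags, datatypes, or the order in which the requests are posted.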
I am not at all familiar with MPICH internals, so I do not know what is going on here. Can anyone shed some light, and suggest what to look for in my code that might be causing these problems?
Thank you.
saurabh