[mpich-discuss] Hang inside MPI_Waitall with x86_64

Gus Correa gus at ldeo.columbia.edu
Mon Mar 30 15:42:50 CDT 2009


Rajeev Thakur wrote:
> MPICH-1 is an old implementation and no longer actively supported. Can you
> try using MPICH2 instead?

... and configure it with the *nemesis* communication device.
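
A minimal sketch of what that looks like when building MPICH2 from
source, assuming a release where nemesis is not already the default
channel. The install prefix is only a placeholder, and it is worth
checking ./configure --help in your tarball for the exact options.

```shell
# Run from the top of the unpacked MPICH2 source tree.
# --with-device=ch3:nemesis selects the nemesis channel of the ch3 device.
# /opt/mpich2-nemesis is a hypothetical install prefix; pick your own.
./configure --with-device=ch3:nemesis --prefix=/opt/mpich2-nemesis
make
make install
```

After that, rebuild the application with the mpicc from the new prefix
so it does not pick up the old MPICH-1 headers and libraries.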

BTW, if you are seeing random "p4" errors with mpich-1,
see this thread:

http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2

My two cents,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> 
> Rajeev 
> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
>> Saurabh Tendulkar
>> Sent: Monday, March 30, 2009 3:14 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] Hang inside MPI_Waitall with x86_64
>>
>>
>> Hi,
>> I have some code that often (but not always) hangs at very 
>> similar locations inside MPI_Waitall. This happens *only* on 
>> 64-bit linux (x86_64, redhat el5, gcc 4.1) and as far as I 
>> can tell only with optimized code (-O2 for the app; mpich 
>> itself was built with default settings). I've tried MPICH 
>> 1.2.3 and the latest 1.2.7p1.
>>
>> This is a 3-process run. The stack traces of the 3 processes 
>> (A, B, C) are as follows (these are rank independent - even 
>> with the same mpirun settings).
>>
>> A: 
>> #0  __select_nocancel () from /lib64/libc.so.6
>> #1  net_recv ()
>> #2  socket_recv_on_fd ()
>> #3  socket_recv ()
>> #4  net_send_w ()
>> #5  net_send ()
>> #6  net_send2 ()
>> #7  socket_send ()
>> #8  send_message ()
>> #9  MPID_CH_Rndvb_ack ()
>> #10 MPID_CH_Check_incoming ()
>> #11 MPID_DeviceCheck ()
>> #12 MPID_WaitForCompleteSend ()
>> #13 MPID_SendComplete ()
>> #14 PMPI_Waitall ()
>>
>> B:
>> #0  __select_nocancel () from /lib64/libc.so.6
>> #1  p4_sockets_ready ()
>> #2  net_send_w ()
>> #3  net_send ()
>> #4  net_send2 ()
>> #5  socket_send ()
>> #6  send_message ()
>> #7  MPID_CH_Rndvb_ack ()
>> #8  MPID_CH_Check_incoming ()
>> #9  MPID_DeviceCheck ()
>> #10 MPID_WaitForCompleteSend ()
>> #11 MPID_SendComplete ()
>> #12 PMPI_Waitall ()
>> Note: #0 could instead be recv ()
>>
>> C:
>> #0  __write_nocancel () from /lib64/libpthread.so.0
>> #1  net_send_w ()
>> #2  net_send ()
>> #3  net_send2 ()
>> #4  socket_send ()
>> #5  send_message ()
>> #6  MPID_CH_Rndvb_ack ()
>> #7  MPID_CH_Check_incoming ()
>> #8  MPID_DeviceCheck ()
>> #9  MPID_WaitForCompleteSend ()
>> #10 MPID_SendComplete ()
>> #11 PMPI_Waitall ()
>> Note: Instead of #0-#5 for C, there can be: (#6-#11 are the 
>> same as #4-#9 here)
>> #0  __select_nocancel () from /lib64/libc.so.6
>> #1  socket_recv ()
>> #2  recv_message ()
>> #3  p4_recv ()
>>
>> The MPI_Waitall is after an MPI_Irecv/MPI_Isend block 
>> exchanging data between the three processes. I have verified 
>> all counts of data etc. Note that this shows up only with 
>> 64-bit linux. It does not always happen, but when it does, 
>> it's with the stack traces as above.
>>
>> I am not at all familiar with MPICH internals, so I do not 
>> know what is going on here. Can anyone shed some light, and 
>> suggest what to look for in my code that might be causing 
>> these problems?
>>
>> Thank you.
>> saurabh


