[mpich-discuss] Hang inside MPI_Waitall with x86_64

Anthony Chan chan at mcs.anl.gov
Mon Mar 30 18:15:20 CDT 2009


Do you really need binary compatibility between MPICH-1 and MPICH2?
If you have access to your source code, you can recompile it with
MPICH2 (most MPI applications can switch between different MPI
implementations).
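
For example, a program that uses only the standard MPI API needs just a
rebuild against MPICH2. A minimal sketch of the idea follows; the install
prefixes in the comments are assumptions, not paths on any particular
system:

/*
 * hello_mpi.c - uses only standard MPI calls, so it builds unchanged
 * with either implementation's compiler wrapper, e.g.
 * (hypothetical install prefixes):
 *
 *   /opt/mpich-1.2.7p1/bin/mpicc -O2 hello_mpi.c -o hello_mpi
 *   /opt/mpich2/bin/mpicc        -O2 hello_mpi.c -o hello_mpi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}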

A.Chan

----- "Saurabh Tendulkar" <gillette206 at yahoo.com> wrote:

> Rajeev,
> That depends: is MPICH2 supposed to be binary compatible with MPICH-1?
> I tried googling for this but couldn't find an answer. Switching to
> MPICH2 with binary compatibility would be difficult enough; without it,
> it would be impossible.
> 
> saurabh
> 
> --- On Mon, 3/30/09, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> 
> > From: Rajeev Thakur <thakur at mcs.anl.gov>
> > Subject: RE: [mpich-discuss] Hang inside MPI_Waitall with x86_64
> > To: gillette206 at yahoo.com, mpich-discuss at mcs.anl.gov
> > Date: Monday, March 30, 2009, 4:22 PM
> > MPICH-1 is an old implementation and no longer actively
> > supported. Can you
> > try using MPICH2 instead?
> > 
> > Rajeev 
> > 
> > > -----Original Message-----
> > > From: mpich-discuss-bounces at mcs.anl.gov
> > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > > Saurabh Tendulkar
> > > Sent: Monday, March 30, 2009 3:14 PM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] Hang inside MPI_Waitall with x86_64
> > > 
> > > 
> > > Hi,
> > > I have some code that often (but not always) hangs at very
> > > similar locations inside MPI_Waitall. This happens *only* on
> > > 64-bit Linux (x86_64, Red Hat EL5, gcc 4.1), and as far as I
> > > can tell only with optimized code (-O2 for the app; MPICH
> > > itself was built with default settings). I've tried MPICH
> > > 1.2.3 and the latest 1.2.7p1.
> > > 
> > > This is a 3-process run. The stack traces of the 3 processes
> > > (A, B, C) are as follows (these are rank independent - even
> > > with the same mpirun settings).
> > > 
> > > A: 
> > > #0  __select_nocancel () from /lib64/libc.so.6
> > > #1  net_recv ()
> > > #2  socket_recv_on_fd ()
> > > #3  socket_recv ()
> > > #4  net_send_w ()
> > > #5  net_send ()
> > > #6  net_send2 ()
> > > #7  socket_send ()
> > > #8  send_message ()
> > > #9  MPID_CH_Rndvb_ack ()
> > > #10 MPID_CH_Check_incoming ()
> > > #11 MPID_DeviceCheck ()
> > > #12 MPID_WaitForCompleteSend ()
> > > #13 MPID_SendComplete ()
> > > #14 PMPI_Waitall ()
> > > 
> > > B:
> > > #0  __select_nocancel () from /lib64/libc.so.6
> > > #1  p4_sockets_ready ()
> > > #2  net_send_w ()
> > > #3  net_send ()
> > > #4  net_send2 ()
> > > #5  socket_send ()
> > > #6  send_message ()
> > > #7  MPID_CH_Rndvb_ack ()
> > > #8  MPID_CH_Check_incoming ()
> > > #9  MPID_DeviceCheck ()
> > > #10 MPID_WaitForCompleteSend ()
> > > #11 MPID_SendComplete ()
> > > #12 PMPI_Waitall ()
> > > Note: #0 could instead be recv ()
> > > 
> > > C:
> > > #0  __write_nocancel () from /lib64/libpthread.so.0
> > > #1  net_send_w ()
> > > #2  net_send ()
> > > #3  net_send2 ()
> > > #4  socket_send ()
> > > #5  send_message ()
> > > #6  MPID_CH_Rndvb_ack ()
> > > #7  MPID_CH_Check_incoming ()
> > > #8  MPID_DeviceCheck ()
> > > #9  MPID_WaitForCompleteSend ()
> > > #10 MPID_SendComplete ()
> > > #11 PMPI_Waitall ()
> > > Note: Instead of frames #0-#5 for C, there can be the four frames
> > > below (frames #6-#11 above then appear as #4-#9):
> > > #0  __select_nocancel () from /lib64/libc.so.6
> > > #1  socket_recv ()
> > > #2  recv_message ()
> > > #3  p4_recv ()
> > > 
> > > The MPI_Waitall is after an MPI_Irecv/MPI_Isend block
> > > exchanging data between the three processes. I have verified
> > > all the data counts, etc. Note that this shows up only with
> > > 64-bit Linux. It does not always happen, but when it does,
> > > it's with the stack traces as above.
> > > 
> > > I am not at all familiar with MPICH internals, so I do
> > not 
> > > know what is going on here. Can anyone shed some
> > light, and 
> > > suggest what to look for in my code that might be
> > causing 
> > > these problems?
> > > 
> > > Thank you.
> > > saurabh
> > > 

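For reference, here is a minimal sketch of the kind of exchange described
in the quoted message: an MPI_Irecv/MPI_Isend block among three ranks
followed by a single MPI_Waitall. The buffer names, counts, datatype, and
tag are assumptions for illustration, not the poster's actual code.

#include <mpi.h>
#include <stdlib.h>

#define COUNT 1024                 /* per-peer message size (arbitrary) */

int main(int argc, char **argv)
{
    int rank, size, peer, i, nreq;
    double *sendbuf, *recvbuf;
    MPI_Request *reqs;
    MPI_Status *stats;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* the report is a 3-process run */

    sendbuf = malloc(size * COUNT * sizeof(double));
    recvbuf = malloc(size * COUNT * sizeof(double));
    reqs    = malloc(2 * (size - 1) * sizeof(MPI_Request));
    stats   = malloc(2 * (size - 1) * sizeof(MPI_Status));

    for (i = 0; i < size * COUNT; i++)
        sendbuf[i] = (double) rank;

    /* Post a nonblocking receive and a nonblocking send for every other
     * rank; each peer gets its own slice of the buffers, so nothing is
     * reused before the requests complete. */
    nreq = 0;
    for (peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;
        MPI_Irecv(recvbuf + peer * COUNT, COUNT, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(sendbuf + peer * COUNT, COUNT, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }

    /* The call that hangs in the traces above: it blocks until every
     * posted send and receive has completed. */
    MPI_Waitall(nreq, reqs, stats);

    free(sendbuf); free(recvbuf); free(reqs); free(stats);
    MPI_Finalize();
    return 0;
}

Run with "mpirun -np 3" to match the 3-process setup described above.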
