[mpich-discuss] Hang inside MPI_Waitall with x86_64
Saurabh Tendulkar
gillette206 at yahoo.com
Mon Mar 30 17:19:20 CDT 2009
Rajeev,
That depends: is MPICH2 supposed to be binary compatible with MPICH-1? I tried googling for this but couldn't find an answer. Switching to MPICH2 with binary compatibility would be difficult enough; without it, it would be impossible.
saurabh
--- On Mon, 3/30/09, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> From: Rajeev Thakur <thakur at mcs.anl.gov>
> Subject: RE: [mpich-discuss] Hang inside MPI_Waitall with x86_64
> To: gillette206 at yahoo.com, mpich-discuss at mcs.anl.gov
> Date: Monday, March 30, 2009, 4:22 PM
> MPICH-1 is an old implementation and no longer actively supported.
> Can you try using MPICH2 instead?
>
> Rajeev
>
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > Saurabh Tendulkar
> > Sent: Monday, March 30, 2009 3:14 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] Hang inside MPI_Waitall with x86_64
> >
> >
> > Hi,
> > I have some code that often (but not always) hangs at very similar
> > locations inside MPI_Waitall. This happens *only* on 64-bit linux
> > (x86_64, redhat el5, gcc 4.1) and as far as I can tell only with
> > optimized code (-O2 for the app; mpich itself was built with default
> > settings). I've tried MPICH 1.2.3 and the latest 1.2.7p1.
> >
> > This is a 3-process run. The stack traces of the 3 processes (A, B,
> > C) are as follows (these are rank independent - even with the same
> > mpirun settings).
> >
> > A:
> > #0 __select_nocancel () from /lib64/libc.so.6
> > #1 net_recv ()
> > #2 socket_recv_on_fd ()
> > #3 socket_recv ()
> > #4 net_send_w ()
> > #5 net_send ()
> > #6 net_send2 ()
> > #7 socket_send ()
> > #8 send_message ()
> > #9 MPID_CH_Rndvb_ack ()
> > #10 MPID_CH_Check_incoming ()
> > #11 MPID_DeviceCheck ()
> > #12 MPID_WaitForCompleteSend ()
> > #13 MPID_SendComplete ()
> > #14 PMPI_Waitall ()
> >
> > B:
> > #0 __select_nocancel () from /lib64/libc.so.6
> > #1 p4_sockets_ready ()
> > #2 net_send_w ()
> > #3 net_send ()
> > #4 net_send2 ()
> > #5 socket_send ()
> > #6 send_message ()
> > #7 MPID_CH_Rndvb_ack ()
> > #8 MPID_CH_Check_incoming ()
> > #9 MPID_DeviceCheck ()
> > #10 MPID_WaitForCompleteSend ()
> > #11 MPID_SendComplete ()
> > #12 PMPI_Waitall ()
> > Note: #0 could instead be recv ()
> >
> > C:
> > #0 __write_nocancel () from /lib64/libpthread.so.0
> > #1 net_send_w ()
> > #2 net_send ()
> > #3 net_send2 ()
> > #4 socket_send ()
> > #5 send_message ()
> > #6 MPID_CH_Rndvb_ack ()
> > #7 MPID_CH_Check_incoming ()
> > #8 MPID_DeviceCheck ()
> > #9 MPID_WaitForCompleteSend ()
> > #10 MPID_SendComplete ()
> > #11 PMPI_Waitall ()
> > Note: Instead of #0-#5 for C, there can be: (#6-#11 are the same as
> > #4-#9 here)
> > #0 __select_nocancel () from /lib64/libc.so.6
> > #1 socket_recv ()
> > #2 recv_message ()
> > #3 p4_recv ()
> >
> > The MPI_Waitall is after an MPI_Irecv/MPI_Isend block exchanging data
> > between the three processes. I have verified all counts of data etc.
> > Note that this shows up only with 64-bit linux. It does not always
> > happen, but when it does, it's with the stack traces as above.
> >
> > I am not at all familiar with MPICH internals, so I do not know what
> > is going on here. Can anyone shed some light, and suggest what to
> > look for in my code that might be causing these problems?
> >
> > Thank you.
> > saurabh
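
In the traces quoted above, all three processes sit inside net_send_w underneath MPID_CH_Rndvb_ack, with frame #0 in select() or write(). One possible reading, offered as an assumption and not a confirmed diagnosis, is that each process is waiting on a socket that its peer is not draining at that moment. The Python sketch below is not MPICH or p4 code. It only illustrates the general socket behaviour that reading relies on: a sender can push a bounded number of bytes before the kernel buffers fill and a further write would block. The function name and chunk size are hypothetical.

```python
import socket


def fill_until_blocked(chunk_size=65536):
    """Write to one end of a socket pair while the peer never reads.

    Returns the number of bytes the kernel accepted before a further
    write would have blocked. With a blocking socket, the sender would
    simply stall at that point, much like a process parked in write().
    """
    sender, receiver = socket.socketpair()
    sender.setblocking(False)  # raise BlockingIOError instead of stalling
    chunk = b"x" * chunk_size
    sent = 0
    try:
        while True:
            try:
                # send() may accept only part of the chunk
                sent += sender.send(chunk)
            except BlockingIOError:
                # Kernel buffers are full and nobody is reading
                return sent
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("bytes accepted before the write would block:", fill_until_blocked())
```

The number printed depends on the operating system's socket buffer sizes, so it will differ between machines. Whether this mechanism is actually involved in the hang would have to be checked against the MPICH-1 p4 device source.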