[mpich-discuss] Follow up: MPICH deadlock error

Sarika K sarikauniv at gmail.com
Fri May 11 18:01:04 CDT 2012


Thanks, Gus and Rajeev, for your feedback on my MPICH deadlock error post
earlier in February.
Our model code now works successfully with the latest stable mpich2 release
(mpich2 1.4.1p1) after modifying the MPI calls as per your suggestions.

The code uses a Master/Slave MPI implementation and was originally set up
for a cluster of single-core nodes controlled by a head node.

The code also works well on a single node with multiple cores.

Now I am testing the code on multiple nodes, each with multiple cores. The
nodes are set up for passwordless login via ssh. The run sometimes hangs
and takes longer to complete: the run on a single node with 12 cores is
faster than the corresponding run using 2 nodes with 12 cores each (24 in
total).
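
For reference, the two-node runs are launched roughly along these lines
(the node names and the executable name below are placeholders, not our
actual ones):

  $ cat machinefile
  node01:12
  node02:12
  $ mpiexec -f machinefile -n 24 ./model.exe

My understanding is that Hydra's mpiexec assigns ranks to the hosts in the
order they are listed, so rank 0 (the master) should land on node01; please
correct me if that assumption is wrong.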

I am trying to resolve this issue and am wondering if you have any feedback
on the following:
1. Does a Master/Slave implementation need any specific settings to work
across a multiple-node/multiple-core machine setup?
2. Is there any way to explicitly designate a core as the master with mpich2
and exclude it from computations? (A small placement check I plan to run is
sketched below.)
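
The placement check mentioned above is a stand-alone sketch (not part of
the model); it just reports which host each rank, the master included, ends
up on:

      program where_am_i
      implicit none
      include 'mpif.h'
      integer :: myid, nprocs, namelen, ierr
      character(len=MPI_MAX_PROCESSOR_NAME) :: pname
c
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     report the host name for every rank (rank 0 is the master)
      call MPI_GET_PROCESSOR_NAME(pname, namelen, ierr)
      print *, 'rank ', myid, ' of ', nprocs, ' runs on ',
     &         pname(1:namelen)
      call MPI_FINALIZE(ierr)
      end program where_am_i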

I would greatly appreciate any other pointers/suggestions about this issue.

Thanks,
Sarika



---------- Forwarded message ----------
From: Gustavo Correa <gus at ldeo.columbia.edu>
Date: Tue, Feb 14, 2012 at 10:04 AM
Subject: Re: [mpich-discuss] MPICH deadlock error
To: mpich-discuss at mcs.anl.gov



On Feb 13, 2012, at 8:05 PM, Sarika K wrote:

> Thanks, Gus. I appreciate your feedback.
>
> I looked through the MPI communication and driver code (written back in
> 2001); it does include calls to MPI_Bcast (a sample is attached below). I
> am not sure why MPI_Send/MPI_Recv is used in some places and MPI_Bcast in
> others. I have started to learn more about the details of the MPI calls
> only after encountering this deadlock error. Are there any particular
> cases/instances where an MPI_Send/MPI_Recv setup is preferred over
> MPI_Bcast, or vice versa?
>
> Best regards,
> Sarika

Hi Sarika

When one process sends the *same* data to all other processes in a
communicator, the situation begs for MPI_Bcast.
Point-to-point communication (send/recv) is preferred when the data
exchanged across different pairs of processes is different.
MPI_Bcast is only one of the MPI collective procedures.
There are several others with different goals [e.g. MPI_Scatter[v] and
MPI_Gather[v], for when parts of an array are distributed/collected by a
master process].
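
For instance, scattering equal-size slices of an array from the master to
all processes could look roughly like this [a self-contained sketch only;
the array names are made up, this is not code from your model]:

      program scatter_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: nloc = 5
      integer :: myid, nprocs, i, ierr
      real, allocatable :: global(:)
      real :: local(nloc)
c
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c
c     allocate the full array everywhere for simplicity; only the
c     root's copy is actually used as the send buffer
      allocate(global(nloc*nprocs))
      if (myid == 0) then
        do i = 1, nloc*nprocs
          global(i) = real(i)
        enddo
      endif
c
c     every rank, the root/master included, receives its own
c     nloc-element slice of 'global' in 'local'
      call MPI_SCATTER(global, nloc, MPI_REAL,
     &                 local,  nloc, MPI_REAL,
     &                 0, MPI_COMM_WORLD, ierr)
c
      print *, 'rank ', myid, ' got slice starting at ', local(1)
c
      deallocate(global)
      call MPI_FINALIZE(ierr)
      end program scatter_sketch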

Check these MPI tutorials:
http://www.citutor.org/browse.php
https://computing.llnl.gov/tutorials/mpi/
http://www.mcs.anl.gov/research/projects/mpi/tutorial/
http://www.mcs.anl.gov/research/projects/mpi/tutorial/gropp/talk.html
http://www.cs.usfca.edu/~peter/ppmpi/

I hope this helps,
Gus Correa

>
> (MPI_Bcast sample code)
>       Nbuf = ix+iy+iz+is+4
>       if (Master) then
>         k=0
>         buf(k+1:k+ix)=dx(1:ix)     ; k=k+ix
>         buf(k+1:k+iy)=dy(1:iy)     ; k=k+iy
>         buf(k+1:k+iz)=sigmaz(1:iz) ; k=k+iz
>         buf(k+1)=dht               ; k=k+1
>         buf(k+1)=baseh             ; k=k+1
>         buf(k+1:k+is)=rmw(1:is)    ; k=k+is
>         buf(k+1)=dt                ; k=k+1
>         buf(k+1)=ut                ; k=k+1
>         if (k.ne.Nbuf) then
>           print*, 'Error in real_distrib. Nbuf=',Nbuf,
>      &            '   needed=',k
>           stop
>         endif
>       endif
> c
>       call MPI_BCAST(buf(1),Nbuf,MPI_REAL,0,MPI_COMM_WORLD,Ierr)
> c
>       if (Worker) then
>         k=0
>         dx(1:ix)=buf(1:ix)         ; k=ix
>         dy(1:iy)=buf(k+1:k+iy)     ; k=k+iy
>         sigmaz(1:iz)=buf(k+1:k+iz) ; k=k+iz
>         dht=buf(k+1)               ; k=k+1
>         baseh=buf(k+1)             ; k=k+1
>         rmw(1:is)=buf(k+1:k+is)    ; k=k+is
>         dt=buf(k+1)                ; k=k+1
>         ut=buf(k+1)                ; k=k+1
>       endif
>
> On Mon, Feb 13, 2012 at 4:05 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
> Hi Sarika
>
> I think you may also need an MPI_Wait or MPI_Waitall after the MPI_Recv.
>
> However, your code seems to broadcast the same 'buf' from Master to all
> Workers, right?
> Have you tried using MPI_Bcast instead of MPI_Send & MPI_Recv?
> It is a collective call and is likely to perform better.
> Something like this:
>
> if (Master) then
>  ...load buf with data
> endif
>
> call MPI_Bcast(buf, ...)  ! every process calls it
>
> if (Worker) then
>  ... unload data from buf
> endif
>
> I hope this helps,
> Gus Correa
>
> On Feb 13, 2012, at 5:03 PM, Sarika K wrote:
>
> > Thanks, Rajeev, for the quick feedback. I really appreciate it. I have
> > used but never written/modified MPI code. I am assuming that I need to
> > use the nonblocking routine MPI_Isend within the if (Master) part of the
> > sample code. Is that right?
> >
> > Best regards,
> > Sarika
> >
> >
> > On Mon, Feb 13, 2012 at 1:45 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > This will happen if the master is also sending to itself, and calls
> > MPI_Send(to itself) before MPI_Recv(from itself). You need to either use a
> > nonblocking send or post a nonblocking receive before the blocking send.
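> >
> > Roughly like this, for the second option [a sketch only, with made-up
> > names like 'rbuf' and 'req'; not your actual routine]:
> >
> >       if (Master) then
> > !       post the master's own receive first, nonblocking and into a
> > !       separate buffer, so the blocking send to rank 0 below already
> > !       has a matching receive pending and cannot deadlock
> >         call MPI_IRECV(rbuf, Nbuf, MPI_INTEGER, 0, 0,
> >      &                 MPI_COMM_WORLD, req, Ierr)
> >         do i = 0, Nworkers
> >           call MPI_SEND(buf, Nbuf, MPI_INTEGER, i, i,
> >      &                  MPI_COMM_WORLD, Ierr)
> >         enddo
> >         call MPI_WAIT(req, status, Ierr)
> >       endif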
> >
> > Rajeev
> >
> >
> > On Feb 13, 2012, at 3:28 PM, Sarika K wrote:
> >
> > > Dear MPICH-discuss group:
> > >
> > > My work involves a Fortran code that uses MPICH for parallelization,
> > > but I have very limited experience with the details of the MPICH
> > > implementation. (I have always treated the MPICH part of the code as a
> > > black box.)
> > >
> > > I am now working on porting the code across different machine
> > > configurations. My modeling code works fine on some machines/servers.
> > > But it also generates random MPI deadlock errors when running the
> > > simulations across other machines/servers.
> > >
> > > The error message is below.
> > > "Fatal error in MPI_Send: Other MPI error, error stack:
> > > MPI_Send(174): MPI_Send(buf=0x7f4d9b375010, count=1,
> > > dtype=USER<vector>, dest=1, tag=10001, MPI_COMM_WORLD) failed
> > > MPID_Send(53): DEADLOCK: attempting to send a message to the local
> > > process without a prior matching receive"
> > >
> > > I searched this list and other resources for this error code, and I
> > > strongly believe there is a bug in the model's MPI code that remains
> > > dormant in some environments and appears to work fine in others only
> > > because of its dependence on the internal buffering threshold.
> > >
> > > I am not sure if this is sufficient information, but attached below is
> > > a sample subroutine (there are many inside the code) that generates the
> > > deadlock error.
> > >
> > > I would really appreciate any help/pointers from the group to fix
> > > this error in our code.
> > >
> > > Thanks in advance for your time and assistance,
> > > Sarika
> > >
> > > c-----------------------------------------------------------------------------------------------------------------------------
> > >       subroutine int_distrib1(iend)
> > > c-----------------------
> > > c  Master distributes another bunch of integers to Workers
> > >
> > > c-----------------------------------------------------------------------------------------------------------------------------
> > > c
> > >       use ParallelDataMap
> > >       use CommDataTypes
> > >       implicit none
> > >       include 'mpif.h'
> > > c
> > >       include 'aqmax.param'
> > >       include 'aqindx.cmm'
> > > c
> > >       integer :: iend
> > >       integer, parameter ::  Nbuf=35
> > >       integer ::  i, j, k, buf(Nbuf), Ierr, status(MPI_STATUS_SIZE)
> > > c
> > >       if (Master) then
> > > ! arguments
> > >     buf(1) = iend
> > > !  /aqspid/ in aqindx.cmm stuff
> > >     buf(2) = iair
> > >     buf(3) = ih2o
> > >     buf(4) = io2
> > >     buf(5) = ico
> > >     buf(6) = ino2
> > >     buf(7) = iho2
> > >     buf(8) = iso2
> > >     buf(9) = io3
> > >     buf(10)= ich4
> > >     buf(11)= ico2
> > >     buf(12)= ih2
> > >     buf(13)= in2
> > >     buf(14)= itrace
> > >     k=15
> > >     buf(k:k+9) = ispg_idx(1:10); k=k+10
> > >     buf(k:k+9) = ispl_idx(1:10); k=k+10
> > >
> > >     do i=1,Nworkers
> > >       call MPI_SEND(buf, Nbuf, MPI_INTEGER,
> > >      &         i, i,  MPI_COMM_WORLD, Ierr)
> > >
> > >     enddo
> > >     print*, ''
> > >     print*, 'done sending int_distrib1'
> > >     print*, ''
> > >       endif   !   (Master)
> > > c
> > > c
> > >       if (Worker) then
> > >         call MPI_RECV(buf, Nbuf, MPI_INTEGER, 0, MyId,
> > >      &                 MPI_COMM_WORLD, status, ierr)
> > >     iend  = buf(1)
> > > ! /aqspid/ in aqindx.cmm stuff
> > >     iair  = buf(2)
> > >     ih2o  = buf(3)
> > >     io2   = buf(4)
> > >     ico   = buf(5)
> > >     ino2  = buf(6)
> > >     iho2  = buf(7)
> > >     iso2  = buf(8)
> > >     io3   = buf(9)
> > >     ich4  = buf(10)
> > >     ico2  = buf(11)
> > >     ih2   = buf(12)
> > >     in2   = buf(13)
> > >     itrace= buf(14)
> > >     k=15
> > >     ispg_idx(1:10) = buf(k:k+9); k=k+10
> > >     ispl_idx(1:10) = buf(k:k+9); k=k+10
> > >     print*, ''
> > >     print*, 'done receiving int_distrib1'
> > >     print*, ''
> > >       endif  !    (Worker)
> > > c
> > >       end  subroutine int_distrib1
> > >
> > >
> > >

_______________________________________________
mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss