[mpich-discuss] MPICH deadlock error
Rajeev Thakur
thakur at mcs.anl.gov
Mon Feb 13 15:45:27 CST 2012
This will happen if the master is also sending to itself, and calls MPI_Send(to itself) before MPI_Recv(from itself). You need to either use a nonblocking send or post a nonblocking receive before the blocking send.
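For example, here is a minimal sketch of the second option. It assumes the master is rank 0 and also sends itself a copy with tag 0; rbuf and request are illustrative local names, not names taken from the routine below:

      integer :: request, status(MPI_STATUS_SIZE), Ierr
      integer :: rbuf(Nbuf)

      if (Master) then
c        post the master's own receive first (nonblocking), into a
c        separate buffer so it does not alias the send buffer
         call MPI_IRECV(rbuf, Nbuf, MPI_INTEGER, 0, 0,
     &                  MPI_COMM_WORLD, request, Ierr)
c        the blocking self-send now has a matching receive outstanding
         call MPI_SEND(buf, Nbuf, MPI_INTEGER, 0, 0,
     &                 MPI_COMM_WORLD, Ierr)
c        complete the pre-posted receive
         call MPI_WAIT(request, status, Ierr)
      endif

The other option is to make the self-send nonblocking (MPI_ISEND), do the blocking receive, and then complete the send with MPI_WAIT.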
Rajeev
On Feb 13, 2012, at 3:28 PM, Sarika K wrote:
> Dear MPICH-discuss group:
>
> My work involves a Fortran code that uses MPICH for parallelization, but I have very limited experience with the details of the MPI implementation. (I have always treated the MPICH part of the code as a black box.)
>
> I am now working on porting the code across different machine configurations. The modeling code works fine on some machines/servers, but it generates random MPI deadlock errors when the simulations run on other machines/servers.
>
> The error message is below.
> "Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(174): MPI_Send(buf=0x7f4d9b375010, count=1, dtype=USER<vector>, dest=1, tag=10001, MPI_COMM_WORLD) failed
> MPID_Send(53): DEADLOCK: attempting to send a message to the local process without a prior matching receive"
>
> I searched this list and other resources for this error and strongly believe that there is a bug in the model's MPI code which remains dormant in some environments and only appears to work there because of the dependence on the internal buffering threshold.
>
> I am not sure if this is sufficient information, but attached below is a sample subroutine (one of many in the code) that generates the deadlock error.
>
> I would really appreciate any help/pointers from the group to fix this error in our code.
>
> Thanks in advance for your time and assistance,
> Sarika
>
> c-----------------------------------------------------------------------------------------------------------------------------
>       subroutine int_distrib1(iend)
> c-----------------------
> c     Master distributes another bunch of integers to Workers
> c-----------------------------------------------------------------------------------------------------------------------------
> c
>       use ParallelDataMap
>       use CommDataTypes
>       implicit none
>       include 'mpif.h'
> c
>       include 'aqmax.param'
>       include 'aqindx.cmm'
> c
>       integer :: iend
>       integer, parameter :: Nbuf=35
>       integer :: i, j, k, buf(Nbuf), Ierr, status(MPI_STATUS_SIZE)
> c
>       if (Master) then
> !        arguments
>          buf(1) = iend
> !        /aqspid/ in aqindx.cmm stuff
>          buf(2) = iair
>          buf(3) = ih2o
>          buf(4) = io2
>          buf(5) = ico
>          buf(6) = ino2
>          buf(7) = iho2
>          buf(8) = iso2
>          buf(9) = io3
>          buf(10)= ich4
>          buf(11)= ico2
>          buf(12)= ih2
>          buf(13)= in2
>          buf(14)= itrace
>          k=15
>          buf(k:k+9) = ispg_idx(1:10); k=k+10
>          buf(k:k+9) = ispl_idx(1:10); k=k+10
>
>          do i=1,Nworkers
>             call MPI_SEND(buf, Nbuf, MPI_INTEGER,
>      &                    i, i, MPI_COMM_WORLD, Ierr)
>          enddo
>          print*, ''
>          print*, 'done sending int_distrib1'
>          print*, ''
>       endif    ! (Master)
> c
> c
>       if (Worker) then
>          call MPI_RECV(buf, Nbuf, MPI_INTEGER, 0, MyId,
>      &                 MPI_COMM_WORLD, status, ierr)
>          iend  = buf(1)
> !        /aqspid/ in aqindx.cmm stuff
>          iair  = buf(2)
>          ih2o  = buf(3)
>          io2   = buf(4)
>          ico   = buf(5)
>          ino2  = buf(6)
>          iho2  = buf(7)
>          iso2  = buf(8)
>          io3   = buf(9)
>          ich4  = buf(10)
>          ico2  = buf(11)
>          ih2   = buf(12)
>          in2   = buf(13)
>          itrace= buf(14)
>          k=15
>          ispg_idx(1:10) = buf(k:k+9); k=k+10
>          ispl_idx(1:10) = buf(k:k+9); k=k+10
>          print*, ''
>          print*, 'done receiving int_distrib1'
>          print*, ''
>       endif    ! (Worker)
> c
>       end subroutine int_distrib1
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss