[mpich-discuss] MPICH deadlock error

Sarika K sarikauniv at gmail.com
Mon Feb 13 15:28:03 CST 2012


Dear MPICH-discuss group:

My work involves a Fortran code that uses MPICH for parallelization, but I
have very limited experience with the details of the MPI implementation (I
have always treated the MPI part of the code as a black box).

I am now porting the code to different machine configurations. The
modeling code works fine on some machines/servers, but it generates
seemingly random MPI deadlock errors when running the simulations on
other machines/servers.

The error message is below.
"Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174): MPI_Send(buf=0x7f4d9b375010, count=1, dtype=USER<vector>,
dest=1, tag=10001, MPI_COMM_WORLD) failed
MPID_Send(53): DEADLOCK: attempting to send a message to the local process
without a prior matching receive"

I searched this list and other resources for this error and strongly
suspect a bug in the model's MPI code: it stays dormant in some
environments only because the sends happen to fit under the MPI
implementation's internal buffering threshold, and surfaces on other
machines.
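
If it helps the discussion, one test I intend to try (a sketch only, using
the same arguments as the send loop in the subroutine attached below) is to
replace MPI_SEND with MPI_SSEND while debugging. A synchronous send cannot
complete until the matching receive has been posted, so any hidden reliance
on the buffering threshold should then fail reproducibly on every machine
rather than only on some of them:

c     sketch only -- same arguments as the MPI_SEND in int_distrib1
      do i = 1, Nworkers
        call MPI_SSEND(buf, Nbuf, MPI_INTEGER,
     &                 i, i, MPI_COMM_WORLD, Ierr)
      enddo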

I am not sure whether this is sufficient information, but attached below is
a sample subroutine (one of many in the code) that generates the deadlock
error.

I would really appreciate any help/pointers from the group to fix this
error in our code.

Thanks in advance for your time and assistance,
Sarika

c-----------------------------------------------------------------------------------------------------------------------------
      subroutine int_distrib1(iend)
c-----------------------
c  Master distributes another bunch of integers to Workers
c-----------------------------------------------------------------------------------------------------------------------------
c
      use ParallelDataMap
      use CommDataTypes
      implicit none
      include 'mpif.h'
c
      include 'aqmax.param'
      include 'aqindx.cmm'
c
      integer :: iend
      integer, parameter ::  Nbuf=35
      integer ::  i, j, k, buf(Nbuf), Ierr, status(MPI_STATUS_SIZE)
c
      if (Master) then
! arguments
        buf(1) = iend
!  /aqspid/ in aqindx.cmm stuff
        buf(2) = iair
        buf(3) = ih2o
        buf(4) = io2
        buf(5) = ico
        buf(6) = ino2
        buf(7) = iho2
        buf(8) = iso2
        buf(9) = io3
        buf(10)= ich4
        buf(11)= ico2
        buf(12)= ih2
        buf(13)= in2
        buf(14)= itrace
        k=15
        buf(k:k+9) = ispg_idx(1:10); k=k+10
        buf(k:k+9) = ispl_idx(1:10); k=k+10
! send the packed buffer to each worker (tag = destination rank)
        do i=1,Nworkers
          call MPI_SEND(buf, Nbuf, MPI_INTEGER,
     &                  i, i, MPI_COMM_WORLD, Ierr)
        enddo
        print*, ''
        print*, 'done sending int_distrib1'
        print*, ''
      endif   !   (Master)
c
c
      if (Worker) then
! receive the packed buffer from the Master (rank 0, tag = MyId)
        call MPI_RECV(buf, Nbuf, MPI_INTEGER, 0, MyId,
     &                MPI_COMM_WORLD, status, ierr)
        iend  = buf(1)
! /aqspid/ in aqindx.cmm stuff
        iair  = buf(2)
        ih2o  = buf(3)
        io2   = buf(4)
        ico   = buf(5)
        ino2  = buf(6)
        iho2  = buf(7)
        iso2  = buf(8)
        io3   = buf(9)
        ich4  = buf(10)
        ico2  = buf(11)
        ih2   = buf(12)
        in2   = buf(13)
        itrace= buf(14)
        k=15
        ispg_idx(1:10) = buf(k:k+9); k=k+10
        ispl_idx(1:10) = buf(k:k+9); k=k+10
        print*, ''
        print*, 'done receiving int_distrib1'
        print*, ''
      endif  !    (Worker)
c
      end  subroutine int_distrib1
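
P.S. Since every worker receives the same Nbuf integers from the Master, I
have also wondered whether this exchange could be written as a single
collective instead of matched sends and receives. A rough sketch (assuming
Master is rank 0, as implied by the MPI_RECV above):

c     sketch only -- all ranks, Master and Workers alike, make the same
c     MPI_BCAST call with root 0, so there is no point-to-point matching
c     (and no possible self-send) to get wrong
      if (Master) then
c       ... pack buf(1:Nbuf) exactly as above ...
      endif
      call MPI_BCAST(buf, Nbuf, MPI_INTEGER, 0, MPI_COMM_WORLD, Ierr)
      if (Worker) then
c       ... unpack buf(1:Nbuf) exactly as above ...
      endif

That said, I would still like to understand why the current point-to-point
version deadlocks on some machines only.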