Dear MPICH-discuss group:

My work involves a Fortran code that uses MPICH for parallelization, but I have very limited experience with the details of the MPI implementation (I have always treated the MPICH part of the code as a black box).

I am now porting the code across different machine configurations. The modeling code works fine on some machines/servers, but it generates seemingly random MPI deadlock errors when running the simulations on other machines/servers.

The error message is below.

"Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174): MPI_Send(buf=0x7f4d9b375010, count=1, dtype=USER<vector>, dest=1, tag=10001, MPI_COMM_WORLD) failed
MPID_Send(53): DEADLOCK: attempting to send a message to the local process without a prior matching receive"

I searched this list and other resources for this error and strongly suspect a bug in our model's MPI code (rather than in MPICH itself): an unsafe send/receive pattern that stays dormant in some environments because the messages happen to fit within the implementation's internal buffering (eager) threshold, and surfaces as a deadlock in others.
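
To make sure I understand the failure mode, here is a minimal, self-contained sketch (hypothetical demo code, not taken from our model) of the pattern I believe is at play: a blocking MPI_SEND addressed to the sender's own rank, posted before the matching receive, so that completion depends entirely on internal buffering.

      program self_send_demo
c     Hypothetical demo, not from our model: rank 0 posts a blocking
c     MPI_SEND to itself before the matching receive. My understanding
c     is that small messages may be absorbed by internal buffering
c     and "work", while larger ones trigger the DEADLOCK error above.
      implicit none
      include 'mpif.h'
      integer, parameter :: N = 100000
      integer :: buf(N), myid, ierr, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      buf = 0
      if (myid .eq. 0) then
c        Blocking send to self; no receive has been posted yet.
         call MPI_SEND(buf, N, MPI_INTEGER, 0, 99,
     &                 MPI_COMM_WORLD, ierr)
c        Matching receive is posted only after the send returns.
         call MPI_RECV(buf, N, MPI_INTEGER, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end program self_send_demo

If that reading of the buffering-threshold dependence is wrong, please correct me.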

I am not sure if this is sufficient information, but below is a sample subroutine (one of many in the code) that generates the deadlock error.

I would really appreciate any help/pointers from the group to fix this error in our code.

Thanks in advance for your time and assistance,
Sarika

c-----------------------------------------------------------------------
      subroutine int_distrib1(iend)
c-----------------------------------------------------------------------
c     Master distributes another bunch of integers to Workers
c-----------------------------------------------------------------------
c
      use ParallelDataMap
      use CommDataTypes
      implicit none
      include 'mpif.h'
c
      include 'aqmax.param'
      include 'aqindx.cmm'
c
      integer :: iend
      integer, parameter :: Nbuf=35
      integer :: i, j, k, buf(Nbuf), Ierr, status(MPI_STATUS_SIZE)
c
      if (Master) then
         ! arguments
         buf(1) = iend
         ! /aqspid/ in aqindx.cmm stuff
         buf(2) = iair
         buf(3) = ih2o
         buf(4) = io2
         buf(5) = ico
         buf(6) = ino2
         buf(7) = iho2
         buf(8) = iso2
         buf(9) = io3
         buf(10)= ich4
         buf(11)= ico2
         buf(12)= ih2
         buf(13)= in2
         buf(14)= itrace
         k=15
         buf(k:k+9) = ispg_idx(1:10); k=k+10
         buf(k:k+9) = ispl_idx(1:10); k=k+10

         do i=1,Nworkers
            call MPI_SEND(buf, Nbuf, MPI_INTEGER,
     &                    i, i, MPI_COMM_WORLD, Ierr)
         enddo
         print*, ''
         print*, 'done sending int_distrib1'
         print*, ''
      endif ! (Master)
c
c
      if (Worker) then
         call MPI_RECV(buf, Nbuf, MPI_INTEGER, 0, MyId,
     &                 MPI_COMM_WORLD, status, ierr)
         iend = buf(1)
         ! /aqspid/ in aqindx.cmm stuff
         iair  = buf(2)
         ih2o  = buf(3)
         io2   = buf(4)
         ico   = buf(5)
         ino2  = buf(6)
         iho2  = buf(7)
         iso2  = buf(8)
         io3   = buf(9)
         ich4  = buf(10)
         ico2  = buf(11)
         ih2   = buf(12)
         in2   = buf(13)
         itrace= buf(14)
         k=15
         ispg_idx(1:10) = buf(k:k+9); k=k+10
         ispl_idx(1:10) = buf(k:k+9); k=k+10
         print*, ''
         print*, 'done receiving int_distrib1'
         print*, ''
      endif ! (Worker)
c
      end subroutine int_distrib1
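
Since every worker receives the same buffer, I have also been wondering whether a collective would be a safer pattern here than the loop of blocking sends. Below is a small, self-contained sketch of what I mean (demo names only, not our model code; it assumes all ranks, Master included, reach the call). Is this the recommended direction, or is there a better way to make the point-to-point version safe?

      program bcast_demo
c     Hypothetical sketch: rank 0 fills a small integer buffer and
c     all ranks, including rank 0, call MPI_BCAST. No point-to-point
c     matching is involved, so (as I understand it) there is no
c     dependence on internal buffering thresholds.
      implicit none
      include 'mpif.h'
      integer, parameter :: Nbuf = 35
      integer :: buf(Nbuf), myid, i, ierr
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      if (myid .eq. 0) then
         do i = 1, Nbuf
            buf(i) = i
         enddo
      endif
c     Collective call: every rank participates, root is rank 0.
      call MPI_BCAST(buf, Nbuf, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      print *, 'rank', myid, 'received buf(1) =', buf(1)
      call MPI_FINALIZE(ierr)
      end program bcast_demo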