<div dir="ltr">Dear mpich-discuss,<div>I have a problem with the fault tolerance feature of the Hydra process manager in MPICH2.</div><div>The results are not consistent: sometimes the tests pass, and sometimes they stall.</div><div>If you run the command lines given below in a loop, the test stalls after some number of iterations.</div>

<div>Can you please help me with this problem? </div><div><br></div>


<div>There are 3 tests. All 3 tests use the same model: one master with a number of slaves. All communication is point-to-point.</div><div><b><br></b></div><div><b>The slave algorithm is the same for all 3 tests.</b></div><div>for N times:</div>

<div>    MPI_Send integer to master.</div><div>    if iteration == fail iteration (parameter) &amp;&amp; rank == fail_rank</div><div>        cause divide by zero exception. (A = 5.0; B = 0.0;  C = A / B;)</div><div>MPI_Recv(master)</div><div><b><br>

</b></div><div><b>Master algorithm Test1 (mpi_send_rcv_waitany.cpp):</b></div><div><ul><li>For each slave, call MPI_Irecv</li><li>while N messages have not yet been received from every slave:</li><li>      MPI_Waitany(slaveIdx)</li><li>      if slaveIdx is alive</li><li>         MPI_Irecv(slaveIdx)</li><li>      else</li><li>         mark it as finished.</li><li>MPI_Send to all slaves.</li></ul><div><br></div></div><div><div><b>Master algorithm Test2 (mpi_send_sync.cpp):</b></div>

<div><ul><li>slave = first slave</li><li>while N messages have not yet been received from every slave:</li><li>      MPI_Recv(slave)</li><li>      if slave is alive</li><li>         pass to the next live slave</li><li>      else</li><li>         mark it as finished.</li><li>MPI_Send to all slaves.</li></ul></div></div><div><br></div><div><b>Master algorithm Test3 (mpi_send_async.cpp):</b></div><div>Same as Test2, but I use MPI_Irecv + MPI_Wait instead of MPI_Recv.</div>
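<div>The Test2 master loop described above can be sketched without MPI roughly as follows. This is only an illustration under stated assumptions, not the code of mpi_send_sync.cpp: RecvStatus, RecvFn and run_master are hypothetical names, and the callback stands in for MPI_Recv plus whatever check the real test uses to decide that a slave is dead.</div>

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Result of one receive attempt from a slave (hypothetical type for this sketch).
enum class RecvStatus { Ok, PeerFailed };

// Hypothetical stand-in for a wrapper around MPI_Recv: returns Ok and fills
// `value` on success, or PeerFailed when the slave is considered dead.
using RecvFn = std::function<RecvStatus(int slave, int& value)>;

// Round-robin master loop in the style of Test2: collect `n` messages from
// every slave and mark a slave as finished as soon as a receive fails.
// Returns the number of messages received per slave; index 0 is unused
// because rank 0 is assumed to be the master.
std::vector<int> run_master(int num_slaves, int n, const RecvFn& recv_from) {
    assert(num_slaves > 0 && n > 0);
    std::vector<int> received(num_slaves + 1, 0);
    std::vector<bool> finished(num_slaves + 1, false);
    int remaining = num_slaves;  // slaves that are neither done nor failed
    int slave = 1;               // first slave
    while (remaining > 0) {
        if (!finished[slave]) {
            int value = 0;
            if (recv_from(slave, value) == RecvStatus::Ok) {
                if (++received[slave] == n) {  // got all N messages
                    finished[slave] = true;
                    --remaining;
                }
            } else {                           // slave failed: stop waiting for it
                finished[slave] = true;
                --remaining;
            }
        }
        slave = slave % num_slaves + 1;        // pass to the next slave
    }
    return received;
}
```

<div>For example, with 4 slaves, N = 100 and a callback that reports failure for slave 3 once it has delivered 10 messages, run_master returns 100 for slaves 1, 2 and 4 and 10 for slave 3. The loop terminates only because the callback returns for a dead slave; a receive that never returns would stall it.</div>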

<div><br></div><div>When the test stalls, I attach a debugger to the master process.</div><div>The process is stalled in MPI_Recv or MPI_Irecv.</div><div>I think the stall is caused by the following sequence:</div><div><ul><li>The master receives an integer from the slave.</li><li>The master tests the slave: it is OK.</li><li>The slave fails.</li><li>The master tries to perform MPI_Irecv or MPI_Recv on the failed slave.</li></ul><div>The problem happens both on a cluster (student_machines.txt) and on a single machine (machine_student1.txt).</div>
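<div>The suspected sequence above is a gap between testing the slave and posting the receive. A toy model of that gap, with no MPI involved, might look like this. SlaveState, Outcome and master_step are hypothetical names used only for this sketch, and the model says nothing about how MPICH2 itself behaves.</div>

```cpp
#include <cassert>
#include <functional>

// Toy model of one slave as seen by the master (hypothetical, no MPI involved).
struct SlaveState {
    bool alive = true;  // what a liveness test would report
    int pending = 0;    // messages already sent but not yet received
};

enum class Outcome { Received, SkippedDead, WaitsForLiveSlave, BlocksOnDeadSlave };

// One master step: test the slave, then post a blocking receive.
// `gap` runs between the test and the receive and models whatever the
// slave does in that window (for example, crashing).
Outcome master_step(SlaveState& s, const std::function<void(SlaveState&)>& gap) {
    assert(s.pending >= 0);
    if (!s.alive) return Outcome::SkippedDead;  // test says dead: no receive posted
    gap(s);                                     // the slave may fail right here
    if (s.pending > 0) {                        // a message is already in flight
        --s.pending;
        return Outcome::Received;
    }
    // Nothing in flight: the receive can only complete if the slave still sends.
    return s.alive ? Outcome::WaitsForLiveSlave : Outcome::BlocksOnDeadSlave;
}
```

<div>In this model, a slave that passes the test and then dies inside the gap with no message in flight gives BlocksOnDeadSlave, which corresponds to the stall seen in the debugger.</div>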

<div><br></div><div>Execution lines:</div></div><div><ul><li>/space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f machine_student1.txt  -n 8  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_</li>

<li>/space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f student_machines.txt  -n 12  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_</li></ul></div><div>The test performs 100000 iterations between the master and each slave.</div>

<div>1000000 - scale number used to distinguish between the sequences of integers exchanged between the master and each slave.</div><div>3 - rank of the process that is made to fail (fail_rank).</div><div>10 - fail iteration. On iteration 10, the process with rank 3 causes a divide by zero exception.</div>
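<div>A note on the fault injection, as a minimal sketch: it assumes that A, B and C are doubles, as the literals 5.0 and 0.0 suggest, and should_fail and divide_like_slave are hypothetical names. On typical IEEE 754 platforms with the default floating-point environment, such a division yields infinity and does not by itself terminate the process, so the real test presumably enables floating-point traps or fails in some other way that is not shown in this message.</div>

```cpp
#include <cassert>
#include <cmath>

// True when this rank is due to fail on this iteration; mirrors the
// fail_rank and fail iteration command-line parameters.
bool should_fail(int rank, int iteration, int fail_rank, int fail_iteration) {
    assert(rank >= 0 && iteration >= 0);
    return rank == fail_rank && iteration == fail_iteration;
}

// Performs the division from the slave algorithm at run time.
double divide_like_slave() {
    volatile double a = 5.0;  // volatile keeps the compiler from folding the division
    volatile double b = 0.0;
    return a / b;             // IEEE 754 default: +infinity, no trap
}
```

<div>With the parameters from the command lines above, should_fail(3, 10, 3, 10) is true, and std::isinf(divide_like_slave()) is expected to hold on such platforms.</div>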

<div>1 logs/mpi_rcv_waitany_it_9/res_     defines the log file.</div></div>