<div dir="ltr">Dear mpich-discuss,<div>I have a problem while using fault tolerance feature, in MPICH2 hydra process manager.</div><div>The results are not consistent, sometimes tests pass, sometimes stall.</div><div>If you executes command line written below in loop, after number of iterations, test stall.</div>
<div>Can you please help me with this problem? </div><div><br></div>
There are 3 tests. All 3 tests use the same model: one master with a number of slaves. All communication operations are point-to-point.

Slave algorithm (the same for all 3 tests, sketched below):

  for N times:
    MPI_Send an integer to the master.
    if iteration == IterI (parameter) && rank == fail_rank:
      cause a divide-by-zero exception (A = 5.0; B = 0.0; C = A / B;)
  MPI_Recv(master)
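Roughly, the slave loop looks like this (a simplified sketch, not the actual test source; master = rank 0, tag 0 and the value that is sent are simplifications):

#include <mpi.h>

void run_slave(int rank, int n_iters, int fail_iter, int fail_rank)
{
    for (int i = 0; i < n_iters; ++i) {
        int value = rank;                            // integer sent to the master
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);

        if (i == fail_iter && rank == fail_rank) {   // provoke the failure
            volatile double A = 5.0, B = 0.0;
            volatile double C = A / B;               // divide by zero, as in the tests
            (void)C;
        }
    }

    int reply;                                       // final MPI_Recv(master)
    MPI_Recv(&reply, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}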
Master algorithm, Test1 (mpi_send_rcv_waitany.cpp), sketched after the list:

- For each slave, call MPI_Irecv.
- While N messages have not been received from each slave:
  - MPI_Waitany(slaveIdx)
  - if slaveIdx is alive: MPI_Irecv(slaveIdx)
  - else: mark it as finished.
- MPI_Send to all slaves.
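A simplified sketch of that loop (not the real source; the tags, the master being rank 0, slaves being ranks 1..nslaves and the is_alive() stand-in are my simplifications):

#include <mpi.h>
#include <vector>

bool is_alive(int /*slave*/) { return true; }       // stand-in for the real liveness check

void run_master_waitany(int nslaves, int n_iters)
{
    std::vector<MPI_Request> reqs(nslaves);
    std::vector<int>         bufs(nslaves);
    std::vector<int>         received(nslaves, 0);
    int remaining = nslaves;

    for (int s = 0; s < nslaves; ++s)               // for each slave call MPI_Irecv
        MPI_Irecv(&bufs[s], 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD, &reqs[s]);

    while (remaining > 0) {                         // until N messages from every slave
        int slaveIdx;
        MPI_Waitany(nslaves, reqs.data(), &slaveIdx, MPI_STATUS_IGNORE);

        if (is_alive(slaveIdx) && ++received[slaveIdx] < n_iters)
            MPI_Irecv(&bufs[slaveIdx], 1, MPI_INT, slaveIdx + 1, 0,
                      MPI_COMM_WORLD, &reqs[slaveIdx]);   // re-arm the receive
        else
            --remaining;                            // mark it as finished
    }

    int go = 1;                                     // final MPI_Send to all slaves
    for (int s = 0; s < nslaves; ++s)
        MPI_Send(&go, 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD);
}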
Master algorithm, Test2 (mpi_send_sync.cpp):

- slave = first slave
- While N messages have not been received from each slave:
  - MPI_Recv(slave)
  - if slaveIdx is alive: pass to the next live slave
  - else: mark it as finished.
- MPI_Send to all slaves.

Master algorithm, Test3 (mpi_send_async.cpp):

Same as Test2, but instead of MPI_Recv I use MPI_Irecv + MPI_Wait. A sketch of the Test2 loop follows below.
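A simplified sketch of the Test2 loop (again not the real source; the round-robin order, tags and the is_alive() stand-in are my simplifications). Test3 would replace the MPI_Recv with MPI_Irecv followed by MPI_Wait:

#include <mpi.h>
#include <vector>

bool is_alive(int /*slave*/) { return true; }         // stand-in for the real liveness check

void run_master_sync(int nslaves, int n_iters)
{
    std::vector<int>  received(nslaves, 0);
    std::vector<bool> done(nslaves, false);
    int remaining = nslaves;
    int slave = 0;                                    // slave = first slave

    while (remaining > 0) {                           // until N messages from every slave
        int value;
        MPI_Recv(&value, 1, MPI_INT, slave + 1, 0,    // slaves are ranks 1..nslaves here
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (!is_alive(slave) || ++received[slave] == n_iters) {
            done[slave] = true;                       // mark it as finished
            --remaining;
        }

        do {                                          // pass to the next slave that is not done
            slave = (slave + 1) % nslaves;
        } while (remaining > 0 && done[slave]);
    }

    int go = 1;                                       // final MPI_Send to all slaves
    for (int s = 0; s < nslaves; ++s)
        MPI_Send(&go, 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD);
}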
When a test stalls, I connect a debugger to the master process. The process is stalled in MPI_Recv or MPI_Irecv. I think the stall is caused by the following sequence (illustrated below):

- The master receives an integer from a slave.
- It tests the slave - it is OK.
- The slave fails.
- The master tries to perform MPI_Irecv or MPI_Recv on the failed slave.

The problem happens both on the cluster (student_machines.txt) and on a single machine (machine_student1.txt).
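In other words, something like this (only an illustration, not code from the tests; is_alive() is a hypothetical stand-in for however liveness is tracked):

#include <mpi.h>

bool is_alive(int /*slave*/) { return true; }   // stand-in for the real liveness check

void receive_next(int slave)
{
    if (is_alive(slave)) {                      // the slave still looks alive here ...
        // ... the slave hits its divide-by-zero right now ...
        int value;
        MPI_Request req;
        MPI_Irecv(&value, 1, MPI_INT, slave, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);      // this (or a plain MPI_Recv) is where
                                                // the master is stuck in the debugger
    }
}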
Execution lines:

- /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup -f machine_student1.txt -n 8 -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
- /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup -f student_machines.txt -n 12 -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_

The arguments are:
- 100000 - number of iterations the master performs with each slave.
- 1000000 - scale number used to distinguish between the sequences of integers exchanged between the master and each slave.
- 3 - rank of the process that causes the failure (fail_rank).
- 10 - fail iteration; on iteration 10 the process with rank 3 causes the divide-by-zero exception.
- 1 logs/mpi_rcv_waitany_it_9/res_ - defines the log file.