<div dir="ltr"><div class="gmail_quote"><div dir="ltr"><div>Dear MPICH2 team,</div><div><br></div><div>I currently use <b>mpich2</b>.</div><div><br></div><div><b>I have a problem with the fault tolerance feature:</b></div><div>My program runs as a master/slave application.</div>
<div>The master uses asynchronous MPI_Irecv operations and polls them with MPI_Test. When one of the slaves fails, the whole application fails.</div><div><br></div><div><b>Question:</b></div><div>What should I do to keep the application alive when a slave fails?</div>
<div><br></div><div><b>Mpich2 configuration</b>:</div><div><p class="MsoNormal"><span style="font-family:Arial,sans-serif;font-size:13px"> $ ./configure --with-ftb=/space/local --prefix=/space/local</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:Arial,sans-serif"><b><br></b></span></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:Arial,sans-serif"><b>Execution command:</b></span></p>
<p class="MsoNormal"><span style="font-family:Arial,sans-serif">mpiexec.hydra -genvall -f machines_student.txt -n 3 -launcher=rsh mpi_send_rec_testany 1000 10000 2 20 1 logs/res_test</span></p><p class="MsoNormal">
<font face="Arial, sans-serif"></font></p><div><font face="Arial, sans-serif"><br></font></div><div><font face="Arial, sans-serif"><b>Error handler: </b>MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);</font></div>
<p class="MsoNormal"><b style="font-family:Arial,sans-serif;font-size:13px"><br></b></p><p class="MsoNormal"><font face="Arial, sans-serif"><b>Output on the screen:</b></font></p><p class="MsoNormal">
<font face="Arial, sans-serif"></font></p><p class="MsoNormal"><font face="Arial, sans-serif">=====================================================================================</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES</font></p><p class="MsoNormal"><font face="Arial, sans-serif">= EXIT CODE: 136</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">= CLEANING UP REMAINING PROCESSES</font></p><p class="MsoNormal"><font face="Arial, sans-serif">= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">=====================================================================================</font></p><p class="MsoNormal"><font face="Arial, sans-serif">[proxy:0:0@student1-ib0] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">[proxy:0:0@student1-ib0] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status</font></p><p class="MsoNormal">
<font face="Arial, sans-serif">[proxy:0:0@student1-ib0] main (./pm/pmiserv/pmip.c:225): demux engine error waiting for event</font></p><p class="MsoNormal"><font face="Arial, sans-serif">[mpiexec@student1-ib0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">[mpiexec@student1-ib0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">[mpiexec@student1-ib0] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion</font></p>
<p class="MsoNormal"><font face="Arial, sans-serif">[mpiexec@student1-ib0] main (./ui/mpich/mpiexec.c:420): process manager error waiting for completion</font></p><div style="font-weight:bold"><font face="Arial, sans-serif"><br>
</font></div><div style="font-weight:bold"><font face="Arial, sans-serif"><br></font></div><div style="font-weight:bold"><font face="Arial, sans-serif">Remark:</font></div>
<div><font face="Arial, sans-serif">When all slaves are alive, the application works fine.</font></div><p></p></div></div></div></div>