Hi, <div><div> I ran some tests. It seems that MPICH2-1.3.1 hangs a process that tries to send a message to, or receive one from, a dead process. Is there a version that returns an error code instead of hanging? If so, how can I get it? Also, how long is the network timeout you mentioned? </div>
<div><br></div><div>Also, does MVAPICH keep in step with MPICH2 releases? We need InfiniBand support as well. </div><div><br></div><div>Note that our work is for research use only.</div><div><br></div><div>Best regards,</div>
<div>Rui </div><div><br></div><div><div class="gmail_quote">2010/12/7 Darius Buntinas <span dir="ltr"><<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Rui,<br>
<br>
1> MPICH2 1.3.1 doesn't actually detect process failures. Rather, it detects and reacts to communication failures. This means that a process or node failure will only be detected by another process if that process tries to communicate with the failed one, and detection may only happen after a network timeout.<br>
<br>
2> MPI supports user-defined error handlers. Here's a link to the relevant section of the MPI standard.<br>
<a href="http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35" target="_blank">http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35</a><br>
<br>
3> As I mentioned above, MPICH2 1.3.1 only detects communication errors. So, it's possible that a process calls MPI_Recv to receive a message from a dead process, but because MPICH2 uses on-demand connections where the sender initiates the connection, that process will never encounter a communication error and will hang.<br>
<br>
We are working on integration with the FTB (google CIFTS for more info) to catch node and process failure events to do a better job of detecting failures.<br>
<br>
-d<br>
<div><div></div><div class="h5"><br>
<br>
On Dec 5, 2010, at 4:38 AM, Rui Wang wrote:<br>
<br>
> Hi,<br>
><br>
> I have some questions on the fault tolerance support of MPICH2-1.3.1<br>
><br>
> 1> Can the newest version of MPICH2 detect a process failure? If so, how do the other processes get notified (from a programmer's point of view)?<br>
><br>
> 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can we do recovery work after a process failure?<br>
><br>
> 3> If one process is killed, the other processes' Send/Recv calls are not affected, but the MPI environment seems to wait for the dead process. How can we get the whole job to exit normally instead of pressing Ctrl+C?<br>
><br>
> Best Regards,<br>
> --<br>
> Rui Wang<br>
> Institute of Computing Technology, CAS, Beijing, P.R.China<br>
</div></div>> _______________________________________________<br>
> mpich-discuss mailing list<br>
> <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>
> <a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>Rui Wang<br>Institute of Computing Technology, CAS, Beijing, P.R.China<br>
</div></div>