[mpich-discuss] Questions on fault tolerance implementations of MPICH2-1.3.1

Mon Dec 6 17:51:40 CST 2010

Hi Rui,

1> MPICH2 1.3.1 doesn't actually detect process failures.  Rather, it detects and reacts to communication failures.  This means that a process or node failure will only be detected by another process when that process tries to communicate with the failed one, and even then detection may only happen after a network timeout.

2> MPI supports user-defined error handlers.  Here's a link to the relevant section of the MPI standard.
    http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35

3> As I mentioned above, MPICH2 1.3.1 only detects communication errors.  So it's possible for a process to call MPI_Recv to receive a message from a dead process and hang: MPICH2 uses on-demand connections where the sender initiates the connection, so the receiving process never attempts to connect and never encounters a communication error.

We are working on integration with the FTB, the Fault Tolerance Backplane (search for CIFTS for more info), so that MPICH2 can catch node and process failure events and do a better job of detecting failures.

-d

On Dec 5, 2010, at 4:38 AM, 王睿 wrote:

> Hi, 
> 
> I have some questions about the fault tolerance support in MPICH2-1.3.1.
> 
> 1> Can the newest version of MPICH2 detect a process failure? If so, how do the other processes get notified (from a programmer's point of view)?
> 
> 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can recovery work be done after a process failure?
> 
> 3> If one process is killed, the other processes' Send/Recv calls are not affected, but the MPI environment seems to wait for the dead process. How can the whole job be made to exit normally instead of using 'Ctrl+C'?
> 
> Best Regards,
> -- 
> Rui Wang
> Institute of Computing Technology, CAS,  Beijing, P.R.China
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss