[mpich-discuss] Questions on fault tolerance implementations of MPICH2-1.3.1
Darius Buntinas
buntinas at mcs.anl.gov
Mon Dec 6 17:51:40 CST 2010
Hi Rui,
1> MPICH2 1.3.1 doesn't actually detect process failures. Rather, it detects and reacts to communication failures. This means that a process or node failure will only be detected by another process if that process tries to communicate with the failed one, and detection may happen only after a network timeout.
2> MPI supports user-defined error handlers. Here's a link to the relevant section of the MPI standard.
http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35
3> As I mentioned above, MPICH2 1.3.1 only detects communication errors. So it's possible for a process to call MPI_Recv to receive a message from a dead process and never see an error: because MPICH2 uses on-demand connections where the sender initiates the connection, the receiving process never attempts to communicate with the dead one, so it will never encounter a communication error and will hang.
We are working on integration with the FTB (Fault Tolerance Backplane; google CIFTS for more info) to catch node and process failure events, so that we can do a better job of detecting failures.
-d
On Dec 5, 2010, at 4:38 AM, 王睿 wrote:
> Hi,
>
> I have some questions on the fault tolerance support of MPICH2-1.3.1
>
> 1> Can the newest version of MPICH2 detect a process failure? If so, how do the other processes get notified (from a programmer's point of view)?
>
> 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can recovery work be done after a process failure?
>
> 3> If one process is killed, it does not affect the other processes' Send/Recv, but the MPI environment seems to wait for the dead process. How can I get the whole job to exit normally instead of using 'Ctrl+C'?
>
> Best Regards,
> --
> Rui Wang
> Institute of Computing Technology, CAS, Beijing, P.R.China