[mpich-discuss] Questions on fault tolerance implementations of MPICH2-1.3.1

王睿 wangraying at gmail.com
Mon Dec 6 20:01:46 CST 2010


Thank you very much!

Best Regards

Rui

2010/12/7 Darius Buntinas <buntinas at mcs.anl.gov>

> Hi Rui,
>
> 1> MPICH2 1.3.1 doesn't actually detect process failures.  Rather, it
> detects and reacts to communication failures.  This means that a process or
> node failure will only be detected by another process if that process tries
> to communicate with it, and then possibly only after a network timeout.
>
> 2> MPI supports user-defined error handlers.  Here's a link to the relevant
> section of the MPI standard.
>    http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35
>
> 3>  As I mentioned above, MPICH2 1.3.1 only detects communication errors.
>  So, it's possible that a process calls MPI_Recv to receive a message from a
> dead process, but because MPICH2 uses on-demand connections where the sender
> initiates the connection, that process will never encounter a communication
> error and will hang.
>
> We are working on integration with the FTB (google CIFTS for more info) to
> catch node and process failure events to do a better job of detecting
> failures.
>
> -d
>
>
> On Dec 5, 2010, at 4:38 AM, 王睿 wrote:
>
> > Hi,
> >
> > I have some questions on the fault tolerance support of MPICH2-1.3.1.
> >
> > 1> Can the newest version of MPICH2 detect a process failure? If so, how
> > do the other processes get notified? (From a programmer's view)
> >
> > 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can
> > I do some recovery work after a process failure?
> >
> > 3> If one process is killed, it does not affect the other processes'
> > Send/Recv, but the MPI environment seems to wait for the dead process. How
> > can I get the whole job to exit normally instead of using 'Ctrl+C'?
> >
> > Best Regards,
> > --
> > Rui Wang
> > Institute of Computing Technology, CAS,  Beijing, P.R.China
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss



-- 
Rui Wang
Institute of Computing Technology, CAS,  Beijing, P.R.China