[mpich-discuss] Questions on fault tolerance implementations of MPICH2-1.3.1

王睿 wangraying at gmail.com
Wed Dec 8 21:07:13 CST 2010


Hi,
 I ran some tests. It seems that with MPICH2-1.3.1 a process will hang if it
tries to send a message to, or receive one from, a dead process. I wonder
whether there is a version that returns an error code instead of hanging.
How can I get it? And how long is the network timeout you mentioned?

Also, does MVAPICH stay in step with MPICH2 releases? We also need
InfiniBand support.

Note that our work is only for research use.

Best regards,
Rui

2010/12/7 Darius Buntinas <buntinas at mcs.anl.gov>

> Hi Rui,
>
> 1> MPICH2 1.3.1 doesn't actually detect process failures.  Rather it
> detects and reacts to communication failures.  This means that a process or
> node failure will only be detected by another process if it tries to
> communicate with it, and detection may only happen after a network timeout.
>
> 2> MPI supports user-defined error handlers.  Here's a link to the relevant
> section of the MPI standard.
>    http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35
>
> 3>  As I mentioned above, MPICH2 1.3.1 only detects communication errors.
>  So, it's possible that a process calls MPI_Recv to receive a message from a
> dead process, but because MPICH2 uses on-demand connections where the sender
> initiates the connection, that process will never encounter a communication
> error and will hang.
>
> We are working on integration with the FTB (google CIFTS for more info) to
> catch node and process failure events to do a better job of detecting
> failures.
>
> -d
>
>
> On Dec 5, 2010, at 4:38 AM, 王睿 wrote:
>
> > Hi,
> >
> > I have some questions on the fault tolerance support of MPICH2-1.3.1
> >
> > 1> Can the newest version of MPICH2 detect a process failure? If so, how
> > do the other processes get notified (from a programmer's point of view)?
> >
> > 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can
> > recovery work be done after a process failure?
> >
> > 3> If one process is killed, it does not affect the other processes'
> > Send/Recv, but the MPI environment seems to wait for the dead process. How
> > can the whole job be made to exit normally without using 'Ctrl+C'?
> >
> > Best Regards,
> > --
> > Rui Wang
> > Institute of Computing Technology, CAS,  Beijing, P.R.China
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
>



-- 
Rui Wang
Institute of Computing Technology, CAS,  Beijing, P.R.China