[mpich-discuss] Questions on fault tolerance implementations of MPICH2-1.3.1

Darius Buntinas buntinas at mcs.anl.gov
Thu Dec 9 14:31:38 CST 2010


Hi Rui,

I'm not sure how long the timeout is, but it might be on the order of several minutes (or even longer).

I am currently working on adding the FTB integration to detect process failures, but it's not complete yet.  I've created a ticket to track the progress on this.  You can add yourself to the CC list if you want to be notified of any changes.

    https://trac.mcs.anl.gov/projects/mpich2/ticket/1145

MVAPICH is typically a major release behind MPICH2, so they most likely don't have these features in their release.  You can inquire with them to see what type of fault tolerance they support.

Sorry I couldn't give you better news.

-d

On Dec 8, 2010, at 9:07 PM, 王睿 wrote:

> Hi, 
>  I did some tests. It seems that MPICH2-1.3.1 will hang a process if it tries to send a message to, or receive one from, a dead process. I wonder whether there is a version that returns an error code instead of hanging. If so, how can I get it? And how long is the network timeout mentioned above?
> 
> Also, are MVAPICH releases kept in step with MPICH2? We need InfiniBand support as well.
> 
> Note that our work is only for research use.
> 
> Best Regards,
> Rui 
> 
> 2010/12/7 Darius Buntinas <buntinas at mcs.anl.gov>
> Hi Rui,
> 
> 1> MPICH2 1.3.1 doesn't actually detect process failures.  Rather, it detects and reacts to communication failures.  This means that a process or node failure will only be detected by another process if that process tries to communicate with it, and detection may happen only after a network timeout.
> 
> 2> MPI supports user-defined error handlers.  Here's a link to the relevant section of the MPI standard.
>    http://www.mpi-forum.org/docs/mpi22-report/node35.htm#Node35
> 
> 3>  As I mentioned above, MPICH2 1.3.1 only detects communication errors.  So, it's possible that a process calls MPI_Recv to receive a message from a dead process, but because MPICH2 uses on-demand connections where the sender initiates the connection, that process will never encounter a communication error and will hang.
> 
> We are working on integration with the FTB (google CIFTS for more info) to catch node and process failure events to do a better job of detecting failures.
> 
> -d
> 
> 
> On Dec 5, 2010, at 4:38 AM, 王睿 wrote:
> 
> > Hi,
> >
> > I have some questions on the fault tolerance support of MPICH2-1.3.1
> >
> > 1> Can the newest version of MPICH2 detect a process failure? If so, how do the other processes get notified (from a programmer's point of view)?
> >
> > 2> Does MPICH2-1.3.1 support user-defined error handlers? If not, how can recovery work be done after a process failure?
> >
> > 3> If one process is killed, the other processes' Send/Recv calls are not affected, but the MPI environment seems to wait for the dead process. How can I get the whole job to exit normally instead of pressing Ctrl+C?
> >
> > Best Regards,
> > --
> > Rui Wang
> > Institute of Computing Technology, CAS,  Beijing, P.R.China
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> -- 
> Rui Wang
> Institute of Computing Technology, CAS,  Beijing, P.R.China