[mpich-discuss] communication involves a dead process

Mon Dec 6 18:58:58 CST 2010

This is something that the MPI Forum is working on.  There are some tricky issues with automatically "fixing up" communicators when a process fails.

One issue is how to detect, globally, that a process failed.  It's possible that when the processes of a communicator call MPI_Barrier, not all of them know that a process has failed.  The ones that know would run the barrier over the survivors only, while the rest still expect the failed process to take part, so you end up with some processes using one communication pattern and some using another.

The Forum is looking at adding a "validate communicator" call to notify all processes which processes have failed in the communicator.  Here's a link to the proposal:
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
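
To make the agreement problem concrete, here is a toy, single-process simulation in plain C.  It is not MPI code and it is not the API from the proposal: NPROCS, views_t, views_agree and toy_validate are hypothetical names made up for this sketch.  Each rank keeps a bitmask of the ranks it believes have failed, and the "validate" step simply ORs the live ranks' masks together and hands everyone the same result.  A real implementation would have to do this with messages, and would have to cope with more failures happening during the call, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

#define NPROCS 4 /* hypothetical communicator size */

/* suspected[r] is rank r's local view: bit i set means
 * "rank r believes rank i has failed". */
typedef struct {
    uint32_t suspected[NPROCS];
} views_t;

/* Nonzero if rank r is marked in the mask of truly failed ranks. */
static int is_dead(uint32_t dead, int r)
{
    return (int)((dead >> r) & 1u);
}

/* Return 1 if every live rank holds the same view, else 0. */
int views_agree(const views_t *v, uint32_t dead)
{
    int first = -1;
    for (int r = 0; r < NPROCS; r++) {
        if (is_dead(dead, r))
            continue;
        if (first < 0)
            first = r;
        else if (v->suspected[r] != v->suspected[first])
            return 0;
    }
    return 1;
}

/* Toy "validate": OR together the live ranks' views and give every
 * live rank the combined result.  Returns the agreed failure mask. */
uint32_t toy_validate(views_t *v, uint32_t dead)
{
    uint32_t agreed = 0;
    assert(v != 0);
    for (int r = 0; r < NPROCS; r++)
        if (!is_dead(dead, r))
            agreed |= v->suspected[r];
    for (int r = 0; r < NPROCS; r++)
        if (!is_dead(dead, r))
            v->suspected[r] = agreed;
    return agreed;
}
```

For example, if rank 3 has died and only rank 0 has noticed, views_agree reports a disagreement before toy_validate is called and agreement afterwards.  Only once the views match can all survivors safely pick the same communication pattern for the barrier.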

When a send or a (non-ANY_SOURCE) receive returns an error, you know which process failed: it's the one you named in the call.  When an ANY_SOURCE receive or a collective returns an error, I'm not sure how you could get the info on which process failed.  The current MPI standard does not address this.

Unfortunately, the communicator creation routines involve collective communication operations, and you can't do those on a communicator with a failed process.  Again, there are some things that the Forum is working on to address this.

I hope this clarifies some things for you.

-d

On Dec 6, 2010, at 1:33 AM, 王睿 wrote:

> Hi, all
> 
> I'm a student learning about fault tolerance in MPI, and I'm just getting started. I'm not very familiar with MPI, but the project I'm working on urgently needs an MPI implementation that provides some degree of fault tolerance, so I'm asking for your help.
> 
> The newly announced version of MPICH says that
> " collective operations on communicators containing the failed process is undefined, and may give incorrect results or hang some processes."  
> 
> But it seems hard for us to avoid collective communication, and it is also difficult to guarantee that all nodes stay alive, especially in HPC.  So, if I want to modify the MPICH2-1.3.1 source code for the "Barrier" operation, could you give me some advice, for example on the amount of work involved?
> The barrier operation I need should exclude the dead process and include a new process that takes the place of the dead one.
> 
> 
> MPICH2-1.3.1 also states that
> " If a communication operation fails (e.g., due to a process failure) MPICH2 will return an error, and further communication to that process is not possible. However, communication with other processes will still proceed normally." 
> 
> I would like to know:
> 
> 1. What information can be retrieved from the error that MPICH2 returns on a node failure (for example, which process is dead)?
> 
> 2. Can we create a new communicator containing the processes that are still alive after a process failure?  If so, could you list some possible ways to do it?
> 
> Best Regards,
> 
> Rui