[mpich-discuss] communication involves a dead process

王睿 wangraying at gmail.com
Mon Dec 6 03:33:09 CST 2010


Hi all,

I'm a student who has just started learning about fault tolerance in MPI. I'm
not very familiar with MPI, but the project I'm working on urgently needs
an MPI implementation that provides some degree of fault tolerance,
so I'm asking for your help.

The newly announced version of MPICH says that

> " collective operations on communicators containing the failed process
> is undefined, and may give incorrect results or hang some processes."


But it seems hard to avoid collective communication, and equally hard to
guarantee that all nodes stay alive, especially in HPC. So, if I
want to modify the source code of the "Barrier" operation in
MPICH2-1.3.1, could you give me some advice, for example on the amount of
work involved?
The barrier operation I need should exclude the dead process and include
a new process that takes the place of the dead one.
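
To make the membership rule I have in mind concrete, here is a minimal sketch in plain C. It is only bookkeeping and makes no MPI calls. The `barrier_membership` type and the `replace_member` function are hypothetical names of my own, not anything from MPICH2. I am also assuming each process can be identified by a plain integer id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical membership table: slot i is logical rank i of the barrier,
 * and members[i] is the id of the process currently filling that slot. */
typedef struct {
    int *members;   /* caller-owned array of process ids */
    size_t size;    /* number of logical ranks taking part in the barrier */
} barrier_membership;

/* Put spare_id into the slot held by dead_id, so the logical rank survives.
 * Returns the logical rank that was taken over, or -1 if dead_id is not
 * a member. The table size never changes. */
int replace_member(barrier_membership *m, int dead_id, int spare_id)
{
    assert(m != NULL && m->members != NULL);
    for (size_t i = 0; i < m->size; i++) {
        if (m->members[i] == dead_id) {
            m->members[i] = spare_id;
            return (int)i;
        }
    }
    return -1;
}
```

So the barrier would always wait for the same number of logical ranks, and only the process behind a rank would change.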


MPICH2-1.3.1 also states that

> " If a communication operation fails (e.g., due to a process failure)
> MPICH2 will return an error, and further communication to that process is
> not possible. However, communication with other processes will still proceed
> normally."


I would like to know:

1. What information can be retrieved from the error that MPICH2 returns on
a node failure (for example, which process is dead)?

2. Can we create a new communicator containing the processes that are still
alive after a process failure? If so, could you list some possible ways to do it?

Best Regards,

Rui