[mpich-discuss] Fault tolerance on collectives

Anatoly G anatolyrishon at gmail.com
Tue Feb 28 06:10:16 CST 2012


Dear MPICH2,

The MPICH2 README says the following:

- COLLECTIVES: For collective operations performed on communicators
   with a failed process, the collective would return an error on
   some, but not necessarily all processes. A collective call
   returning MPI_SUCCESS on a given process means that the part of the
   collective performed by that process has been successful.

Could you please answer some more specific questions:

   - Can a collective operation return MPI_SUCCESS if it is called after
   one of the processes in the communicator has already failed (that is,
   the failure happened before the call)?
   - I do not have a simple example yet, but it looks like a collective
   operation run on a single computer with multiple processes returns an
   error (it recognizes that one of the processes has failed), whereas the
   same collective operation run on a cluster returns MPI_SUCCESS.

Anatoly.