[mpich-discuss] Fault tolerance on collectives
Anatoly G
anatolyrishon at gmail.com
Tue Feb 28 06:10:16 CST 2012
Dear MPICH2,
The MPICH2 README says the following:
- COLLECTIVES: For collective operations performed on communicators
with a failed process, the collective would return an error on
some, but not necessarily all processes. A collective call
returning MPI_SUCCESS on a given process means that the part of the
collective performed by that process has been successful.
Could you please answer two more specific questions:
- Can a collective operation return MPI_SUCCESS if it is called after one
of the processes in the communicator has already failed (that is, the
failure happened before the call)?
- I do not have a simple example yet, but it looks like a collective
operation executed on a single computer with multiple processes returns
an error (it recognizes that one of the processes has failed), while the
same collective operation executed on a cluster returns MPI_SUCCESS.
Anatoly.