[mpich-discuss] Fault tolerance - not stable.

Anatoly G anatolyrishon at gmail.com
Sun Feb 19 08:38:02 CST 2012


Dear MPICH2 team,
I am trying to use the fault-tolerance features of MPICH2 in my application.

The application uses a master/slave model.
The master and the slaves are all multi-threaded processes.

A separate thread in the master and in each slave starts by calling
MPI_Irecv for every rank, including its own, with MPI_ANY_TAG, and then
calls MPI_Waitany. When a new message arrives, the thread processes it and
calls MPI_Irecv for the sender's rank again.

The slaves send data to the master; the master only receives data.

Via another communicator, the master calls collective operations such as
MPI_Reduce to control the slaves.

After one of the slaves fails with a floating-point exception, the remaining
slaves start receiving a message from rank 0 (the master) with tag 0. After
receiving the first such message, the slave's thread calls MPI_Irecv for
rank 0 again, then calls MPI_Waitany, and immediately receives the same
message again.

It is difficult to reproduce this situation in a standalone program.
Maybe you are familiar with this effect, or it may be a problem in my own
logic.

When the program runs without a failure, none of the slaves receives any
message from the master or from another slave.

Anatoly.