<div dir="ltr">Dear MPICH2 team,<div>I am trying to use the fault-tolerance features of MPICH2 in my application.</div><div><br></div><div>The application follows a master/slave model.</div><div>The master and the slaves are all multi-threaded processes.</div>

<div><br></div><div>In the master and in each slave, a separate thread starts by calling MPI_Irecv for every rank, including its own, with MPI_ANY_TAG, and then calls MPI_Waitany.</div><div>When a new message arrives, the thread processes it and calls MPI_Irecv for the sender's rank again.</div>

<div><br></div><div>The slaves send data to the master, and the master only receives data.</div><div><br></div><div><div>Over another communicator, the master calls collective operations such as MPI_Reduce to control the slaves.</div><br class="Apple-interchange-newline">

</div><div>After one of the slaves fails with a floating-point exception, the remaining slaves start receiving a message from rank 0 (the master) with tag 0. After receiving the first such message, the slave's thread calls MPI_Irecv for rank 0 again, then calls MPI_Waitany, and immediately receives the same message again.</div>

<div><br></div>

<div>It is difficult to reproduce this situation in a standalone program.</div><div>Maybe you are familiar with this effect, or it may be a problem in my own logic.</div><div><br></div><div>When the program runs without a failure, none of the slaves receives any message from the master or from another slave.</div>

<div><br></div><div>Anatoly.</div></div>