[mpich-discuss] disable-auto-cleanup send/receive example
Darius Buntinas
buntinas at mcs.anl.gov
Thu Nov 3 14:47:48 CDT 2011
Hi Rob,
The problem we're trying to address with the wildcard receives is the case where a process does a blocking wildcard receive (or a nonblocking wildcard receive with a blocking wait) waiting for messages from some set of processes. If all of the processes in that set failed, then the receive (or wait) will block forever (all of the potential senders are dead).
So to handle that case, the fault tolerance working group at the MPI Forum is proposing to make all wildcard receives return an error when a process fails. We do this by setting the communicator into an "ANY_SOURCE disabled" state. In this state, any wildcard receives already posted, as well as any new wildcard receives, will return an error. A new function is proposed to allow the application to re-enable wildcard receives if it decides that some potential senders are still alive.
That's what is being proposed at the Forum, but it's different from the current MPICH2 implementation. In MPICH2, only already-posted wildcard receives are completed with an error when a failure is detected; you should still be able to post new wildcard receives after the failure is detected. It looks like in your program the MPICH2 library was detecting the same failure more than once, so several wildcard receives returned an error. That is why the sleep(1) helped with that problem, and the Iprobe was there so that the failure message from the process manager could be handled.
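
To make the difference between the two behaviours concrete, here is a toy model in plain C. It is only a sketch and not MPI code: every name in it (toy_comm, toy_on_failure, toy_post_wildcard, toy_reenable_any_source) is made up for illustration, and the real proposal and the MPICH2 internals may differ in detail.

```c
#include <stdbool.h>

/* Toy model only: none of these names are MPI or MPICH2 APIs. */
enum policy { POLICY_FORUM_PROPOSAL, POLICY_MPICH2_CURRENT };
enum recv_status { RECV_PENDING, RECV_ERROR };

struct toy_comm {
    enum policy policy;
    int posted_wildcards;      /* wildcard receives still waiting for a sender */
    bool any_source_disabled;  /* only ever set under the Forum proposal */
};

/* A process failure is detected: every already-posted wildcard receive
 * completes with an error under both policies. Returns how many did. */
int toy_on_failure(struct toy_comm *c)
{
    int failed = c->posted_wildcards;
    c->posted_wildcards = 0;
    if (c->policy == POLICY_FORUM_PROPOSAL)
        c->any_source_disabled = true;  /* new wildcard receives fail too */
    return failed;
}

/* Post a new wildcard receive. It errors out immediately only while the
 * communicator is in the "ANY_SOURCE disabled" state. */
enum recv_status toy_post_wildcard(struct toy_comm *c)
{
    if (c->any_source_disabled)
        return RECV_ERROR;
    c->posted_wildcards++;
    return RECV_PENDING;
}

/* Hypothetical stand-in for the proposed re-enable function. */
void toy_reenable_any_source(struct toy_comm *c)
{
    c->any_source_disabled = false;
}
```

In this model, after toy_on_failure() a communicator using POLICY_MPICH2_CURRENT accepts a new toy_post_wildcard() right away, while one using POLICY_FORUM_PROPOSAL returns RECV_ERROR until the application calls toy_reenable_any_source().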
We are currently working on implementing the stuff in the MPI Forum proposal.
-d
On Nov 3, 2011, at 1:58 PM, Rob Stewart wrote:
> Darius: Lastly, I am somewhat alarmed by your earlier comment: "all wildcard (i.e., MPI_ANY_SOURCE) receives will also complete with an error". I read this to mean that, for fault tolerant MPICH2 programs, one cannot use unspecified senders.
>
> Unspecified senders are fairly common: load balancing work, for instance, or task farms, divide-and-conquer parallelism, map-reduce, etc. And for my intended use of mpich2, there is no way for me to specify known ranks for sending and receiving. Is there a way to turn this behaviour off?