[mpich-discuss] disable-auto-cleanup send/receive example

Pavan Balaji balaji at mcs.anl.gov
Wed Nov 2 14:24:32 CDT 2011


I don't claim that I fully understand the problem here, but here are a 
few notes:

You cannot just "refresh" MPI_COMM_WORLD by resizing it, as the process 
ranks will be completely messed up if you do that. What you really want 
to do is create a new communicator with the remaining "alive" processes. 
You can use the MPIX_Group_comm_create function (see 
src/mpix/comm/group_comm.c) for this, but remember that in the next 
release the function name will change.
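
Roughly, the pattern looks like the sketch below. This is only a sketch: it
assumes the MPIX_Group_comm_create prototype currently in
src/mpix/comm/group_comm.c (comm, group, tag, newcomm -- double-check the
source for the exact signature), and it leaves the detection of which ranks
have failed (failed_ranks / nfailed below) entirely up to the application:

#include <mpi.h>

/* Build a communicator over the surviving processes.  'failed_ranks' and
 * 'nfailed' are assumed to be filled in by whatever failure detection the
 * application does (e.g. from MPI_Send/MPI_Recv error returns). */
MPI_Comm shrink_world(int *failed_ranks, int nfailed)
{
    MPI_Group world_grp, alive_grp;
    MPI_Comm  newcomm = MPI_COMM_NULL;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    /* Drop the dead ranks from the world group... */
    MPI_Group_excl(world_grp, nfailed, failed_ranks, &alive_grp);

    /* ...and create a communicator over the survivors.  Ranks in 'newcomm'
     * are renumbered 0..N-1, so they will generally not match the old
     * MPI_COMM_WORLD ranks -- which is exactly why you cannot just
     * "refresh" MPI_COMM_WORLD in place. */
    MPIX_Group_comm_create(MPI_COMM_WORLD, alive_grp, 0 /* tag */, &newcomm);

    MPI_Group_free(&alive_grp);
    MPI_Group_free(&world_grp);
    return newcomm;
}

Also note that for MPI_Send/MPI_Recv to return an error code at all, the
communicator's error handler must be MPI_ERRORS_RETURN, e.g.
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN); the default,
MPI_ERRORS_ARE_FATAL, aborts before your code can react.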

  -- Pavan

On 11/02/2011 01:21 PM, Rob Stewart wrote:
> Hi,
>
> I am trying to understand and use the --disable-auto-cleanup flag in mpich2.
>
> I have written a very simple example in C with mpi.
>
> Here is the code:
>
> http://pastebin.com/daHDtEBA
>
> Here's the output of a successful run:
> ---
> $ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
> Hello World from process 1 running on machine1
> Hello World from process 2 running on machine2
> Hello World from process 3 running on machine3
> Hello World from process 4 running on machine4
> Hello World from process 5 running on machine5
> Hello World from process 6 running on machine6
> Hello World from process 7 running on machine7
> Hello World from process 8 running on machine8
> Hello World from process 9 running on machine9
> Ready
>
>
> Now, here's what happens when I run it and kill a process on one of the
> nodes. Note that I kill the process with rank 3. So by that point its
> mpi_send_msg() has already been executed, and the rank-0 machine has
> received and printed a "Hello World" line for that message...
>
> ---
> $ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
> Hello World from process 1 running on machine1
> Hello World from process 2 running on machine2
> Hello World from process 3 running on machine3
> Hello World from process 4 running on machine4
> Hello World from process 5 running on machine5
> Hello World from process 6 running on machine6
> Error in MPI_Recv!
> Hello World from process 6 running on machine6
> Error in MPI_Recv!
> Hello World from process 6 running on machine6
> Error in MPI_Recv!
> Hello World from process 6 running on machine6
> Ready
> Process 9: Error in MPI_Send!
> Process 7: Error in MPI_Send!
> Process 8: Error in MPI_Send!
> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>
>
> Perhaps naively, I had thought that because there would be no further
> communication with the process I had killed, it wouldn't make any
> difference to the runtime behaviour. But it did. I had also hoped that
> even if a process were killed *before* any communication with it, mpich2
> would just skip the communication attempt and continue, thanks to
> --disable-auto-cleanup. So if I were to kill process 7 in my example, I
> was hoping to see:
>
> ---
> $ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
> Hello World from process 1 running on machine1
> Hello World from process 2 running on machine2
> Hello World from process 3 running on machine3
> Hello World from process 4 running on machine4
> Hello World from process 5 running on machine5
> Hello World from process 6 running on machine6
> Hello World from process 8 running on machine8
> Hello World from process 9 running on machine9
> Ready
>
>
> Maybe it is because MPI_COMM_WORLD is no longer valid. Initially the
> world had 10 processes, but once I kill one it only has 9. So each
> subsequent attempt by the live processes to execute:
>
> MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
>
> fails. So, is there any way in mpich2 to "refresh" MPI_COMM_WORLD without
> a full MPI_INIT?
>
> Something like:
>
> err = MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
> if (err != MPI_SUCCESS) {
>     /* refresh_comm_world();
>        retry_send_msg(); */
> }
>
>
> Other than this, I cannot see an obvious way to take advantage of the
> --disable-auto-cleanup flag. Are there any canonical examples of C code
> using this flag?
>
>
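
To make the retry idea in your last snippet concrete (again only a sketch,
building on the hypothetical shrink_world() helper above, and assuming the
application somehow knows which rank died):

    char msg[64] = "Hello World ...";
    int  length = 64, tag = 0, err;
    MPI_Comm comm = MPI_COMM_WORLD;

    err = MPI_Send(msg, length, MPI_CHAR, 0, tag, comm);
    if (err != MPI_SUCCESS) {
        int failed_ranks[1] = { 3 };           /* rank observed to have died */
        comm = shrink_world(failed_ranks, 1);  /* survivors, ranks renumbered */
        err = MPI_Send(msg, length, MPI_CHAR, 0, tag, comm);
    }

The receiver (rank 0) has to rebuild the same communicator with the same
group, of course, or the retried send will never be matched.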

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

