[mpich-discuss] disable-auto-cleanup send/receive example

Rob Stewart R.Stewart at hw.ac.uk
Wed Nov 2 13:21:47 CDT 2011


Hi,

I am trying to understand and use the --disable-auto-cleanup flag in mpich2.

I have written a very simple example in C with MPI.

Here is the code:

http://pastebin.com/daHDtEBA

Here's the output of a successful run:
---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Hello World from process 7 running on machine7
Hello World from process 8 running on machine8
Hello World from process 9 running on machine9
Ready


Now, here's what happens when I run it and kill a process on one of the nodes.
Note that I kill the process with rank 3 (process 3). By that point its
mpi_send_msg() has already been executed, and the rank 0 machine has received
and printed a "Hello World" line for that message...

---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Ready
Process 9: Error in MPI_Send!
Process 7: Error in MPI_Send!
Process 8: Error in MPI_Send!
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)


Perhaps naively, I had thought that because there would be no further 
communication with the process I had killed, it wouldn't make any 
difference to the runtime behaviour. But it did. I had also hoped that 
even if a process were killed *before* it had communicated with rank 0, 
mpich2 would just skip the communication attempt and continue, thanks to 
--disable-auto-cleanup. So if I were to kill process 7 in my example, I 
was hoping to see:

---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Hello World from process 8 running on machine8
Hello World from process 9 running on machine9
Ready
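
The skip-and-continue behaviour hoped for above can be sketched as a receive loop that records a failed rank and moves on. This is only an illustration of the control flow, written without MPI so it stands alone; it is not the code from the paste. The names recv_fn, collect_greetings and fake_recv are hypothetical. In an MPI program the callback would presumably wrap MPI_Recv on a communicator whose error handler is MPI_ERRORS_RETURN, so that a failure comes back as an error code instead of aborting the job.

```c
#include <assert.h>
#include <stdio.h>

#define MSG_LEN 128

/* Hypothetical receive callback: fills buf and returns 0 on success,
   nonzero on failure. In an MPI program it would wrap MPI_Recv. */
typedef int (*recv_fn)(int source, char *buf, int len);

/* Try every rank 1..nprocs-1 once. A failure is recorded in failed[]
   and the loop carries on with the next rank instead of giving up.
   Returns the number of messages actually received. */
int collect_greetings(int nprocs, recv_fn recv_from, int *failed)
{
    char buf[MSG_LEN];
    int received = 0;

    assert(nprocs > 0);
    failed[0] = 0;                      /* rank 0 is the receiver itself */
    for (int src = 1; src < nprocs; src++) {
        failed[src] = (recv_from(src, buf, MSG_LEN) != 0);
        if (failed[src]) {
            fprintf(stderr, "Skipping process %d\n", src);
            continue;
        }
        printf("%s\n", buf);
        received++;
    }
    return received;
}

/* Stand-in for the network (an assumption for this sketch):
   pretends that process 7 has been killed. */
int fake_recv(int source, char *buf, int len)
{
    if (source == 7)
        return 1;
    snprintf(buf, (size_t)len,
             "Hello World from process %d running on machine%d",
             source, source);
    return 0;
}
```

Calling collect_greetings(10, fake_recv, failed) with an int failed[10] array prints the greetings for every rank except 7 and sets failed[7], which is the shape of the output listed above.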


Maybe it is because MPI_COMM_WORLD is no longer valid. 
Initially, the world had 10 nodes, but when I kill a process, it has 9. 
So each subsequent attempt from a live node to execute the following call fails:

MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);

So, is there any way in mpich2 to "refresh" MPI_COMM_WORLD without a 
full MPI_Init?

Something like:

err = MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
if (err != MPI_SUCCESS) {
    /* refresh_comm_world()
       retry_send_msg() */
}
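
The refresh-then-retry idea above can be written out as a bounded retry loop. This is a sketch of the control flow only: send_fn, refresh_fn, send_with_retry and the two test doubles are hypothetical names, and whether mpich2 offers anything that could play the role of the refresh step is exactly the open question. The callbacks stand in for the MPI_Send call and the wished-for refresh_comm_world().

```c
#include <assert.h>

/* Hypothetical callbacks: each returns 0 on success, nonzero on failure. */
typedef int (*send_fn)(void *ctx);
typedef int (*refresh_fn)(void *ctx);

/* Attempt the send. After a failed attempt, run the refresh step and
   try again, allowing at most max_retries extra attempts.
   Returns 0 once a send succeeds, -1 if the refresh step fails or
   every attempt has been used up. */
int send_with_retry(send_fn send_msg, refresh_fn refresh,
                    void *ctx, int max_retries)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (send_msg(ctx) == 0)
            return 0;
        if (attempt < max_retries && refresh(ctx) != 0)
            return -1;
    }
    return -1;
}

/* Test doubles (assumptions for this sketch): the send fails until
   the refresh step has run once. ctx points at an int flag. */
int flaky_send(void *ctx)     { return *(int *)ctx ? 0 : 1; }
int mark_refreshed(void *ctx) { *(int *)ctx = 1; return 0; }
```

Bounding the retries keeps a permanently dead peer from hanging the sender; with max_retries set to 0 the helper reduces to a single checked send.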


Other than this, I cannot see an obvious way to take advantage of the 
--disable-auto-cleanup flag. Are there any canonical examples of C code, 
using this flag?


-- 
Rob Stewart





