[mpich-discuss] disable-auto-cleanup send/receive example
Rob Stewart
R.Stewart at hw.ac.uk
Wed Nov 2 13:21:47 CDT 2011
Hi,
I am trying to understand and use the --disable-auto-cleanup flag in mpich2.
I have written a very simple example in C with MPI.
Here is the code:
http://pastebin.com/daHDtEBA
Here's the output of a successful run:
---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Hello World from process 7 running on machine7
Hello World from process 8 running on machine8
Hello World from process 9 running on machine9
Ready
Now, here's what happens when I run it and kill a process on one of the
nodes. Note that I kill the process with rank 3. By then its
mpi_send_msg() has already executed, and the rank 0 machine has received
and printed a "Hello World" line for this message...
---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Error in MPI_Recv!
Hello World from process 6 running on machine6
Ready
Process 9: Error in MPI_Send!
Process 7: Error in MPI_Send!
Process 8: Error in MPI_Send!
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Perhaps naively, I had thought that because there would be no further
communication with the process I had killed, killing it wouldn't make any
difference to the runtime behaviour. But it did. I had also hoped that
even if a process were killed *before* any communication with it,
mpich2 would just skip the communication attempt and continue, thanks to
--disable-auto-cleanup. So if I were to kill process 7 in my example, I
was hoping to see:
---
$ mpiexec --disable-auto-cleanup -machinefile hosts -n 10 delayed-hello
Hello World from process 1 running on machine1
Hello World from process 2 running on machine2
Hello World from process 3 running on machine3
Hello World from process 4 running on machine4
Hello World from process 5 running on machine5
Hello World from process 6 running on machine6
Hello World from process 8 running on machine8
Hello World from process 9 running on machine9
Ready
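To make the behaviour I am hoping for concrete, here is a minimal sketch of the receive loop on rank 0. It is not the code from the pastebin link. The names collect_greetings, recv_fn and fake_recv are hypothetical, and the MPI call is hidden behind a callback so that the skip-and-continue logic stands on its own. I am assuming the real callback would wrap MPI_Recv with the MPI_ERRORS_RETURN error handler set on MPI_COMM_WORLD, since the default handler (MPI_ERRORS_ARE_FATAL) aborts the job on any error. Whether mpich2 actually reports a dead peer this way is exactly what I am unsure about.

```c
#include <stdio.h>

#define MSG_LEN 64

/* Hypothetical callback: receive one message from rank `src` into `buf`.
 * Returns 0 on success, non-zero on failure. Assumption: in a real program
 * this would wrap MPI_Recv(buf, len, MPI_CHAR, src, tag, MPI_COMM_WORLD,
 * &status) and compare the result against MPI_SUCCESS. */
typedef int (*recv_fn)(int src, char *buf, int len);

/* Ask each rank 1..nprocs-1 for its greeting exactly once. A rank whose
 * receive fails is flagged in failed[] and skipped; the loop carries on.
 * Returns the number of greetings actually received. */
int collect_greetings(int nprocs, recv_fn recv_from, int *failed)
{
    char buf[MSG_LEN];
    int received = 0;

    failed[0] = 0;  /* rank 0 is the receiver itself */
    for (int src = 1; src < nprocs; src++) {
        buf[0] = '\0';  /* never reprint a stale message after an error */
        failed[src] = 0;
        if (recv_from(src, buf, MSG_LEN) != 0) {
            failed[src] = 1;
            fprintf(stderr, "Skipping process %d: receive failed\n", src);
            continue;
        }
        printf("%s\n", buf);
        received++;
    }
    printf("Ready\n");
    return received;
}

/* Stand-in for the MPI layer, for illustration only: pretends that
 * rank 7 has been killed and that every other rank answers normally. */
int fake_recv(int src, char *buf, int len)
{
    if (src == 7)
        return 1;
    snprintf(buf, (size_t)len,
             "Hello World from process %d running on machine%d", src, src);
    return 0;
}
```

Calling collect_greetings(10, fake_recv, failed) with the stand-in prints the greetings for processes 1 to 6, 8 and 9 on stdout, followed by Ready, which is the output shown above.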
Maybe the errors occur because MPI_COMM_WORLD is no longer valid.
Initially, the world had 10 processes, but once I kill one, it has 9.
So each subsequent attempt by a live process to execute the following
call fails:
MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
So, is there any way in mpich2 to "refresh" MPI_COMM_WORLD without a
full MPI_Init?
Something like:
    err = MPI_Send(msg, length, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        /* refresh_comm_world()
           retry_send_msg() */
    }
Other than this, I cannot see an obvious way to take advantage of the
--disable-auto-cleanup flag. Are there any canonical examples of C code
that use this flag?
--
Rob Stewart