[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Tue Sep 27 04:09:26 CDT 2011


I am trying to build a master/slave application, which I execute using
mpiexec.hydra.
The master is always the process with rank 0; all other processes are
slaves (currently about 10 processes).
Communication protocol (point to point):
*Slave:*
1) Sends a single integer to the master 1000 times in a loop, using the
MPI_Send operation.
2) Waits in MPI_Recv to receive a single integer from the master.
3) Executes MPI_Finalize().
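The slave side described above can be sketched roughly like this (a minimal sketch; variable names and the tag value are my own assumptions, not from the original code):

```c
/* Sketch of the slave loop: 1000 sends to rank 0, then one receive.
 * Requires an MPI runtime (compile with mpicc, run under mpiexec). */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, val;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    val = rank;  /* assumed payload; the original just says "single integer" */
    for (int i = 0; i < 1000; ++i) {
        /* step 1: send a single integer to the master (rank 0) */
        MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    /* step 2: block until the master sends a single integer back */
    MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* step 3 */
    MPI_Finalize();
    return 0;
}
```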

*Master:*
1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD,
MPI_ERRORS_RETURN);
2) The master cycles over a buffer of slave ranks and listens to each one
of them with an MPI_Recv call on that slave's rank.
The loop runs 1000 * (number of slaves) times.
3) After the loop ends, the master sends the integer 0 to each slave using
the MPI_Send operation.
4) Executes MPI_Finalize().
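The master steps above can be sketched as follows (again a sketch under my own assumptions about names and tags; the cyclic buffer is reduced to a simple modulo over the slave ranks):

```c
/* Sketch of the master loop: cycle over slaves, 1000 receives per slave,
 * then send 0 to each slave. Requires an MPI runtime. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int nprocs, val, zero = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* step 1: make MPI calls return error codes instead of aborting */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int nslaves = nprocs - 1;
    /* step 2: cycle over the slave ranks, 1000 * nslaves receives total */
    for (int i = 0; i < 1000 * nslaves; ++i) {
        int slave = 1 + (i % nslaves);
        MPI_Recv(&val, 1, MPI_INT, slave, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    /* step 3: send 0 to every slave to let it finish */
    for (int r = 1; r <= nslaves; ++r)
        MPI_Send(&zero, 1, MPI_INT, r, 0, MPI_COMM_WORLD);

    /* step 4 */
    MPI_Finalize();
    return 0;
}
```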

*Purpose of application*: tolerance to process failures.
If some of the slaves fail, continue working with the remaining ones.

*Execute command*:
mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
/usr/bin/rsh -f machines.txt -n 11 mpi_send_sync

*machines.txt contains*:
student1-eth1:1
student2-eth1:3
student3-eth1:20

*Execution results*:
1) When I run the application as written above, everything works.
2) Then I try to simulate a failure of one of the slaves by calling the
abort() function on iteration 10 of that slave's loop.
As a result, the master gets a SIGUSR1 signal and fails.

*Questions*:
1) I don't understand what I should do in order to get an error status
from the MPI_Recv call in the master code.
2) In case I use
MPI_Irecv + MPI_Waitany in the master code, how can I recognize which slave
has "died"?
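For what it's worth, here is a sketch of how both error paths might be checked; whether a peer's death is actually reported as an error on MPI_Recv/MPI_Waitany is implementation-dependent, and the names (nslaves, requests, vals) are my own, not from the original code:

```c
/* Sketch: checking error codes with MPI_ERRORS_RETURN installed, and
 * identifying the slave behind a completed/failed MPI_Irecv via the
 * MPI_Waitany index and status.MPI_SOURCE. Requires an MPI runtime. */
#include <mpi.h>
#include <stdio.h>

void master_receive(int nslaves)
{
    int val, index, rc;
    MPI_Status status;

    /* (1) Blocking variant: with MPI_ERRORS_RETURN set on the
     * communicator, test the return code instead of assuming success. */
    for (int slave = 1; slave <= nslaves; ++slave) {
        rc = MPI_Recv(&val, 1, MPI_INT, slave, 0, MPI_COMM_WORLD, &status);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "recv from slave %d failed: %s\n", slave, msg);
            /* e.g. drop this slave from the cyclic buffer and carry on */
        }
    }

    /* (2) Non-blocking variant: post one MPI_Irecv per slave; the index
     * returned by MPI_Waitany maps back to the slave the request was
     * posted for, and status.MPI_SOURCE gives the sender's rank. */
    int vals[nslaves];
    MPI_Request requests[nslaves];
    for (int i = 0; i < nslaves; ++i)
        MPI_Irecv(&vals[i], 1, MPI_INT, i + 1, 0, MPI_COMM_WORLD,
                  &requests[i]);

    rc = MPI_Waitany(nslaves, requests, &index, &status);
    if (rc != MPI_SUCCESS)
        /* index identifies which request, hence which slave, went bad */
        fprintf(stderr, "request for slave %d failed\n", index + 1);
    else
        fprintf(stderr, "got %d from slave %d\n",
                vals[index], status.MPI_SOURCE);
}
```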

Anatoly.
