[mpich-discuss] Question/problem with MPI mvapich hydra.

Darius Buntinas buntinas at mcs.anl.gov
Tue Sep 27 14:10:42 CDT 2011


Are you calling MPI_Abort or the C library function abort()? MPI_Abort will abort the entire job. I'm not sure why mpiexec would be sending SIGUSR1, though. Try calling exit() instead and see if that works.

Can you send us a short program that demonstrates this?

Thanks,
-d


On Sep 27, 2011, at 4:09 AM, Anatoly G wrote:

> I am trying to build a master/slave application, which I execute using mpiexec.hydra.
> The master is always the process with rank 0. All other processes are slaves (currently about 10 processes).
> Communication protocol (point to point):
> Slave:
> 1) Sends a single integer to the master 1000 times in a loop using MPI_Send.
> 2) Waits in MPI_Recv to receive a single integer from the master.
> 3) Calls MPI_Finalize().
> 
> Master:
> 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 2) The master iterates over a cyclic buffer of slave ranks and listens to each one in turn by calling MPI_Recv with that slave's rank as the source.
> The loop runs 1000 * (number of slaves) times.
> 3) After the loop ends, the master sends the integer 0 to each slave using MPI_Send.
> 4) Calls MPI_Finalize().
> 
> Purpose of the application: tolerance to process failures.
> If some of the slaves fail, the master should continue working with the remaining ones.
> 
> Execute command:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> 
> machines.txt contains:
> student1-eth1:1
> student2-eth1:3
> student3-eth1:20
> 
> Execution results:
> 1) When I run the application as described above, everything works.
> 2) When I simulate the failure of one of the slaves by calling abort() on iteration 10 of the slave loop, the master receives a SIGUSR1 signal and fails.
> 
> Questions:
> 1) What should I do in order to get an error status from MPI_Recv in the master code?
> 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I recognize which slave is dead?
> 
> Anatoly.
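
For the bookkeeping side of question 2 and the goal of continuing with the remaining slaves, a minimal sketch in plain C (no MPI calls) might look like the following. The names slave_ring, ring_init, ring_mark_dead, ring_next, ring_rank_from_request_index and MAX_RANKS are hypothetical; they are not part of MPI or of the original program. The sketch assumes rank 0 is the master and that request i in an MPI_Irecv array was posted for rank i + 1. Whether MPI_Recv or MPI_Waitany actually returns an error after a slave dies depends on the MPI implementation and is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64  /* hypothetical upper bound on the job size */

/* Hypothetical master-side bookkeeping: which slave ranks are still usable. */
typedef struct {
    int  nprocs;            /* total ranks in the job, master included */
    bool alive[MAX_RANKS];  /* alive[r] stays true while slave r is usable */
    int  cursor;            /* slave rank returned by the last ring_next() */
} slave_ring;

/* Mark every rank except the master (rank 0) as an alive slave. */
static void ring_init(slave_ring *ring, int nprocs)
{
    assert(nprocs >= 1 && nprocs <= MAX_RANKS);
    ring->nprocs = nprocs;
    ring->cursor = 0;  /* so that the first ring_next() starts at rank 1 */
    for (int r = 0; r < nprocs; r++)
        ring->alive[r] = (r != 0);
}

/* Record that a slave failed so it is skipped from now on. */
static void ring_mark_dead(slave_ring *ring, int rank)
{
    assert(rank > 0 && rank < ring->nprocs);
    ring->alive[rank] = false;
}

/* Assumption: request i was posted for slave rank i + 1, so an index
 * reported by a wait call maps back to a rank by adding one. */
static int ring_rank_from_request_index(int index)
{
    return index + 1;
}

/* Return the next alive slave rank in cyclic order, or -1 if none is left. */
static int ring_next(slave_ring *ring)
{
    int nslaves = ring->nprocs - 1;
    for (int step = 1; step <= nslaves; step++) {
        /* Slave ranks are 1..nslaves; advance cyclically within that range. */
        int rank = (ring->cursor - 1 + step + nslaves) % nslaves + 1;
        if (ring->alive[rank]) {
            ring->cursor = rank;
            return rank;
        }
    }
    return -1;
}
```

In the master loop, one might call ring_next() to pick the source rank for MPI_Recv and call ring_mark_dead() when the return code is not MPI_SUCCESS. With MPI_Waitany, the returned index would first go through ring_rank_from_request_index().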
