[mpich-discuss] Question/problem with MPI mvapich hydra.
Darius Buntinas
buntinas at mcs.anl.gov
Tue Sep 27 14:10:42 CDT 2011
Are you calling MPI_Abort or the C library function abort()? MPI_Abort will abort the entire job. I'm not sure why mpiexec would be sending SIGUSR1, though. Try calling exit() instead and see if that works.
Can you send us a short program that demonstrates this?
Thanks,
-d
On Sep 27, 2011, at 4:09 AM, Anatoly G wrote:
> I am trying to build a master/slave application, which I run with mpiexec.hydra.
> The master is always the process with rank 0. All other processes are slaves (currently about 10 processes).
> Communication protocol (point to point):
> Slave:
> 1) Sends a single integer to the master 1000 times in a loop using MPI_Send.
> 2) Waits in MPI_Recv to receive a single integer from the master.
> 3) Calls MPI_Finalize().
>
> Master:
> 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 2) The master cycles through a circular buffer of slave ranks and receives from each one in turn by calling MPI_Recv with that slave's rank.
> The loop runs 1000 * (number of slaves) times.
> 3) After the loop ends, the master sends the integer 0 to each slave using MPI_Send.
> 4) Calls MPI_Finalize().
>
> Purpose of the application: tolerance to process failures.
> If some of the slaves fail, the master should continue working with the remaining ones.
>
> Execute command:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
>
> machines.txt contains:
> student1-eth1:1
> student2-eth1:3
> student3-eth1:20
>
> Execution results:
> 1) When I run the application as described above, everything works.
> 2) When I simulate the failure of one of the slaves by calling abort() on iteration 10 of the slave loop,
> the master gets a SIGUSR1 signal and fails.
>
> Questions:
> 1) What should I do in order to get an error status from MPI_Recv in the master code?
> 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I recognize which slave is "dead"?
>
> Anatoly.
>