Thank you for your reply.
I'm sorry it took me so long to answer; I was ill.
I call the system abort(), not exit(): exit() is a fairly normal program termination, and I am trying to simulate an unexpected one.
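To make the distinction concrete, this is all I mean by "system abort" (a trivial sketch, no MPI involved, not the real code):

    #include <cstdlib>

    int main()
    {
        // std::exit(0);  // normal termination: atexit handlers run and an exit status is returned
        std::abort();     // abnormal termination: raises SIGABRT and kills the process without cleanup
    }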
The attached .h and .cpp files are an example.
Execution command:
mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines2.txt -n 6 mpi_send_sync 100 1000 3 50 1 logs/test
Anatoly.

On Tue, Sep 27, 2011 at 9:10 PM, Darius Buntinas <buntinas@mcs.anl.gov> wrote:
Are you calling MPI_Abort or the system call abort()? MPI_Abort will abort the entire job. I'm not sure why mpiexec would be sending SIGUSR1, though. Try calling exit() instead and see if that works.

Can you send us a short program that demonstrates this?

Thanks,
-d

On Sep 27, 2011, at 4:09 AM, Anatoly G wrote:

> I am trying to build a master/slave application, which I execute using mpiexec.hydra.
> The master is always the process with rank 0; all the remaining processes are slaves (currently about ~10 processes).
> Communication protocol (point to point); a sketch of both sides follows the list:
> Slave:
> 1) sends a single integer to the master 1000 times in a loop using the MPI_Send operation
> 2) waits in MPI_Recv to receive a single integer from the master
> 3) executes MPI_Finalize()
>
> Master:
> 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 2) The master cycles over a buffer of slave ranks and listens to each of them with an MPI_Recv on that slave's rank.
> The loop is performed 1000 * (number of slaves) times.
> 3) After the end of the loop the master sends 0 as an integer to each slave using
> the MPI_Send operation.
> 4) Executes MPI_Finalize().
>
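> A minimal sketch of this protocol (illustrative only, not the attached code; the real loop counts and sizes come from the command-line arguments):
>
>     #include <mpi.h>
>
>     int main(int argc, char** argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank, size;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>         if (rank != 0) {                        // slave
>             int value = rank;
>             for (int i = 0; i < 1000; ++i)      // 1) send 1000 integers to the master
>                 MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>             int stop;                           // 2) wait for the master's "stop" integer
>             MPI_Recv(&stop, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         } else {                                // master
>             MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);  // 1) errors returned, not fatal
>             int value;
>             for (int i = 0; i < 1000 * (size - 1); ++i) {           // 2) cyclic receive from the slaves
>                 int slave = 1 + i % (size - 1);
>                 MPI_Recv(&value, 1, MPI_INT, slave, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             }
>             int stop = 0;
>             for (int slave = 1; slave < size; ++slave)              // 3) tell every slave to finish
>                 MPI_Send(&stop, 1, MPI_INT, slave, 0, MPI_COMM_WORLD);
>         }
>         MPI_Finalize();                         // 4) both sides finalize
>         return 0;
>     }
>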
> Purpose of the application: tolerance to process failures.
> If some of the slaves fail, continue working with the remaining ones.
>
> Execution command:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
>
> machines.txt contains:
> student1-eth1:1
> student2-eth1:3
> student3-eth1:20
>
> Execution results:
> 1) When I run the application as written above, everything works.
> 2) Then I try to simulate the failure of one of the slaves by calling abort() on iteration 10 of the slave loop.
> As a result, the master gets a SIGUSR1 signal and fails.
>
> Questions:
> 1) I don't understand what I should do in order to get an error status back from the MPI_Recv call in the master code.
> 2) In the case where I use MPI_Irecv + MPI_Waitany in the master code, how can I recognize which slave died? (A sketch of what I am trying follows.)
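> What I have in mind looks roughly like the sketch below. This is an assumption on my side, not something I know to work; in particular I do not know whether the failure is reported through the return code, or whether 'index' / status.MPI_SOURCE is meaningful for a failed request:
>
>     #include <mpi.h>
>     #include <vector>
>
>     // Master-side fragment; assumes MPI_ERRORS_RETURN is already set on MPI_COMM_WORLD.
>     void master_poll(int nslaves)
>     {
>         // Question 1: blocking receive -- is a dead slave reported via the return code?
>         int value;
>         MPI_Status status;
>         int rc = MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
>         if (rc != MPI_SUCCESS) {
>             // hoped-for behaviour: mark slave 1 as dead and keep working with the rest
>         }
>
>         // Question 2: one outstanding MPI_Irecv per slave, then MPI_Waitany.
>         std::vector<MPI_Request> reqs(nslaves);
>         std::vector<int> vals(nslaves);
>         for (int s = 0; s < nslaves; ++s)
>             MPI_Irecv(&vals[s], 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD, &reqs[s]);
>         int index;
>         rc = MPI_Waitany(nslaves, reqs.data(), &index, &status);
>         // If the completion is an error, does 'index' (or status.MPI_SOURCE) identify the dead slave?
>     }
>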
>
> Anatoly.
>
_______________________________________________
mpich-discuss mailing list     mpich-discuss@mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss