<div dir="ltr">Sorry, I still don't understand.<div>When a remote process fails, the rest of the processes get SIGUSR1 and, by default, are terminated because they don't have a signal handler installed.</div><div>If I create my own signal handler for SIGUSR1, I can't detect that one of the remote/local processes has died. How can I recognize which remote process died? The signal carries only local host process information.</div>
<div><br></div><div>Anatoly.</div><div><br><br><div class="gmail_quote">On Mon, Oct 17, 2011 at 7:40 PM, Darius Buntinas <span dir="ltr">&lt;<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im"><br>
On Oct 15, 2011, at 4:47 AM, Pavan Balaji wrote:<br>
<br>
><br>
> On 10/11/2011 02:35 PM, Darius Buntinas wrote:<br>
>> I took a look at your code. Mpiexec will send a SIGUSR1 signal to<br>
>> each process to notify it of a failed process (Oops, I forgot about<br>
>> that when I responded to your previous email). If you need a signal<br>
>> for your application, you'll need to choose another one. The signal<br>
>> handler you installed replaced MPICH's signal handler, so the library<br>
>> wasn't able to detect that the process had failed.<br>
><br>
> Anatoly: In stacked libraries, you are supposed to chain signal handlers. Replacing another library's signal handlers can lead to unexpected behavior.<br>
<br>
</div>If you set the signal handler before calling MPI_Init, MPICH will chain your signal handler.<br>
<div class="im"><br>
><br>
>> Another problem is that MPI_Abort() isn't killing all processes, so<br>
>> when I commented out CreateOwnSignalHandler(), the master detected<br>
>> the failure and called MPI_Abort(), but some slave processes were<br>
>> still hanging in MPI_Barrier(). We'll need to fix that.<br>
><br>
> Darius: What's the expected behavior here? Should a regular exit look at whether the user asked for a cleanup or not, and an abort kill all processes?<br>
<br>
</div>That's what I think it should do. MPI_Abort should kill all processes in the specified communicator. If you can't kill only the processes in the communicator, then it should kill all connected processes (i.e., the job, plus any dynamic procs).<br>
<font color="#888888"><br>
-d<br>
</font><div><div></div><div class="h5"><br>
> -- Pavan<br>
><br>
> --<br>
> Pavan Balaji<br>
> <a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
<br>
</div></div></blockquote></div><br></div></div>