[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Sat Oct 15 09:03:00 CDT 2011

The problem is that I need the master to detect which one of the slaves
failed, remove it from my distribution list, and continue working with only
the live slaves. The questions are:
1) What should I do to recognize which slave died?
2) How can I get the slave's failure status: some information about the failure?
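[Editor's sketch, not from the original thread: one common way to detect a failed peer is to replace the communicator's default error handler (MPI_ERRORS_ARE_FATAL) with MPI_ERRORS_RETURN and inspect the error codes that communication calls return. Whether status.MPI_SOURCE is usable after a failure depends on the implementation's fault-tolerance support; treat the snippet as an assumption-laden illustration. Run under mpiexec.]

```c
/* Sketch: detecting a failed slave by checking MPI error codes
 * instead of letting the library abort the whole job. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Default is MPI_ERRORS_ARE_FATAL; switch to MPI_ERRORS_RETURN
     * so failures surface as error codes rather than job abort. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int buf = 0;
    MPI_Status status;
    int rc = MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &status);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);     /* human-readable failure info */
        fprintf(stderr, "recv failed: %s\n", msg);
        /* status.MPI_SOURCE may name the failed rank here, but only if
         * the MPI implementation supports it after a process failure. */
    }

    MPI_Finalize();
    return 0;
}
```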


On Sat, Oct 15, 2011 at 11:47 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

> On 10/11/2011 02:35 PM, Darius Buntinas wrote:
>> I took a look at your code.  Mpiexec will send a SIGUSR1 signal to
>> each process to notify it of a failed process (Oops, I forgot about
>> that when I responded to your previous email).  If you need a signal
>> for your application, you'll need to choose another one.  The signal
>> handler you installed replaced MPICH's signal handler, so the library
>> wasn't able to detect that the process had failed.
> Anatoly: In stacked libraries, you are supposed to chain signal handlers.
> Replacing another library's signal handlers can lead to unexpected behavior.
>> Another problem is that MPI_Abort() isn't killing all processes, so
>> when I commented out CreateOwnSignalHandler(), the master detected
>> the failure and called MPI_Abort(), but some slave processes were
>> still hanging in MPI_Barrier().  We'll need to fix that.
> Darius: What's the expected behavior here? Should a regular exit look at
> whether the user asked for a cleanup or not, and an abort kill all
> processes?
>  -- Pavan
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss