[mpich-discuss] Question/problem with MPI mvapich hydra.

Darius Buntinas buntinas at mcs.anl.gov
Mon Oct 17 12:52:06 CDT 2011


Look at Section 7 (Fault Tolerance) in the README file, which explains how to do this.  One thing the README isn't clear about is that the list of failed processes is not updated immediately when the signal handler is called.  The application should therefore call the progress engine (e.g., by calling MPI_Iprobe) after returning from the signal handler and before reading the list of failed processes, to make sure the list has been updated.
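
A minimal sketch of that ordering in C, under stated assumptions: the names on_failure_signal, settle_failure_notice, drive_progress and progress_calls, and the USE_MPI build flag, are hypothetical and not part of MPICH2. The handler only records the notification; the main loop later drives the progress engine with a non-blocking MPI_Iprobe, and only then is the failed-process list read the way README Section 7 describes. Without -DUSE_MPI, a counting stand-in replaces the MPI call so the sketch builds on its own.

```c
#include <signal.h>

#ifdef USE_MPI   /* hypothetical build flag, e.g. mpicc -DUSE_MPI */
#include <mpi.h>
/* Non-blocking probe, issued only so the MPI library can make progress. */
void drive_progress(void)
{
    int flag;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
               &flag, MPI_STATUS_IGNORE);
}
#else
/* Stand-in for builds without MPI: just counts how often it was called. */
int progress_calls = 0;
void drive_progress(void)
{
    progress_calls++;
}
#endif

/* Set by the signal handler, cleared by the main loop. */
static volatile sig_atomic_t failure_notified = 0;

/* Handler body: record the event and return; no MPI calls in here. */
void on_failure_signal(int signo)
{
    (void)signo;
    failure_notified = 1;
}

/* Call from the main loop, outside the signal handler.
 * Returns 1 if a notification was pending and the progress engine has
 * been driven, so the failed-process list may now be read.
 * Returns 0 if nothing was pending. */
int settle_failure_notice(void)
{
    if (!failure_notified)
        return 0;
    failure_notified = 0;   /* clear first so a later signal is not lost */
    drive_progress();
    return 1;
}
```

In the master's main loop, settle_failure_notice() would be called on each iteration; only when it returns 1 is the list of failed processes read. How on_failure_signal gets installed matters as well: per the quoted message, it has to be chained with the handler MPICH2 already set for SIGUSR1.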

-d

On Oct 15, 2011, at 10:25 AM, Pavan Balaji wrote:

> 
> On 10/15/2011 09:03 AM, Anatoly G wrote:
>> The problem is that, on the master side, I need to detect which of the
>> slaves failed, delete it from my distribution list, and continue to work
>> with only the live slaves. The questions are:
>> 1) What should I do to recognize which slave is dead?
> 
> The signal handler that Darius mentioned should work. It's just that if you are using SIGUSR1, you cannot simply overwrite the handler set by MPICH2. You need to chain them, i.e., install your own signal handler, do your work in it, and then call the old signal handler once you are done.
> 
>> 2) How can I get the slave's failure status, i.e., some info about the failure?
> 
> I'll let Darius answer this.
> 
> -- Pavan
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
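
The chaining that Pavan describes in the quoted message can be sketched with POSIX sigaction as follows. This is an illustration under assumptions, not MPICH2 code: chained_handler, install_chained_handler, slave_failure_seen and the stand-in library handler are hypothetical names, and the sketch assumes the previously installed handler is a plain one-argument handler (not an SA_SIGINFO one).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Whatever was installed for the signal before our handler. */
static struct sigaction previous_action;

/* Application-side flag (hypothetical name). */
volatile sig_atomic_t slave_failure_seen = 0;

/* Stand-in for the handler the MPI library installed; for demonstration. */
volatile sig_atomic_t library_handler_ran = 0;
void standin_library_handler(int signo)
{
    (void)signo;
    library_handler_ran = 1;
}

/* Our handler: do the application's work, then call the old handler. */
void chained_handler(int signo)
{
    slave_failure_seen = 1;

    /* Forward only to a real one-argument handler function.
     * SIG_DFL and SIG_IGN are markers, not callable functions, and an
     * SA_SIGINFO handler would need the three-argument form instead. */
    if (!(previous_action.sa_flags & SA_SIGINFO) &&
        previous_action.sa_handler != SIG_DFL &&
        previous_action.sa_handler != SIG_IGN)
        previous_action.sa_handler(signo);
}

/* Install chained_handler for signo and remember the old handler.
 * Returns 0 on success, -1 on error (as sigaction does). */
int install_chained_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = chained_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, &previous_action);
}
```

In a real program, install_chained_handler(SIGUSR1) would be called once the library's own handler is in place. That is presumably after MPI_Init, but the ordering is an assumption here; the README is the place to confirm it.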
