[mpich-discuss] Question/problem with MPI mvapich hydra.

Pavan Balaji balaji at mcs.anl.gov
Sat Oct 15 04:47:55 CDT 2011


On 10/11/2011 02:35 PM, Darius Buntinas wrote:
> I took a look at your code.  Mpiexec will send a SIGUSR1 signal to
> each process to notify it of a failed process (Oops, I forgot about
> that when I responded to your previous email).  If you need a signal
> for your application, you'll need to choose another one.  The signal
> handler you installed replaced MPICH's signal handler, so the library
> wasn't able to detect that the process had failed.

Anatoly: In stacked libraries, you are supposed to chain signal
handlers: save the handler that was already installed and call it from
your own. Replacing another library's signal handler outright can lead
to unexpected behavior.
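
A minimal sketch of what chaining could look like with sigaction(),
assuming the library's handler is already installed when the application
installs its own (for example, after MPI_Init). The names
library_handler, app_handler and install_chained_handler are
hypothetical, and library_handler is only a stand-in, not MPICH's
actual handler:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Action that was installed before ours (e.g. by a library). */
static struct sigaction prev_action;

/* Flags set by the handlers so the effect can be observed. */
static volatile sig_atomic_t app_seen = 0;
static volatile sig_atomic_t lib_seen = 0;

/* Stand-in for a library's handler; hypothetical, illustration only. */
static void library_handler(int sig)
{
    (void)sig;
    lib_seen = 1;
}

/* Application handler: do our own work, then forward the signal to
 * whatever handler was installed before us. */
static void app_handler(int sig, siginfo_t *info, void *ctx)
{
    app_seen = 1;

    if (prev_action.sa_flags & SA_SIGINFO) {
        /* Previous handler uses the three-argument form. */
        if (prev_action.sa_sigaction != NULL)
            prev_action.sa_sigaction(sig, info, ctx);
    } else if (prev_action.sa_handler != SIG_DFL &&
               prev_action.sa_handler != SIG_IGN) {
        /* Previous handler uses the one-argument form. */
        prev_action.sa_handler(sig);
    }
}

/* Install app_handler for sig and save the previous action in
 * prev_action. Returns 0 on success, -1 on error (as sigaction does). */
static int install_chained_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = app_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    return sigaction(sig, &sa, &prev_action);
}
```

Calling install_chained_handler(SIGUSR1) after the library has set up
its handler means a later SIGUSR1 reaches both handlers. This sketch
does not re-raise the signal when the previous disposition was SIG_DFL;
whether that matters depends on the application.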

> Another problem is that MPI_Abort() isn't killing all processes, so
> when I commented out CreateOwnSignalHandler(), the master detected
> the failure and called MPI_Abort(), but some slave processes were
> still hanging in MPI_Barrier().  We'll need to fix that.

Darius: What's the expected behavior here? Should a regular exit check
whether the user asked for cleanup, while an abort always kills all
processes?

  -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

