[mpich-discuss] Question/problem with MPI mvapich hydra.
Pavan Balaji
balaji at mcs.anl.gov
Sat Oct 15 04:47:55 CDT 2011
On 10/11/2011 02:35 PM, Darius Buntinas wrote:
> I took a look at your code. Mpiexec will send a SIGUSR1 signal to
> each process to notify it of a failed process (Oops, I forgot about
> that when I responded to your previous email). If you need a signal
> for your application, you'll need to choose another one. The signal
> handler you installed replaced MPICH's signal handler, so the library
> wasn't able to detect that the process had failed.
Anatoly: When libraries are stacked in one process, you are supposed to
chain signal handlers: save the handler that was already installed and
call it from your own. Replacing another library's signal handler can
lead to unexpected behavior.
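
A minimal sketch of what chaining could look like with POSIX sigaction(),
assuming a single-threaded process. All names here (library_handler,
chained_handler, install_chained_handler, demo_chaining) are hypothetical;
library_handler only stands in for whatever handler a lower library
installed and is not MPICH's actual handler. The demo uses SIGUSR2 so it
stays clear of the SIGUSR1 that mpiexec sends.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Action that was installed before ours (e.g. by a lower library). */
struct sigaction previous_action;

/* Flags set from the handlers; sig_atomic_t is safe to write there. */
volatile sig_atomic_t app_signal_seen = 0;
volatile sig_atomic_t library_signal_seen = 0;

/* Hypothetical stand-in for a handler installed by another library. */
void library_handler(int signo)
{
    (void)signo;
    library_signal_seen = 1;
}

/* Our handler: do the application's work, then forward the signal to
 * the previously installed handler instead of swallowing it. */
void chained_handler(int signo, siginfo_t *info, void *context)
{
    app_signal_seen = 1;

    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, context);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

/* Save the current action for signo, then install chained_handler.
 * Returns 0 on success, -1 on failure (same convention as sigaction). */
int install_chained_handler(int signo)
{
    struct sigaction action;

    /* Query first so previous_action is valid before our handler can run. */
    if (sigaction(signo, NULL, &previous_action) != 0)
        return -1;

    memset(&action, 0, sizeof action);
    action.sa_sigaction = chained_handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO | SA_RESTART;

    return sigaction(signo, &action, NULL);
}

/* Demo: install the stand-in library handler, chain ours on top, raise
 * the signal, and report whether both handlers ran (1) or not (0). */
int demo_chaining(void)
{
    struct sigaction lib;

    memset(&lib, 0, sizeof lib);
    lib.sa_handler = library_handler;
    sigemptyset(&lib.sa_mask);
    lib.sa_flags = 0;

    if (sigaction(SIGUSR2, &lib, NULL) != 0)
        return 0;
    if (install_chained_handler(SIGUSR2) != 0)
        return 0;

    raise(SIGUSR2);
    return app_signal_seen && library_signal_seen;
}
```

The important part is the order: the application has to install its
handler after the library has installed its own (for MPI, that would
presumably mean after MPI_Init), otherwise there is nothing saved to
forward to. Even with chaining, picking a signal the library does not use
is the safer choice.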
> Another problem is that MPI_Abort() isn't killing all processes, so
> when I commented out CreateOwnSignalHandler(), the master detected
> the failure and called MPI_Abort(), but some slave processes were
> still hanging in MPI_Barrier(). We'll need to fix that.
Darius: What's the expected behavior here? Should a regular exit check
whether the user asked for cleanup, while an abort always kills all
processes?
-- Pavan
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji