[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Tue Oct 25 02:30:33 CDT 2011


Initialization lines are:
MPI::Init(argc, argv);
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

Execution command:
mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
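
For reference, a minimal sketch of the same setup in the plain C API (my own illustration, not the actual mpi_send_sync program): MPI_Errhandler_set was deprecated in MPI-2 in favor of MPI_Comm_set_errhandler, and the MPI::Init C++ bindings are likewise deprecated.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Make failed MPI calls return an error code instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* ... application code (sends, barriers, etc.) ... */

    MPI_Finalize();
    return 0;
}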

Anatoly.


On Mon, Oct 24, 2011 at 10:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

>
> In MPI_Init, the signal handler should be installed, so SIGUSR1 shouldn't
> kill the process.
>
> Can you send us the configure line you used?
>
> -d
>
> On Oct 23, 2011, at 1:54 AM, Anatoly G wrote:
>
> > Sorry, I still don't understand.
> > When a remote process fails, the rest of the processes receive SIGUSR1 and
> > are killed by default, because they don't have any signal handler.
> > If I install a signal handler for SIGUSR1, I can't detect that one of the
> > remote/local processes died. How can I recognize which remote process died?
> > The signal carries only local host process information.
> >
> > Anatoly.
> >
> >
> > On Mon, Oct 17, 2011 at 7:40 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> >
> > On Oct 15, 2011, at 4:47 AM, Pavan Balaji wrote:
> >
> > >
> > > On 10/11/2011 02:35 PM, Darius Buntinas wrote:
> > >> I took a look at your code.  Mpiexec will send a SIGUSR1 signal to
> > >> each process to notify it of a failed process (Oops, I forgot about
> > >> that when I responded to your previous email).  If you need a signal
> > >> for your application, you'll need to choose another one.  The signal
> > >> handler you installed replaced MPICH's signal handler, so the library
> > >> wasn't able to detect that the process had failed.
> > >
> > > Anatoly: In stacked libraries, you are supposed to chain signal
> > > handlers. Replacing another library's signal handlers can lead to
> > > unexpected behavior.
> >
> > If you set the signal handler before calling MPI_Init, MPICH will chain
> > your signal handler.
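
A minimal sketch of the chaining described above (my own illustration, not code from MPICH or this thread), assuming MPICH installed an ordinary sa_handler-style handler during MPI_Init; the handler and flag names are made up. If the handler is registered after MPI_Init, the previous action has to be saved and forwarded manually; registering it before MPI_Init instead lets MPICH do the chaining for you.

#include <mpi.h>
#include <signal.h>

static struct sigaction prev_usr1;              /* previously installed SIGUSR1 action */
static volatile sig_atomic_t failure_seen = 0;  /* illustrative application-level flag */

static void chained_usr1(int sig)
{
    failure_seen = 1;                           /* remember the notification */
    /* Forward to the saved handler (assumed to be a plain sa_handler, not an
     * SA_SIGINFO handler) so the library still sees the event. */
    if (prev_usr1.sa_handler != SIG_DFL && prev_usr1.sa_handler != SIG_IGN)
        prev_usr1.sa_handler(sig);
}

static void install_chained_usr1_handler(void)  /* call after MPI_Init */
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = chained_usr1;
    sigaction(SIGUSR1, &sa, &prev_usr1);        /* save the old action for chaining */
}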
> >
> > >
> > >> Another problem is that MPI_Abort() isn't killing all processes, so
> > >> when I commented out CreateOwnSignalHandler(), the master detected
> > >> the failure and called MPI_Abort(), but some slave processes were
> > >> still hanging in MPI_Barrier().  We'll need to fix that.
> > >
> > > Darius: What's the expected behavior here? Should a regular exit look
> > > at whether the user asked for a cleanup or not, and an abort kill all
> > > processes?
> >
> > That's what I think it should do.  MPI_Abort should kill all processes in
> > the specified communicator.  If you can't kill only the processes in the
> > communicator, then it should kill all connected processes (i.e., the job,
> > plus any dynamic procs).
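
A hedged illustration of the failure path discussed above (not the actual test program): with MPI_ERRORS_RETURN set on MPI_COMM_WORLD, the master checks return codes and calls MPI_Abort when a peer becomes unreachable. The recv_or_abort helper and its arguments are hypothetical.

#include <stdio.h>
#include <mpi.h>

/* Hypothetical helper: receive one int from a slave and abort the whole job
 * if the receive reports a failure. */
static void recv_or_abort(int slave_rank)
{
    int value;
    MPI_Status status;
    int rc = MPI_Recv(&value, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD, &status);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "recv from rank %d failed (rc=%d); aborting job\n",
                slave_rank, rc);
        MPI_Abort(MPI_COMM_WORLD, rc);  /* expected to kill all connected processes */
    }
}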
> >
> > -d
> >
> > > -- Pavan
> > >
> > > --
> > > Pavan Balaji
> > > http://www.mcs.anl.gov/~balaji
> >