[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Tue Oct 25 10:16:27 CDT 2011


The compilation was performed by my boss. The configuration was:

./configure --with-device=ch3:sock --enable-debuginfo
 --prefix=/space/local/mvapich2 CFLAGS=-fPIC --enable-shared
--enable-threads --enable-sharedlibs=gcc --with-pm=mpd:hydra

mvapich2-1.7rc2

Anatoly.

On Tue, Oct 25, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

>
> Did you configure and compile MPICH2 yourself?  If you did, please send us
> the command you used to configure it (e.g., ./configure --prefix=...).
>
> If you didn't compile it yourself, you'll need to talk to the person who
> did to get that information.
>
> Also, what version of MPICH2 are you using?
>
> -d
>
> On Oct 25, 2011, at 2:30 AM, Anatoly G wrote:
>
> > Initialization lines are:
> > MPI::Init(argc, argv);
> > MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> >
> > Execution command:
> > mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
> /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> >
> > Anatoly.
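[Editor's note: the two initialization lines above can be sketched as a complete program. This is a minimal, hedged sketch, not the poster's actual code: it uses the C binding MPI_Init rather than the C++ MPI::Init shown above, and MPI_Comm_set_errhandler rather than the deprecated MPI-1 name MPI_Errhandler_set; the barrier call is illustrative.]

```c
/* Minimal sketch: switch MPI_COMM_WORLD from the default
 * MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN so the application
 * can inspect error codes instead of being killed on failure. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* MPI_Errhandler_set (used in the post above) is the deprecated
     * MPI-1 name; MPI_Comm_set_errhandler is the MPI-2 replacement. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With MPI_ERRORS_RETURN, every MPI call returns an error code
     * that the application must check itself. */
    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: barrier failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}
```

[Compiled with mpicc and launched via mpiexec.hydra as shown in the command above; it is not runnable without an MPI installation.]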
> >
> >
> > On Mon, Oct 24, 2011 at 10:17 PM, Darius Buntinas <buntinas at mcs.anl.gov>
> wrote:
> >
> > In MPI_Init, the signal handler should be installed, so SIGUSR1 shouldn't
> kill the process.
> >
> > Can you send us the configure line you used?
> >
> > -d
> >
> > On Oct 23, 2011, at 1:54 AM, Anatoly G wrote:
> >
> > > Sorry, I'm still confused.
> > > When a remote process fails, the remaining processes receive SIGUSR1 and
> > > die by default, because they have no signal handler installed.
> > > If I create my own signal handler for SIGUSR1, I can't detect that one of
> > > the remote/local processes has died. How can I recognize which remote
> > > process died? The signal carries only local-process information.
> > >
> > > Anatoly.
> > >
> > >
> > > On Mon, Oct 17, 2011 at 7:40 PM, Darius Buntinas <buntinas at mcs.anl.gov>
> wrote:
> > >
> > > On Oct 15, 2011, at 4:47 AM, Pavan Balaji wrote:
> > >
> > > >
> > > > On 10/11/2011 02:35 PM, Darius Buntinas wrote:
> > > >> I took a look at your code.  Mpiexec will send a SIGUSR1 signal to
> > > >> each process to notify it of a failed process (Oops, I forgot about
> > > >> that when I responded to your previous email).  If you need a signal
> > > >> for your application, you'll need to choose another one.  The signal
> > > >> handler you installed replaced MPICH's signal handler, so the
> library
> > > >> wasn't able to detect that the process had failed.
> > > >
> > > > Anatoly: In stacked libraries, you are supposed to chain signal
> handlers. Replacing another library's signal handlers can lead to unexpected
> behavior.
> > >
> > > If you set the signal handler before calling MPI_Init, MPICH will chain
> your signal handler.
> > >
> > > >
> > > >> Another problem is that MPI_Abort() isn't killing all processes, so
> > > >> when I commented out CreateOwnSignalHandler(), the master detected
> > > >> the failure and called MPI_Abort(), but some slave processes were
> > > >> still hanging in MPI_Barrier().  We'll need to fix that.
> > > >
> > > > Darius: What's the expected behavior here? Should a regular exit look
> at whether the user asked for a cleanup or not, and an abort kill all
> processes?
> > >
> > > That's what I think it should do.  MPI_Abort should kill all processes
> in the specified communicator.  If you can't kill only the processes in the
> communicator, then it should kill all connected processes (i.e., the job,
> plus any dynamic procs).
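[Editor's note: the master/slave recovery pattern under discussion can be sketched as follows. This is a hedged illustration, not the poster's program: ranks, tags, and the receive loop are invented for the example. With MPI_ERRORS_RETURN set, the master checks each receive's return code and calls MPI_Abort on failure, which (per Darius's description of the intended behavior) should kill the whole job rather than leave slaves blocked in MPI_Barrier.]

```c
/* Sketch: a master that detects a failed receive tears the whole
 * job down with MPI_Abort so slaves do not hang in a collective. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int value;
        for (int src = 1; src < size; src++) {
            int rc = MPI_Recv(&value, 1, MPI_INT, src, 0,
                              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rc != MPI_SUCCESS) {
                /* The failed call identifies the peer: the receive
                 * was posted for rank `src`. */
                fprintf(stderr, "receive from rank %d failed; aborting\n",
                        src);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
        }
    } else {
        int value = rank;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

[This also answers the earlier "which remote process died?" question for point-to-point traffic: the rank argument of the failing call tells the master which peer was unreachable; a raw SIGUSR1 by itself carries no such information.]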
> > >
> > > -d
> > >
> > > > -- Pavan
> > > >
> > > > --
> > > > Pavan Balaji
> > > > http://www.mcs.anl.gov/~balaji
> > >
> > > _______________________________________________
> > > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> > > To manage subscription options or unsubscribe:
> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> > >