[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Tue Dec 6 01:40:52 CST 2011


I" currently use *mpich2*.
*Configuration*:

   I'm using a plain ./configure with no options other than the prefix:

   ./configure --prefix=/space/local/mpich2

*Execution:*

mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
/usr/bin/rsh -f machines2.txt -n 10 mpi_send_rec_testany 1000 10000 2 20 1
logs/res_test


*Output on screen for a successful run:*

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception
(signal 8)

This typically refers to a problem with your application.

Please see the FAQ page for debugging suggestions

*Output on screen for an unsuccessful run:*

control_cb (./pm/pmiserv/pmiserv_cb.c:321): assert (!closed) failed

HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback
returned error status

HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error
waiting for event

[mpiexec at student1-ib0] main (./ui/mpich/mpiexec.c:420): process manager
error waiting for completion


The program is not stable. It recognizes the failure of slave 1 as expected,
but does not always complete successfully.

Currently I"m ignore SIGUSR1 (overwrite of signal handler).


Can you please tell me:


   - What should I do to stabilize my test? How should I handle SIGUSR1 in
   the case of a failure?
   - This test uses a polling mechanism implemented via MPI_Test per
   connection. What should I do to get the same results using MPI_Waitany?
   How can I recognize the rank of a failed process, exclude it from
   communication, and continue working with the surviving processes? (A
   sketch follows this list.)
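
To illustrate the second question, here is a minimal sketch of the
MPI_Waitany structure I have in mind (not the attached test; it assumes
that, with MPI_ERRORS_RETURN set, a wait involving a dead rank completes
with a non-MPI_SUCCESS return code and the index of the broken request):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Master: one outstanding receive per slave.
        std::vector<MPI_Request> reqs(size - 1);
        std::vector<int>         bufs(size - 1);
        std::vector<bool>        alive(size - 1, true);
        for (int i = 1; i < size; ++i)
            MPI_Irecv(&bufs[i - 1], 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                      &reqs[i - 1]);

        int remaining = size - 1;
        while (remaining > 0) {
            int idx;
            MPI_Status st;
            int rc = MPI_Waitany(size - 1, reqs.data(), &idx, &st);
            if (idx == MPI_UNDEFINED)
                break;                      // no active requests left
            if (rc != MPI_SUCCESS) {
                // Treat rank idx+1 as dead: exclude it from further
                // communication (later sends would check alive[] first).
                std::fprintf(stderr, "rank %d failed, excluding it\n", idx + 1);
                alive[idx] = false;
                reqs[idx]  = MPI_REQUEST_NULL;   // defensive; simplified
            }
            --remaining;                    // completed or abandoned
        }
    } else {
        int v = rank;
        MPI_Send(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}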





On Tue, Oct 25, 2011 at 5:27 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

> It looks like you're using MVAPICH.  I don't believe MVAPICH supports the
> fault-tolerance features you're looking for.
>
> You'll need to use MPICH2 with the default channel (i.e., don't specify
> --with-device=...), leave out --enable-threads (it's the default already)
> and don't use mpd (leave out --with-pm=...)
>
> -d
>
> On Oct 25, 2011, at 10:16 AM, Anatoly G wrote:
>
> > Compilation performed by my boss.
> > Configuration:
> > ./configure --with-device=ch3:sock --enable-debuginfo
> --prefix=/space/local/mvapich2 CFLAGS=-fPIC --enable-shared
> --enable-threads --enable-sharedlibs=gcc --with-pm=mpd:hydra
> >
> > mvapich2-1.7rc2
> >
> >
> > Anatoly.
> >
> > On Tue, Oct 25, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov>
> wrote:
> >
> > Did you configure and compile MPICH2 yourself?  If you did, please send
> us the command you used to configure it (e.g., ./configure --prefix=...).
> >
> > If you didn't compile it yourself, you'll need to talk to the person who
> did to get that information.
> >
> > Also, what version of MPICH2 are you using?
> >
> > -d
> >
> > On Oct 25, 2011, at 2:30 AM, Anatoly G wrote:
> >
> > > Initialization lines are:
> > > MPI::Init(argc, argv);
> > > MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> > >
> > > Execution command:
> > > mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
> /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> > >
> > > Anatoly.
> > >
> > >
> > > On Mon, Oct 24, 2011 at 10:17 PM, Darius Buntinas <
> buntinas at mcs.anl.gov> wrote:
> > >
> > > In MPI_Init, the signal handler should be installed, so SIGUSR1
> shouldn't kill the process.
> > >
> > > Can you send us the configure line you used?
> > >
> > > -d
> > >
> > > On Oct 23, 2011, at 1:54 AM, Anatoly G wrote:
> > >
> > > > Sorry, I"m still don't understand.
> > > > When remote process fails, rest of processes get SIGUSR1, and by
> default are failed, because they don't have any signal handler.
> > > > If I"ll create signal handler for SIGUSR1, I can't detect that one
> of remote/local processes dead. How can I recognize which remote process
> dead. Signal has only local host process information.
> > > >
> > > > Anatoly.
> > > >
> > > >
> > > > On Mon, Oct 17, 2011 at 7:40 PM, Darius Buntinas <
> buntinas at mcs.anl.gov> wrote:
> > > >
> > > > On Oct 15, 2011, at 4:47 AM, Pavan Balaji wrote:
> > > >
> > > > >
> > > > > On 10/11/2011 02:35 PM, Darius Buntinas wrote:
> > > > >> I took a look at your code.  Mpiexec will send a SIGUSR1 signal to
> > > > >> each process to notify it of a failed process (Oops, I forgot
> about
> > > > >> that when I responded to your previous email).  If you need a
> signal
> > > > >> for your application, you'll need to choose another one.  The
> signal
> > > > >> handler you installed replaced MPICH's signal handler, so the
> library
> > > > >> wasn't able to detect that the process had failed.
> > > > >
> > > > > Anatoly: In stacked libraries, you are supposed to chain signal
> handlers. Replacing another library's signal handlers can lead to
> unexpected behavior.
> > > >
> > > > If you set the signal handler before calling MPI_Init, MPICH will
> chain your signal handler.
> > > >
> > > > >
> > > > >> Another problem is that MPI_Abort() isn't killing all processes,
> so
> > > > >> when I commented out CreateOwnSignalHandler(), the master detected
> > > > >> the failure and called MPI_Abort(), but some slave processes were
> > > > >> still hanging in MPI_Barrier().  We'll need to fix that.
> > > > >
> > > > > Darius: What's the expected behavior here? Should a regular exit
> look at whether the user asked for a cleanup or not, and an abort kill all
> processes?
> > > >
> > > > That's what I think it should do.  MPI_Abort should kill all
> processes in the specified communicator.  If you can't kill only the
> processes in the communicator, then it should kill all connected processes
> (i.e., the job, plus any dynamic procs).
> > > >
> > > > -d
> > > >
> > > > > -- Pavan
> > > > >
> > > > > --
> > > > > Pavan Balaji
> > > > > http://www.mcs.anl.gov/~balaji
> > > >
>
-------------- next part --------------
student1-eth1:1
student2-eth1:3
student3-eth1:3
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_send_rec_testany.cpp
Type: text/x-c++src
Size: 15358 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111206/818c76a4/attachment-0001.cpp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_test_incl.h
Type: text/x-chdr
Size: 7924 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111206/818c76a4/attachment-0001.h>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: success_res_test_r0.log
Type: application/octet-stream
Size: 1434 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111206/818c76a4/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not_success_res_test_r0.log
Type: application/octet-stream
Size: 117 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111206/818c76a4/attachment-0003.obj>


More information about the mpich-discuss mailing list