[mpich-discuss] Question/problem with fault tolerance of mpich2 hydra.

Anatoly G anatolyrishon at gmail.com
Mon Dec 19 05:06:54 CST 2011


Dear MPICH2 team,

I currently use *mpich2*.

*I have a problem with the fault tolerance feature:*
My program runs as a master/slaves application.
The master uses asynchronous MPI_Irecv operations and polls them with MPI_Test.
When one of the slaves fails, the whole application fails.
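
To make the intended behaviour concrete, here is a minimal sketch of the master-side bookkeeping described above. It is not the attached program: the names `Poll`, `SlaveState` and `poll_pass` are hypothetical, and the MPI call is abstracted behind a callback so that the sketch compiles without MPI. In a real program the callback would wrap MPI_Test on the slave's MPI_Irecv request and report a failure when the return code is not MPI_SUCCESS, which assumes MPI_ERRORS_RETURN is set on the communicator.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Outcome of testing one slave's outstanding receive (hypothetical type).
enum class Poll { Pending, Done, Failed };

// Per-slave bookkeeping kept by the master (hypothetical type).
struct SlaveState {
    bool alive = true;  // false once a receive from this slave has failed
    int received = 0;   // number of receives completed so far
};

// One polling pass over all slaves. `test_one(i)` stands in for the MPI part:
// in a real program it would call MPI_Test on slave i's MPI_Irecv request and
// return Failed when the return code is not MPI_SUCCESS, Done when the
// completion flag is set, and Pending otherwise.
// Returns the number of slaves still alive after the pass.
std::size_t poll_pass(std::vector<SlaveState>& slaves,
                      const std::function<Poll(std::size_t)>& test_one) {
    std::size_t alive = 0;
    for (std::size_t i = 0; i < slaves.size(); ++i) {
        if (!slaves[i].alive) continue;  // never poll a failed slave again
        switch (test_one(i)) {
            case Poll::Done:    ++slaves[i].received; break;    // caller re-posts the receive
            case Poll::Failed:  slaves[i].alive = false; break;  // drop the slave, keep running
            case Poll::Pending: break;                           // nothing arrived yet
        }
        if (slaves[i].alive) ++alive;
    }
    return alive;
}
```

The master would call `poll_pass` in a loop and keep working while the returned count is greater than zero. This only covers the application side; whether the process manager lets the remaining processes survive a failed slave is a separate matter.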

*Question:*
What should I do to keep the application alive when a slave fails?

*Mpich2 configuration*:

   $ ./configure --with-ftb=/space/local --prefix=/space/local

*Execution command:*

   $ mpiexec.hydra -genvall -f machines_student.txt -n 3 -launcher=rsh \
       mpi_send_rec_testany 1000 10000 2 20 1 logs/res_test


*Error handler:* MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

*Output on the screen:*

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 136
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0 at student1-ib0] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0 at student1-ib0] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at student1-ib0] main (./pm/pmiserv/pmip.c:225): demux engine error waiting for event
[mpiexec at student1-ib0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec at student1-ib0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at student1-ib0] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec at student1-ib0] main (./ui/mpich/mpiexec.c:420): process manager error waiting for completion


Remark:
When all slaves are alive, the application works fine.
-------------- next part --------------
student1-eth1
student2-eth1
student3-eth1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_send_rec_testany.cpp
Type: text/x-c++src
Size: 15358 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111219/1ec12380/attachment-0001.cpp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_test_incl.h
Type: text/x-chdr
Size: 7924 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111219/1ec12380/attachment-0001.h>