[mpich-discuss] Question/problem with fault tolerance of mpich2 hydra.
Anatoly G
anatolyrishon at gmail.com
Mon Dec 19 05:06:54 CST 2011
Dear MPICH2 team,
I am currently using *mpich2*.
*I have a problem with the fault tolerance feature:*
My program runs as a master/slaves application.
The master uses asynchronous MPI_Irecv operations and polls them with MPI_Test.
When one of the slaves fails, the whole application fails.
*Question:*
What should I do to keep the application alive?
*Mpich2 configuration*:
$ ./configure --with-ftb=/space/local --prefix=/space/local
*Execution command:*
mpiexec.hydra -genvall -f machines_student.txt -n 3 -launcher=rsh mpi_send_rec_testany 1000 10000 2 20 1 logs/res_test
*Error handler:* MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
*Output on the screen:*
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 136
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0@student1-ib0] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0@student1-ib0] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@student1-ib0] main (./pm/pmiserv/pmip.c:225): demux engine error waiting for event
[mpiexec@student1-ib0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@student1-ib0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@student1-ib0] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@student1-ib0] main (./ui/mpich/mpiexec.c:420): process manager error waiting for completion
Remark:
When all slaves are alive, the application works fine.
-------------- next part --------------
student1-eth1
student2-eth1
student3-eth1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_send_rec_testany.cpp
Type: text/x-c++src
Size: 15358 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111219/1ec12380/attachment-0001.cpp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_test_incl.h
Type: text/x-chdr
Size: 7924 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111219/1ec12380/attachment-0001.h>