[mpich-discuss] Fault tolerance problem.

Anatoly G anatolyrishon at gmail.com
Sun Jan 8 06:04:33 CST 2012


Dear mpich-discuss,
I have a problem using the fault tolerance feature of the MPICH2 Hydra
process manager.
The results are not consistent: sometimes the tests pass, sometimes they
stall. If you execute one of the command lines below in a loop, the test
stalls after a number of iterations.
Can you please help me with this problem?

There are 3 tests. All 3 tests use the same model: one master with a number
of slaves. Communication operations are point-to-point.
*The slave algorithm is the same for all 3 tests:*
for N times:
    MPI_Send integer to master
    if iteration == fail_iter (parameter) && rank == fail_rank
        cause a divide-by-zero exception (A = 5.0; B = 0.0; C = A / B;)
MPI_Recv(master)
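
In code, the slave loop looks roughly like this (a simplified sketch, not
the attached code: the constants stand in for the command-line parameters,
the master is assumed to be rank 0, and the feenableexcept call is added
here only so the floating-point division actually raises SIGFPE; without
it the division just yields inf):

#include <mpi.h>
#include <fenv.h>   // feenableexcept (glibc extension)

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 100000;      // iterations (first command-line argument)
    const int fail_rank = 3;   // rank that injects the fault
    const int fail_iter = 10;  // iteration at which the fault fires

    for (int i = 0; i < N; ++i) {
        int value = i;
        MPI_Send(&value, 1, MPI_INT, 0 /* master */, 0, MPI_COMM_WORLD);
        if (i == fail_iter && rank == fail_rank) {
            // Inject the failure. A floating-point divide by zero only
            // raises SIGFPE if trapping is enabled, hence feenableexcept.
            feenableexcept(FE_DIVBYZERO);
            volatile double A = 5.0, B = 0.0;
            volatile double C = A / B;   // process dies here
            (void)C;
        }
    }
    int done;
    MPI_Recv(&done, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
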
*Master algorithm Test1 (mpi_rcv_waitany.cpp):*

   - For each slave, call MPI_Irecv.
   - While the master has not received N messages from each slave:
      - MPI_Waitany(slaveIdx)
      - if slave slaveIdx is alive:
         - MPI_Irecv(slaveIdx)
      - else:
         - mark it as finished
   - MPI_Send to all slaves.
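
A simplified sketch of this loop (illustrative names, not the attached
mpi_rcv_waitany.cpp; it assumes slave s has rank s + 1 and that
MPI_ERRORS_RETURN is set on the communicator, so a failed slave is
reported through the return code instead of aborting the job):

#include <mpi.h>
#include <vector>

void master_waitany(int nslaves, int N) {
    // Without this, a slave failure aborts the whole job instead of
    // being reported through the return code.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    std::vector<MPI_Request> reqs(nslaves);
    std::vector<int> bufs(nslaves);
    std::vector<int> received(nslaves, 0);

    for (int s = 0; s < nslaves; ++s)   // one outstanding receive per slave
        MPI_Irecv(&bufs[s], 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD, &reqs[s]);

    int active = nslaves;
    while (active > 0) {
        int idx;
        MPI_Status st;
        int rc = MPI_Waitany(nslaves, reqs.data(), &idx, &st);
        if (rc == MPI_SUCCESS && ++received[idx] < N) {
            // Slave is alive and not yet done: repost the receive.
            MPI_Irecv(&bufs[idx], 1, MPI_INT, idx + 1, 0,
                      MPI_COMM_WORLD, &reqs[idx]);
        } else {
            // Slave failed (rc != MPI_SUCCESS) or sent all N messages:
            // mark it as finished and stop posting receives for it.
            reqs[idx] = MPI_REQUEST_NULL;
            --active;
        }
    }
    for (int s = 0; s < nslaves; ++s) {
        // Release the slaves from their final MPI_Recv; a send to a dead
        // slave just returns an error here.
        int done = 1;
        MPI_Send(&done, 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD);
    }
}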


*Master algorithm Test2 (mpi_send_sync.cpp):*

   - slave = first slave
   - While the master has not received N messages from each slave:
      - MPI_Recv(slave)
      - if slave is alive:
         - pass to the next live slave
      - else:
         - mark it as finished
   - MPI_Send to all slaves.
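
A simplified sketch of this loop, under the same assumptions as the Test1
sketch above (slave s has rank s + 1, MPI_ERRORS_RETURN set):

#include <mpi.h>
#include <vector>

void master_sync(int nslaves, int N) {
    std::vector<int> received(nslaves, 0);
    std::vector<bool> alive(nslaves, true);

    int active = nslaves;
    int s = 0;                             // start with the first slave
    while (active > 0) {
        if (alive[s] && received[s] < N) {
            int value;
            MPI_Status st;
            int rc = MPI_Recv(&value, 1, MPI_INT, s + 1, 0,
                              MPI_COMM_WORLD, &st);
            if (rc != MPI_SUCCESS) {       // slave failed: mark it as finished
                alive[s] = false;
                --active;
            } else if (++received[s] == N) {
                --active;                  // this slave has sent all N messages
            }
        }
        s = (s + 1) % nslaves;             // pass to the next slave
    }
    // ... MPI_Send to all surviving slaves, as in Test1 ...
}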


*Master algorithm Test3 (mpi_send_async.cpp):*
Same as Test2, but instead of MPI_Recv I use MPI_Irecv + MPI_Wait.
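
That is, in the Test2 sketch above, the MPI_Recv call becomes:

MPI_Request req;
MPI_Irecv(&value, 1, MPI_INT, s + 1, 0, MPI_COMM_WORLD, &req);
int rc = MPI_Wait(&req, &st);   // rc != MPI_SUCCESS if the slave died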

When a test stalls, I connect a debugger to the master process.
The process is stalled in MPI_Recv or MPI_Irecv.
I think the stall is caused by the following sequence:

   - The master receives an integer from a slave.
   - The master tests the slave - it is OK.
   - The slave fails.
   - The master tries to perform MPI_Irecv or MPI_Recv on the failed slave.

The problem happens both on a cluster (student_machines.txt) and on a single
machine (machine_student1.txt).

Execution lines:

   - /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup
     -f machine_student1.txt -n 8 -launcher=rsh mpi_rcv_waitany 100000
     1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
   - /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup
     -f student_machines.txt -n 12 -launcher=rsh mpi_rcv_waitany 100000
     1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_

The test performs 100000 iterations between the master and each slave.
1000000 is a scale factor used to distinguish between the sequences of
integers exchanged with each slave.
3 is the rank of the process that causes the failure (fail_rank).
10 is the fail iteration: on iteration 10, the process with rank 3 causes the
divide-by-zero exception.
1 logs/mpi_rcv_waitany_it_9/res_ defines the log file.

machine_student1.txt (single machine):
student1-eth1

student_machines.txt (cluster):
student1-eth1
student2-eth1
student3-eth1

Attachments:

   - mpi_rcv_waitany.cpp (text/x-c++src, 4521 bytes):
     <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120108/9a79afbf/attachment-0003.cpp>
   - mpi_send_async.cpp (text/x-c++src, 3462 bytes):
     <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120108/9a79afbf/attachment-0004.cpp>
   - mpi_send_sync.cpp (text/x-c++src, 3485 bytes):
     <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120108/9a79afbf/attachment-0005.cpp>
   - mpi_test_incl.h (text/x-chdr, 9860 bytes):
     <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120108/9a79afbf/attachment-0001.h>

