[mpich-discuss] Fault tolerance problem.
Anatoly G
anatolyrishon at gmail.com
Wed Dec 28 05:25:18 CST 2011
Dear mpich-discuss,
I have a problem with the fault tolerance feature of the MPICH2 Hydra
process manager.
The results are not consistent: sometimes the tests pass, sometimes they
stall. If you execute the command line written below in a loop, after a
number of iterations the test stalls.
Can you please help me with this problem?
There are 3 tests. All 3 tests use the same model: a master with a number
of slaves. All communication operations are point-to-point.
*The slave algorithm is the same for all 3 tests:*
for N times:
    MPI_Send an integer to the master
    if iteration == IterI (parameter) && rank == fail_rank:
        cause a divide-by-zero exception (A = 5.0; B = 0.0; C = A / B;)
    MPI_Recv from the master
*Master algorithm Test1 (mpi_send_rcv_waitany.cpp):*
- For each slave, call MPI_Irecv.
- While not all N messages have been received from each slave:
    - MPI_Waitany returns slaveIdx.
    - If slave slaveIdx is alive:
        - MPI_Irecv(slaveIdx)
    - Else:
        - Mark it as finished.
- MPI_Send to all slaves.
*Master algorithm Test2 (mpi_send_sync.cpp):*
- slave = first slave
- While not all N messages have been received from each slave:
    - MPI_Recv(slave)
    - If slave is alive:
        - Pass to the next live slave.
    - Else:
        - Mark it as finished.
- MPI_Send to all slaves.
*Master algorithm Test3 (mpi_send_async.cpp):*
Same as Test2, but instead of MPI_Recv I use MPI_Irecv + MPI_Wait.
When a test stalls, I attach a debugger to the master process. The process
is stalled in MPI_Recv or MPI_Irecv.
I think the stall is caused by the following sequence:
- The master receives an integer from a slave.
- It tests the slave: it is still alive.
- The slave then fails.
- The master performs MPI_Irecv or MPI_Recv on the now-failed slave.
The problem happens both on the cluster (student_machines.txt) and on a
single machine (machine_student1.txt).
Execution lines:
- /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup
-f machine_student1.txt -n 8 -launcher=rsh mpi_rcv_waitany 100000
1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
- /space/local/hydra/bin/mpiexec.hydra -genvall -disable-auto-cleanup
-f student_machines.txt -n 12 -launcher=rsh mpi_rcv_waitany 100000
1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
The test performs 100000 iterations between the master and each slave.
1000000 is a scale factor used to distinguish between the sequences of
integers exchanged with each slave.
3 is the rank of the process that causes the failure (fail_rank).
10 is the fail iteration: on iteration 10, the process with rank 3 causes a
divide-by-zero exception.
1 logs/mpi_rcv_waitany_it_9/res_ defines the log file.
machine_student1.txt:
student1-eth1

student_machines.txt:
student1-eth1
student2-eth1
student3-eth1
Attachments:
- mpi_rcv_waitany.cpp (4521 bytes): <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111228/63b1c202/attachment-0003.cpp>
- mpi_send_async.cpp (3462 bytes): <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111228/63b1c202/attachment-0004.cpp>
- mpi_send_sync.cpp (3485 bytes): <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111228/63b1c202/attachment-0005.cpp>
- mpi_test_incl.h (9860 bytes): <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111228/63b1c202/attachment-0001.h>