[mpich-discuss] Fault tolerance - not stable.

Darius Buntinas buntinas at mcs.anl.gov
Tue Jan 10 12:11:34 CST 2012


I took a look at mpi_rcv_waitany.cpp and found a couple of issues.  I'm not sure whether they are what causes the hang, but we should fix them first.

In Rcv_WaitAny(), in the while(true) loop, you do a waitany, but then you iterate over the requests and do a test on each one.  I don't think this is what you want.  When waitany returns, it sets mRcvRequests[slaveIdx] to MPI_REQUEST_NULL, so the subsequent test on that entry completes immediately with MPI_SUCCESS, and you may never register the failure.

Also, if all requests have previously completed and you call waitany, it will return immediately and set slaveIdx to MPI_UNDEFINED, so we need to handle that case.

Another issue is that you post a receive for a slave after it completes, but never wait on that request.  Leaving a request outstanding at finalize is not allowed in MPI (though you can probably get away with it most of the time).

I _think_ what you want to do is this:

  while (mSlavesFinished < mSlaves) {
      retErr = MPI_Waitany(mRcvRequests.size(), &mRcvRequests[0], &slaveIdx, &status);
      slaveRank = slaveIdx + 1;

      if (retErr != MPI_SUCCESS) {
          char Msg[256];
          snprintf(Msg, sizeof(Msg), "From rank %d, fail - request deallocated", slaveRank);
          handleMPIerror(mFpLog, Msg, retErr, &status);
          mRcvRequests[slaveIdx] = MPI_REQUEST_NULL;
          mIsSlaveLives[slaveIdx] = 0;
          ++mSlavesFinished;
          continue;
      }

      /* if all requests had completed, we would have exited the loop already */
      assert(slaveIdx != MPI_UNDEFINED);
      /* if the slave is dead, we should not be able to receive a message from it */
      assert(mIsSlaveLives[slaveIdx]);

      ++mSlavesRcvIters[slaveIdx];
      if (mSlavesRcvIters[slaveIdx] == nIters) {
          ++mSlavesFinished;
          fprintf(mFpLog, "\n\nFrom rank %d, Got number = %d\n", slaveRank, mRcvNumsBuf[slaveIdx]);
          fprintf(mFpLog, "Slave %d finished\n\n", slaveRank);
      } else {
          MPI_Irecv(&mRcvNumsBuf[slaveIdx], 1, MPI_INT, slaveRank, MPI_ANY_TAG,
                    MPI_COMM_WORLD, &mRcvRequests[slaveIdx]);
      }
  }

Give this a try and see how it works.

-d



On Jan 10, 2012, at 12:50 AM, Anatoly G wrote:

> Dear mpich-discuss,
> I have a problem using the fault tolerance feature with the MPICH2 Hydra process manager.
> The results are not consistent: sometimes the tests pass, sometimes they stall.
> If you execute the command line below in a loop, the test stalls after a number of iterations.
> Can you please help me with this problem?
> 
> There are 3 tests. All 3 tests use the same model: a master with a number of slaves. All communication operations are point-to-point.
> 
> The slave algorithm is the same for all 3 tests:
> for N times:
>     MPI_Send integer to master.
>     if iteration == IterI (parameter) && rank == fail_rank
>         cause a divide-by-zero exception. (A = 5.0; B = 0.0; C = A / B;)
> MPI_Recv(master)
> 
> Master algorithm, Test1 (mpi_send_rcv_waitany.cpp):
> 	• For each slave, call MPI_Irecv.
> 	• While not all N messages from each slave have been received:
> 	•       MPI_Waitany(slaveIdx)
> 	•       if slave slaveIdx is alive
> 	•          MPI_Irecv(slaveIdx)
> 	•       else
> 	•          mark it as finished.
> 	• MPI_Send to all slaves.
> 
> Master algorithm, Test2 (mpi_send_sync.cpp):
> 	• slave = first slave
> 	• While not all N messages from each slave have been received:
> 	•       MPI_Recv(slave)
> 	•       if the slave is alive
> 	•          pass to the next live slave
> 	•       else
> 	•          mark it as finished.
> 	• MPI_Send to all slaves.
> 
> Master algorithm, Test3 (mpi_send_async.cpp):
> Same as Test2, but instead of MPI_Recv I use MPI_Irecv + MPI_Wait.
> 
> When a test stalls, I connect a debugger to the master process.
> The process is stalled in MPI_Recv or MPI_Irecv.
> I think the stall is caused by the following sequence:
> 	• The master receives an integer from a slave.
> 	• It tests the slave - it's OK.
> 	• The slave fails.
> 	• The master tries to perform MPI_Irecv or MPI_Recv on the failed slave.
> The problem happens both on a cluster (student_machines.txt) and on a single machine (machine_student1.txt).
> 
> Execution lines:
> 	• /space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f machine_student1.txt  -n 8  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
> 	• /space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f student_machines.txt  -n 12  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
> The test performs 100000 iterations between the master and each slave.
> 1000000 is a scale factor used to distinguish between the integer sequences exchanged between the master and each slave.
> 3 - rank of the process that will fail (fail_rank).
> 10 - fail iteration. On iteration 10, the process with rank 3 causes a divide-by-zero exception.
> 1 logs/mpi_rcv_waitany_it_9/res_ defines the log file.
> 
> <machine_student1.txt><machines_student.txt><mpi_rcv_waitany.cpp><mpi_send_async.cpp><mpi_send_sync.cpp><mpi_test_incl.h>
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


