[mpich-discuss] Fault tolerance - not stable.

Darius Buntinas buntinas at mcs.anl.gov
Mon Jan 16 11:15:50 CST 2012


Can you print out the error?  E.g.:

        char errstr[MPI_MAX_ERROR_STRING];
        int len = 0;
        int mpi_errno;

        mpi_errno = MPI_Error_string(statuses[i].MPI_ERROR, errstr, &len);
        assert(mpi_errno == MPI_SUCCESS);
        printf("error string: %s\n", errstr);

-d


On Jan 12, 2012, at 3:26 AM, Anatoly G wrote:

> Ok.
> I'll send it again.
> In the file mpi_rcv_waitany_1__r18.log I get the same print
> ==========================================
> W_MpiSendSync error :
> ------------------------------------------
> ErrorCode:    0x18C1AE0F
> ==========================================
> 63286 times. Original file size ~ 18 MB.
> 
> So I deleted a lot of these prints to bring the file down to a suitable size.
> 
> Anatoly.
> 
> On Wed, Jan 11, 2012 at 7:58 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>> 
>> Anatoly,
>> 
>> I didn't get the attachments.  Can you try again?
>> 
>> -d
>> 
>> On Jan 11, 2012, at 5:03 AM, Anatoly G wrote:
>> 
>>> Hi Darius.
>>> Thank you for your response.
>>> I changed the code according to your suggestion.
>>> Results:
>>> Sometimes a process fails as expected, but one more process fails unexpectedly.
>>> I attach code & logs.
>>> Execution command:
>>> mpiexec.hydra -genvall -disable-auto-cleanup -f machines_student.txt -n 24 -launcher=rsh mpi_rcv_waitany 100000 1000000 3 5 1 logs/mpi_rcv_waitany_1_
>>> 
>>>       • I run 24 processes on 3 computers.
>>>       • 100000 iterations
>>>       • Expected fail of process 3 on iteration 5.
>>>       • In addition I got a failure of process 18. Its log, with its errors, is mpi_rcv_waitany_1__r18.log.
>>> Can you please review my code again and give me some tips to fix the application?
>>> 
>>> Anatoly.
>>> 
>>> 
>>> On Tue, Jan 10, 2012 at 9:32 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: Darius Buntinas <buntinas at mcs.anl.gov>
>>> Date: Tue, Jan 10, 2012 at 8:11 PM
>>> Subject: Re: [mpich-discuss] Fault tolerance - not stable.
>>> To: mpich-discuss at mcs.anl.gov
>>> 
>>> 
>>> 
>>> I took a look at mpi_rcv_waitany.cpp, and I found a couple of issues.  I'm not sure if this is the problem, but we should fix these first.
>>> 
>>> In Rcv_WaitAny(), in the while(true) loop, you do a waitany, but then you iterate over the requests and do a test.  I don't think this is what you want to do.  When waitany returns, mRcvRequests[slaveIdx] will be set to MPI_REQUEST_NULL, so the subsequent test will return MPI_SUCCESS, and you may not register a failure.
>>> 
>>> Also, if all requests have previously completed, and you call waitany, then it will return and set slaveIdx to MPI_UNDEFINED, so we need to consider that case.
>>> 
>>> Another issue is that you post a receive for the slave after it completes, but never wait on that request.  This is not allowed in MPI (but you can probably get away with this most of the time).
>>> 
>>> I _think_ what you want to do is this:
>>> 
>>>  while(mSlavesFinished < mSlaves) {
>>>        retErr = MPI_Waitany(mRcvRequests.size(), &*mRcvRequests.begin(), &slaveIdx, &status);
>>>        slaveRank = slaveIdx + 1;
>>> 
>>>        if  (retErr != MPI_SUCCESS) {
>>>            char Msg[256];
>>>            sprintf(Msg, "From rank %d, fail - request deallocated", slaveRank);
>>>            handleMPIerror(mFpLog, Msg, retErr, &status);
>>>            mRcvRequests[slaveIdx] = MPI_REQUEST_NULL;
>>>            mIsSlaveLives[slaveIdx] = 0;
>>>            ++mSlavesFinished;
>>>            continue;
>>>        }
>>> 
>>>        /* if all requests have been completed, we should have exited the loop already */
>>>        assert(slaveIdx != MPI_UNDEFINED);
>>>        /* if the slave is dead, we should not be able to receive a message from it */
>>>        assert(mIsSlaveLives[slaveIdx]);
>>> 
>>>        ++mSlavesRcvIters[slaveIdx];
>>>        if(mSlavesRcvIters[slaveIdx] == nIters) {
>>>            ++mSlavesFinished;
>>>            fprintf(mFpLog, "\n\nFrom rank %d, Got number = %d\n ", slaveRank, mRcvNumsBuf[slaveIdx]);
>>>            fprintf(mFpLog, "Slave %d finished\n\n", slaveIdx+1);
>>>        } else {
>>>            MPI_Irecv(&(mRcvNumsBuf[slaveIdx]), 1, MPI::INT, slaveRank, MPI::ANY_TAG, MPI_COMM_WORLD, &(mRcvRequests[slaveIdx]));
>>>        }
>>>    }
>>> 
>>> Give this a try and see how it works.
>>> 
>>> -d
>>> 
>>> 
>>> 
>>> On Jan 10, 2012, at 12:50 AM, Anatoly G wrote:
>>> 
>>>> Dear mpich-discuss,
>>>> I have a problem while using the fault tolerance feature of the MPICH2 hydra process manager.
>>>> The results are not consistent: sometimes the tests pass, sometimes they stall.
>>>> If you execute the command line written below in a loop, the test stalls after a number of iterations.
>>>> Can you please help me with this problem?
>>>> 
>>>> There are 3 tests. All 3 tests use the same model: a master with a number of slaves. Communication operations are point-to-point.
>>>> 
>>>> The slave algorithm is the same for all 3 tests:
>>>> for N times:
>>>>     MPI_Send integer to master.
>>>>     if iter == IterI (parameter) && rank == fail_rank
>>>>         cause a divide-by-zero exception (A = 5.0; B = 0.0;  C = A / B;)
>>>> MPI_Recv(master)
>>>> 
>>>> Master algorithm, Test1 (mpi_send_rcv_waitany.cpp):
>>>>       • For each slave, call MPI_Irecv.
>>>>       • While fewer than N messages have been received from each slave:
>>>>       •       MPI_Waitany(slaveIdx)
>>>>       •       if slaveIdx is alive
>>>>       •          MPI_Irecv(slaveIdx)
>>>>       •       else
>>>>       •          mark it as finished
>>>>       • MPI_Send to all slaves.
>>>> 
>>>> Master algorithm, Test2 (mpi_send_sync.cpp):
>>>>       • slave = first slave
>>>>       • While fewer than N messages have been received from each slave:
>>>>       •       MPI_Recv(slave)
>>>>       •       if the slave is alive
>>>>       •          pass to the next live slave
>>>>       •       else
>>>>       •          mark it as finished
>>>>       • MPI_Send to all slaves.
>>>> 
>>>> Master algorithm, Test3 (mpi_send_async.cpp):
>>>> Same as Test2, but instead of MPI_Recv I use MPI_Irecv + MPI_Wait.
>>>> 
>>>> When the test stalls, I connect a debugger to the master process.
>>>> The process stalls in MPI_Recv or MPI_Irecv.
>>>> I think the stall is caused by the following sequence:
>>>>       • Master receives an integer from a slave.
>>>>       • Tests the slave - it's OK.
>>>>       • Slave fails.
>>>>       • Master tries to perform MPI_Irecv or MPI_Recv on the failed slave.
>>>> The problem happens both on a cluster (student_machines.txt) & on a single machine (machine_student1.txt).
>>>> 
>>>> Execution lines:
>>>>       • /space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f machine_student1.txt  -n 8  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
>>>>       • /space/local/hydra/bin/mpiexec.hydra  -genvall  -disable-auto-cleanup  -f student_machines.txt  -n 12  -launcher=rsh mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
>>>> The test performs 100000 iterations between the master and each slave.
>>>> 1000000 is a scale number to distinguish between the sequences of integers of the master & each slave.
>>>> 3 - rank of the process that causes the failure (fail_rank).
>>>> 10 - fail iteration. On iteration 10 the process with rank 3 will cause a divide-by-zero exception.
>>>> 1 logs/mpi_rcv_waitany_it_9/res_     defines the log file.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> <machine_student1.txt><machines_student.txt><mpi_rcv_waitany.cpp><mpi_send_async.cpp><mpi_send_sync.cpp><mpi_test_incl.h>
>>>> _______________________________________________
>>>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>> 
>>> 
>>> 
>>> 
>> 
> <mpi_test_incl.h><mpi_rcv_waitany.cpp><mpi_rcv_waitany_1__r0.log><mpi_rcv_waitany_1__r18.log>


