[mpich-discuss] Fault tolerance - not stable.

Wed Jan 11 07:03:40 CST 2012

>
> Hi Darius.
> Thank you for your response.
> I changed the code according to your suggestion.
> Results:
> Sometimes the expected process fails, and one more process fails
> unexpectedly.
> I have attached the code and logs.
> Execution command:
> mpiexec.hydra -genvall -disable-auto-cleanup -f machines_student.txt -n 24
> -launcher=rsh mpi_rcv_waitany 100000 1000000 3 5 1 logs/mpi_rcv_waitany_1_
>
>
>    - I ran 24 processes on 3 computers.
>    - 100000 iterations.
>    - Process 3 was expected to fail on iteration 5.
>    - In addition, *process 18* failed. Its log, with its errors, is
>    mpi_rcv_waitany_1__r18.log
>
> Could you please review my code again and give me some tips for fixing
> the application?
>
> Anatoly.
>
>
> On Tue, Jan 10, 2012 at 9:32 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
>
>>
>>
>> ---------- Forwarded message ----------
>> From: Darius Buntinas <buntinas at mcs.anl.gov>
>> Date: Tue, Jan 10, 2012 at 8:11 PM
>> Subject: Re: [mpich-discuss] Fault tolerance - not stable.
>> To: mpich-discuss at mcs.anl.gov
>>
>>
>>
>> I took a look at mpi_rcv_waitany.cpp, and I found a couple of issues.
>>  I'm not sure if this is the problem, but we should fix these first.
>>
>> In Rcv_WaitAny(), in the while(true) loop, you do a waitany, but then you
>> iterate over the requests and do a test.  I don't think this is what you
>> want to do.  When waitany returns, mRcvRequests[slaveIdx] will be set to
>> MPI_REQUEST_NULL, so the subsequent test will return MPI_SUCCESS, and you
>> may not register a failure.
>>
>> Also, if all requests have previously completed, and you call waitany,
>> then it will return and set slaveIdx to MPI_UNDEFINED, so we need to
>> consider that case.
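
The two points above can be illustrated with a toy model that does not call MPI at all. This is only a sketch: Req, kUndefined, kWouldBlock, toyWaitany and toyTestCompleted are hypothetical names that imitate the documented behaviour of MPI_Waitany and MPI_Test, not the library itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for request states; hypothetical names, not part of MPI.
enum class Req { Null, Pending, Done };
const int kUndefined = -1;   // plays the role of MPI_UNDEFINED
const int kWouldBlock = -2;  // a real MPI_Waitany would block here instead

// Imitates MPI_Waitany on toy requests: consumes one completed request by
// resetting it to Null and returns its index. Returns kUndefined when every
// entry is already Null, and kWouldBlock when only Pending entries remain.
int toyWaitany(std::vector<Req>& reqs) {
    bool anyPending = false;
    for (std::size_t i = 0; i < reqs.size(); ++i) {
        if (reqs[i] == Req::Done) {
            reqs[i] = Req::Null;
            return static_cast<int>(i);
        }
        if (reqs[i] == Req::Pending) anyPending = true;
    }
    return anyPending ? kWouldBlock : kUndefined;
}

// Imitates MPI_Test on one toy request: a Null request reports "complete"
// without any error, so a test issued after waitany cannot see a failure.
bool toyTestCompleted(const Req& r) {
    return r == Req::Null || r == Req::Done;
}
```

After toyWaitany returns an index, the entry at that index is already Null, so toyTestCompleted on it always succeeds. That is the reason the error has to be taken from the waitany return value itself.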
>>
>> Another issue is that you post a receive for the slave after it
>> completes, but never wait on that request.  This is not allowed in MPI (but
>> you can probably get away with this most of the time).
>>
>> I _think_ what you want to do is this:
>>
>>  while (mSlavesFinished < mSlaves) {
>>        retErr = MPI_Waitany(mRcvRequests.size(), &*mRcvRequests.begin(),
>>                             &slaveIdx, &status);
>>        slaveRank = slaveIdx + 1;
>>
>>        if (retErr != MPI_SUCCESS) {
>>            /* the receive from this slave failed: treat the slave as dead */
>>            char Msg[256];
>>            snprintf(Msg, sizeof(Msg),
>>                     "From rank %d, fail - request deallocated", slaveRank);
>>            handleMPIerror(mFpLog, Msg, retErr, &status);
>>            mRcvRequests[slaveIdx] = MPI_REQUEST_NULL;
>>            mIsSlaveLives[slaveIdx] = 0;
>>            ++mSlavesFinished;
>>            continue;
>>        }
>>
>>        /* if all requests have been completed, we should have exited the
>>           loop already */
>>        assert(slaveIdx != MPI_UNDEFINED);
>>        /* if the slave is dead, we should not be able to receive a
>>           message from it */
>>        assert(mIsSlaveLives[slaveIdx]);
>>
>>        ++mSlavesRcvIters[slaveIdx];
>>        if (mSlavesRcvIters[slaveIdx] == nIters) {
>>            ++mSlavesFinished;
>>            fprintf(mFpLog, "\n\nFrom rank %d, Got number = %d\n ",
>>                    slaveRank, mRcvNumsBuf[slaveIdx]);
>>            fprintf(mFpLog, "Slave %d finished\n\n", slaveIdx+1);
>>        } else {
>>            MPI_Irecv(&(mRcvNumsBuf[slaveIdx]), 1, MPI_INT, slaveRank,
>>                      MPI_ANY_TAG, MPI_COMM_WORLD, &(mRcvRequests[slaveIdx]));
>>        }
>>    }
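
The bookkeeping in the loop above can be checked without a cluster by separating it from the MPI calls. The sketch below is only a model under assumptions: Completion and MasterState are hypothetical names, and each Completion stands in for one MPI_Waitany result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one MPI_Waitany result: which slave slot
// completed and whether the call returned an error for it.
struct Completion {
    int slaveIdx;
    bool failed;
};

// Bookkeeping only, mirroring the loop above; no MPI calls are made.
struct MasterState {
    int nIters;
    int finished = 0;
    std::vector<int> alive;           // 1 while the slave is believed alive
    std::vector<int> rcvIters;        // messages received per slave
    std::vector<int> hasPendingRecv;  // 1 while a receive is outstanding

    MasterState(std::size_t nSlaves, int iters)
        : nIters(iters),
          alive(nSlaves, 1),
          rcvIters(nSlaves, 0),
          hasPendingRecv(nSlaves, 1) {}  // one initial receive per slave

    bool done() const { return finished >= static_cast<int>(alive.size()); }

    void handle(const Completion& c) {
        hasPendingRecv[c.slaveIdx] = 0;  // waitany consumed this request
        if (c.failed) {
            alive[c.slaveIdx] = 0;       // never post to this slave again
            ++finished;
            return;
        }
        if (++rcvIters[c.slaveIdx] == nIters) {
            ++finished;                  // last message: do not re-post
        } else {
            hasPendingRecv[c.slaveIdx] = 1;  // where MPI_Irecv is re-posted
        }
    }
};
```

The property worth checking is that when done() becomes true, no slot still has a pending receive, so no request is left that was posted but never waited on.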
>>
>> Give this a try and see how it works.
>>
>> -d
>>
>>
>>
>> On Jan 10, 2012, at 12:50 AM, Anatoly G wrote:
>>
>> > Dear mpich-discuss,
>> > I have a problem using the fault tolerance feature of the MPICH2 Hydra
>> > process manager.
>> > The results are not consistent: sometimes the tests pass, sometimes they
>> > stall.
>> > If you execute the command lines below in a loop, the test stalls after
>> > some number of iterations.
>> > Can you please help me with this problem?
>> >
>> > There are 3 tests. All 3 use the same model: one master with a number of
>> > slaves. All communication is point-to-point.
>> >
>> > The slave algorithm is the same for all 3 tests.
>> > for N times:
>> >     MPI_Send integer to master.
>> >     if iteration == fail iteration (parameter) && rank == fail_rank
>> >         cause divide by zero exception. (A = 5.0; B = 0.0;  C = A / B;)
>> > MPI_Recv(master)
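
One thing that may be worth checking in the fault injection: on platforms with IEEE 754 doubles and the default floating-point environment, a floating-point division such as A / B with B = 0.0 usually yields infinity instead of terminating the process. Whether the slave really dies at that point depends on how the real code enables traps, which the pseudocode does not show. The sketch below uses hypothetical helper names and only checks what the division does on the current platform.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Divides through volatile operands so the compiler cannot fold the result
// at compile time.
double divideUnoptimized(double a, double b) {
    volatile double x = a;
    volatile double y = b;
    return x / y;
}

// Returns true when doubles are IEEE 754 and 5.0 / 0.0 produced +infinity,
// meaning the process survived the division instead of being killed.
bool divideByZeroSurvives() {
    if (!std::numeric_limits<double>::is_iec559) return false;
    double c = divideUnoptimized(5.0, 0.0);
    return std::isinf(c) && c > 0.0;
}
```

If divideByZeroSurvives() returns true on the test machines, a call such as std::abort() is probably a more predictable way to kill the chosen rank.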
>> >
>> > Master algorithm, Test1 (mpi_send_rcv_waitany.cpp):
>> >       • For each slave, call MPI_Irecv
>> >       • while not all slaves have delivered N messages
>> >       •       MPI_Waitany(slaveIdx)
>> >       •       if slaveIdx is alive
>> >       •          MPI_Irecv(slaveIdx)
>> >       •       else
>> >       •          mark it as finished.
>> >       • MPI_Send to all slaves.
>> >
>> > Master algorithm, Test2 (mpi_send_sync.cpp):
>> >       • slave = first slave
>> >       • while not all slaves have delivered N messages
>> >       •       MPI_Recv(slave)
>> >       •       if slave is alive
>> >       •          pass to the next live slave
>> >       •       else
>> >       •          mark it as finished.
>> >       • MPI_Send to all slaves.
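
The "pass to next live slave" step in Test2 can be sketched as a small round-robin helper. This assumes the liveness flags are kept in a vector indexed by slave; nextLiveSlave is a hypothetical name and is not taken from the attached sources.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the next slave after `current` (wrapping around)
// whose alive flag is nonzero, or -1 if no slave is alive.
// `current` itself is tried last. Assumes 0 <= current < alive.size().
int nextLiveSlave(const std::vector<int>& alive, int current) {
    const int n = static_cast<int>(alive.size());
    for (int step = 1; step <= n; ++step) {
        int idx = (current + step) % n;
        if (alive[idx]) return idx;
    }
    return -1;
}
```

Returning -1 gives the master an explicit signal to leave the receive loop when every slave has failed, instead of posting another MPI_Recv to a dead rank.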
>> >
>> > Master algorithm Test3 (mpi_send_async.cpp) :
>> > Same as test2, but instead of MPI_Recv, I use MPI_Irecv + MPI_Wait
>> >
>> > When a test stalls, I attach a debugger to the master process.
>> > The process is stalled in MPI_Recv or MPI_Irecv.
>> > I think the stall is caused by the following sequence:
>> >       • The master receives an integer from a slave.
>> >       • The master tests the slave: it is OK.
>> >       • The slave fails.
>> >       • The master tries to perform MPI_Irecv or MPI_Recv on the failed slave.
>> > The problem happens both on the cluster (student_machines.txt) and on a
>> > single machine (machine_student1.txt).
>> >
>> > Execution lines:
>> >       • /space/local/hydra/bin/mpiexec.hydra  -genvall
>>  -disable-auto-cleanup  -f machine_student1.txt  -n 8  -launcher=rsh
>> mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
>> >       • /space/local/hydra/bin/mpiexec.hydra  -genvall
>>  -disable-auto-cleanup  -f student_machines.txt  -n 12  -launcher=rsh
>> mpi_rcv_waitany 100000 1000000 3 10 1 logs/mpi_rcv_waitany_it_9/res_
>> > 100000 - number of iterations the master performs with each slave.
>> > 1000000 - scale number used to distinguish between the sequences of
>> > integers exchanged between the master and each slave.
>> > 3 - rank of the process that fails (fail_rank).
>> > 10 - fail iteration. On iteration 10, the process with rank 3 causes a
>> > divide by zero exception.
>> > 1 logs/mpi_rcv_waitany_it_9/res_ - defines the log file.
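
For reference, the six positional arguments described above could be read into a small struct like this. It is a sketch only: TestArgs, parseTestArgs and the field names are hypothetical, and the meaning of the "1" before the log prefix is not stated in the message, so it is kept as an uninterpreted string.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Field names are hypothetical; only the order of the values and the meaning
// of the first four come from the description above.
struct TestArgs {
    long iterations;        // e.g. 100000
    long scale;             // e.g. 1000000
    int failRank;           // e.g. 3
    long failIteration;     // e.g. 10
    std::string logFlag;    // the "1"; meaning not stated, left uninterpreted
    std::string logPrefix;  // e.g. logs/mpi_rcv_waitany_it_9/res_
};

// Parses the arguments that follow the program name on the command line.
// Throws std::invalid_argument on a wrong count or a non-numeric value.
TestArgs parseTestArgs(const std::vector<std::string>& args) {
    if (args.size() != 6) {
        throw std::invalid_argument("expected 6 arguments");
    }
    TestArgs t;
    t.iterations = std::stol(args[0]);
    t.scale = std::stol(args[1]);
    t.failRank = std::stoi(args[2]);
    t.failIteration = std::stol(args[3]);
    t.logFlag = args[4];
    t.logPrefix = args[5];
    return t;
}
```

Rejecting a wrong argument count up front keeps a mistyped command line from silently running with a different fail_rank or fail iteration.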
>> >
>> > <machine_student1.txt><machines_student.txt><mpi_rcv_waitany.cpp><mpi_send_async.cpp><mpi_send_sync.cpp><mpi_test_incl.h>
>> > _______________________________________________
>> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> > To manage subscription options or unsubscribe:
>> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>