[petsc-dev] Bug in MatShift_MPIAIJ ?

Barry Smith bsmith at mcs.anl.gov
Thu Oct 22 00:14:35 CDT 2015


  I have added this in the branch barry/add-mpiu_allreduce and made a pull request

  Thanks!

> On Oct 21, 2015, at 11:18 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
> 
> On 21/10/15 11:36 AM, Barry Smith wrote:
>> 
>>> On Oct 21, 2015, at 10:28 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>>> 
>>> On 21/10/15 10:56 AM, Barry Smith wrote:
>>>> 
>>>>   I similarly saw the same behavior in the debugger that you reported and was mystified.
>>>>   This is why I called it a nasty bug. The setting of different nonew flag meant that one process called the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
>>>> 
>>>> if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
>>>>     PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
>>>>     ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
>>>>   }
>>>> 
>>>> but the other process skipped this call. Thus the other process went on to the MPI_Allreduce() inside PetscValidLogicalCollectiveEnum(), and data was exchanged between two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() has no concept of tags, if one process skips an Allreduce call then the next Allreduce call gets matched with it.
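>>>> 
>>>> A minimal standalone sketch of this matching-by-order behavior (illustration only, not PETSc code; all names here are made up) could look like:
>>>> 
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> 
>>>> int main(int argc,char **argv)
>>>> {
>>>>   int rank;
>>>>   int state,nonzerostate = 0;   /* stands in for the nonzerostate reduction */
>>>>   int flag,flagsum = 0;         /* stands in for the validation reduction   */
>>>>   int later,latersum = 0;       /* some later, unrelated collective         */
>>>> 
>>>>   MPI_Init(&argc,&argv);
>>>>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>> 
>>>>   /* only process 0 takes this branch, like the nonew test above */
>>>>   state = 100 + rank;
>>>>   if (!rank) MPI_Allreduce(&state,&nonzerostate,1,MPI_INT,MPI_SUM,MPI_COMM_WORLD);
>>>> 
>>>>   /* every process reaches this reduction; on process 1 it pairs with the
>>>>      call above on process 0, so unrelated values get summed together     */
>>>>   flag = 1 + rank;
>>>>   MPI_Allreduce(&flag,&flagsum,1,MPI_INT,MPI_SUM,MPI_COMM_WORLD);
>>>> 
>>>>   /* only process 1 reaches this one; it pairs with process 0's flag reduction */
>>>>   later = 1000 + rank;
>>>>   if (rank == 1) MPI_Allreduce(&later,&latersum,1,MPI_INT,MPI_SUM,MPI_COMM_WORLD);
>>>> 
>>>>   printf("[%d] nonzerostate=%d flagsum=%d latersum=%d\n",rank,nonzerostate,flagsum,latersum);
>>>>   MPI_Finalize();
>>>>   return 0;
>>>> }
>>>> 
>>>> Run on exactly two processes this completes without any MPI error, but each process ends up with numbers that came from the "wrong" reduction, which is why nothing obvious shows up in the debugger.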
>>> 
>>> ok!  I understand now...
>>> 
>>> Then, would wrapping these calls (in debug mode) with an MPI_Barrier before/after have prevented misleading the person doing the debugging?  In our code, we do check (in debug mode) that all processes make the same collective MPI calls, with a kind of barrier which communicates the line number and the file name (transformed into an int value), so that a process is blocked at the first wrongly matched MPI call...
>> 
>>     This is a good idea. Do you have a C macro implementation you would be willing to share? It would be trivial to add to PETSc if you had something like
>> 
>> #if debug
>> #define MPIU_Allreduce(.....) macro that first checks function and line number
>> #else
>> #define MPIU_Allreduce MPI_Allreduce
>> #endif
>> 
> 
> Yes/no: it is really not coded like this for us, since we use our own MPI classes, but you mostly already have the required function, which is PetscValidLogicalCollectiveInt... except that the "ints" you check would be:
> 
> [0] => __LINE__
> [1] => __FILE__ transformed into an int (we just sum the chars...)
> [2] => a static counter that is incremented (to check that the very same line+file is called the very same number of times -- an error that would not be caught if you compared only [0] and [1]).
> 
> So it could be something like this:
> 
> #if debug
> #define MPIU_Allreduce(.....) \
> PetscVerifyExecutionPoint(__LINE__,__FILE__,comm); \
> MPI_Allreduce...
> #else
> #define MPIU_Allreduce MPI_Allreduce
> #endif
> 
> with:
> 
> void PetscVerifyExecutionPoint(int line, const char *file, MPI_Comm comm)
> {
>   static int lCount = 0;
>   ++lCount;
> 
>   /* better to communicate all 3 at the same time... */
>   SomewhatPetscValidLogicalCollectiveInt(file,comm);   /* the file name, summed into an int */
>   SomewhatPetscValidLogicalCollectiveInt(line,comm);
>   SomewhatPetscValidLogicalCollectiveInt(lCount,comm);
> }
> 
> and all of this is not my idea, it is from Patrick... ;)
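> 
> A more complete standalone sketch of the same idea (the names and details here are only illustrative, not our actual code) could be:
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> /* Collectively check that every process reached the same file/line the same
>    number of times; stop at the first mismatch instead of letting two
>    unrelated collectives pair up silently.                                   */
> void PetscVerifyExecutionPoint(int line,const char *file,MPI_Comm comm)
> {
>   static int lCount = 0;
>   int        vals[3],minvals[3],maxvals[3],fileint = 0,i;
> 
>   ++lCount;
>   for (i = 0; file[i]; i++) fileint += (unsigned char)file[i]; /* sum the chars */
> 
>   vals[0] = line; vals[1] = fileint; vals[2] = lCount;
>   /* better to communicate all 3 at the same time... */
>   MPI_Allreduce(vals,minvals,3,MPI_INT,MPI_MIN,comm);
>   MPI_Allreduce(vals,maxvals,3,MPI_INT,MPI_MAX,comm);
>   for (i = 0; i < 3; i++) {
>     if (minvals[i] != maxvals[i]) {
>       fprintf(stderr,"Collective call mismatch detected at %s:%d (call #%d)\n",file,line,lCount);
>       MPI_Abort(comm,1);
>     }
>   }
> }
> 
> If every MPIU_Allreduce() expands to this check followed by the real MPI_Allreduce(), a process that skipped a reduction either trips the mismatch here or blocks inside the check, so the wrongly matched call is caught right at its source.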
> 
>> 
>>> 
>>> 
>>>> 
>>>>   The way I finally found the bug was to put a break point in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
>>> 
>>> okay...wow... you modified the mpi code?
>> 
>>    No, I just ran both processes in the debugger with the break point doing a cont each time until they did not match.
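>> 
>>    For example (illustrative gdb commands, assuming one debugger attached per process):
>> 
>>    (gdb) break MPI_Allreduce
>>    (gdb) run            # or continue, if attached to an already running process
>>    (gdb) backtrace      # note where this process is calling the reduction from
>>    (gdb) continue       # repeat on both processes until the two backtraces differ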
> 
> ok!  I forgot we can do this... ;)
> 
> Eric
> 



