[petsc-dev] Bug in MatShift_MPIAIJ ?
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Wed Oct 21 10:28:58 CDT 2015
On 21/10/15 10:56 AM, Barry Smith wrote:
>
> I similarly saw the same behavior in the debugger that you reported and was mystified.
> This is why I called it a nasty bug. The setting of a different nonew flag meant that one process called the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
>
> if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
> PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
> ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
> }
>
> but the other process skipped this call. Thus the other process got to the MPI_Allreduce() in PetscValidLogicalCollectiveEnum() and exchanged data between the two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() doesn't have any concept of tags, if one process skips an Allreduce call then the next Allreduce call gets matched with it.
ok! I understand now...
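To make the mismatch concrete, here is a minimal two-rank sketch (the variable names and values are placeholders, not PETSc code) of how a skipped Allreduce silently pairs two unrelated reductions:

  /* mismatch_demo.c: run with two ranks, e.g. "mpiexec -n 2 ./mismatch_demo" */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int state = 100 + rank, state_sum = -1; /* plays the role of the nonzero-state sum */
    int param = 7,          param_sum = -1; /* plays the role of a "same argument on
                                               every rank?" consistency check          */

    if (rank == 0) {
      /* Only rank 0 takes this branch: the reduction that rank 1 skips. */
      MPI_Allreduce(&state, &state_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    /* Both ranks reach this call, but on rank 1 it is matched with the state
       reduction above on rank 0, so two reductions that serve different
       purposes silently exchange their values.                              */
    MPI_Allreduce(&param, &param_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 1) {
      /* Balance the call count so the demo terminates instead of hanging. */
      MPI_Allreduce(&state, &state_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    printf("[%d] state_sum=%d param_sum=%d\n", rank, state_sum, param_sum);
    MPI_Finalize();
    return 0;
  }

Each rank ends up with results that mix values from the two different reductions, and no MPI error is reported, which is exactly why this kind of bug is hard to see in a debugger.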
Then, would wrapping these calls (in debug mode) with an MPI_Barrier before/after have prevented misleading the person doing the debugging? In our code, we check (in debug mode) that all processes make the same collective MPI calls, with a kind of barrier which communicates the line number and the file name (transformed into an int value), so that a process is blocked at the first wrongly matched MPI call...
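A minimal sketch of that kind of debug-mode check (the names DebugCollectiveCheck/DEBUG_COLLECTIVE_CHECK are placeholders): every rank gathers its (line number, file-name hash) pair and the run aborts as soon as one rank disagrees with rank 0:

  /* A sketch of a debug-mode collective check; names are placeholders. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Hash the file name into an int so it can be compared across ranks. */
  static int HashFileName(const char *file)
  {
    unsigned int h = 5381;
    for (; *file; ++file) h = h * 33u + (unsigned char)*file;
    return (int)(h & 0x7fffffff);
  }

  /* Gather every rank's (line, file-hash) pair; if any rank is at a
     different call site than rank 0, abort right there, before the real
     collective call can be matched with the wrong one.                  */
  static void DebugCollectiveCheck(MPI_Comm comm, int line, const char *file)
  {
    int me[2] = {line, HashFileName(file)};
    int size, rank, i, *all;

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    all = (int *)malloc(2 * (size_t)size * sizeof(int));
    MPI_Allgather(me, 2, MPI_INT, all, 2, MPI_INT, comm);
    for (i = 0; i < size; ++i) {
      if (all[2 * i] != all[0] || all[2 * i + 1] != all[1]) {
        if (!rank) fprintf(stderr, "rank %d is at a different collective call site than rank 0\n", i);
        free(all);
        MPI_Abort(comm, 1);
      }
    }
    free(all);
  }

  /* Wrap collective calls with this in debug builds, before (and/or after)
     the real call.                                                         */
  #define DEBUG_COLLECTIVE_CHECK(comm) DebugCollectiveCheck((comm), __LINE__, __FILE__)

The MPI_Allgather here is itself collective, so it also plays the role of the barrier mentioned above.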
>
> The way I finally found the bug was to put a break point in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
okay... wow... you modified the MPI code? I was wondering whether there is a runtime or compile-time option for Open MPI/MPICH to do this for all
relevant collective MPI calls?
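One standard mechanism that can serve here, sketched below with placeholder file and message names, is the PMPI profiling interface: provide your own MPI_Allreduce that logs the call (or is a convenient breakpoint target) and forwards to PMPI_Allreduce:

  /* allreduce_wrap.c: intercept every MPI_Allreduce via the PMPI
     profiling interface, without modifying the MPI library itself;
     compile and link this in front of the application.                */
  #include <mpi.h>
  #include <stdio.h>

  int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                    MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  {
    int rank;
    PMPI_Comm_rank(comm, &rank);
    /* One central place to set a breakpoint, or to log which ranks
       reach which reduction and in what order.                       */
    fprintf(stderr, "[%d] MPI_Allreduce(count=%d)\n", rank, count);
    return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
  }

The same wrapper pattern works for any other collective one wants to watch.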
Anyway, thanks again!
Eric