[petsc-dev] Bug in MatShift_MPIAIJ ?
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Wed Oct 21 10:28:58 CDT 2015
On 21/10/15 10:56 AM, Barry Smith wrote:
>
> I similarly saw the same behavior in the debugger that you reported and was mystified.
> This is why I called it a nasty bug. The setting of a different nonew flag meant that one process called the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
>
> if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
> PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
> ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
> }
>
> but the other process skipped this call. Thus the other process got to the MPI_Allreduce() in PetscValidLogicalCollectiveEnum() and exchanged data between the two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() doesn't have any concept of tags, if one process skips an Allreduce call then the next Allreduce call gets matched with it.
ok! I understand now...
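To make the mismatch concrete, here is a minimal two-rank sketch (the variable names and values are placeholders, not PETSc code) of how a skipped Allreduce silently pairs two unrelated reductions:

  /* mismatch_demo.c: run with two ranks, e.g. "mpiexec -n 2 ./mismatch_demo" */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int state = 100 + rank, state_sum = -1; /* plays the role of the nonzero-state sum */
    int param = 7,          param_sum = -1; /* plays the role of a "same argument on
                                               every rank?" consistency check          */

    if (rank == 0) {
      /* Only rank 0 takes this branch: the reduction that rank 1 skips. */
      MPI_Allreduce(&state, &state_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    /* Both ranks reach this call, but on rank 1 it is matched with the state
       reduction above on rank 0, so two reductions that serve different
       purposes silently exchange their values.                              */
    MPI_Allreduce(&param, &param_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 1) {
      /* Balance the call count so the demo terminates instead of hanging. */
      MPI_Allreduce(&state, &state_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    printf("[%d] state_sum=%d param_sum=%d\n", rank, state_sum, param_sum);
    MPI_Finalize();
    return 0;
  }

Each rank ends up with results that mix values from the two different reductions, and no MPI error is reported, which is exactly why this kind of bug is hard to see in a debugger.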
Then, would wrapping these calls (in debug mode) with an MPI_Barrier before/after have prevented misleading the person doing the debugging? In our code, we check (in debug mode) that all processes make the same collective MPI calls, with a kind of barrier which communicates the line number and the file name (transformed into an int value), so that a process is blocked at the first wrongly matched MPI call...
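A minimal sketch of that kind of debug-mode check (the names DebugCollectiveCheck/DEBUG_COLLECTIVE_CHECK are placeholders): every rank gathers its (line number, file-name hash) pair and the run aborts as soon as one rank disagrees with rank 0:

  /* A sketch of a debug-mode collective check; names are placeholders. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Hash the file name into an int so it can be compared across ranks. */
  static int HashFileName(const char *file)
  {
    unsigned int h = 5381;
    for (; *file; ++file) h = h * 33u + (unsigned char)*file;
    return (int)(h & 0x7fffffff);
  }

  /* Gather every rank's (line, file-hash) pair; if any rank is at a
     different call site than rank 0, abort right there, before the real
     collective call can be matched with the wrong one.                  */
  static void DebugCollectiveCheck(MPI_Comm comm, int line, const char *file)
  {
    int me[2] = {line, HashFileName(file)};
    int size, rank, i, *all;

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    all = (int *)malloc(2 * (size_t)size * sizeof(int));
    MPI_Allgather(me, 2, MPI_INT, all, 2, MPI_INT, comm);
    for (i = 0; i < size; ++i) {
      if (all[2 * i] != all[0] || all[2 * i + 1] != all[1]) {
        if (!rank) fprintf(stderr, "rank %d is at a different collective call site than rank 0\n", i);
        free(all);
        MPI_Abort(comm, 1);
      }
    }
    free(all);
  }

  /* Wrap collective calls with this in debug builds, before (and/or after)
     the real call.                                                         */
  #define DEBUG_COLLECTIVE_CHECK(comm) DebugCollectiveCheck((comm), __LINE__, __FILE__)

The MPI_Allgather here is itself collective, so it also plays the role of the barrier mentioned above.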
>
> The way I finally found the bug was to put a break point in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
okay... wow... you modified the MPI code? I was wondering whether there is a runtime or compile-time option for Open MPI/MPICH to do this for all
relevant collective MPI calls?
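One standard mechanism that can serve here, sketched below with placeholder file and message names, is the PMPI profiling interface: provide your own MPI_Allreduce that logs the call (or is a convenient breakpoint target) and forwards to PMPI_Allreduce:

  /* allreduce_wrap.c: intercept every MPI_Allreduce via the PMPI
     profiling interface, without modifying the MPI library itself;
     compile and link this in front of the application.                */
  #include <mpi.h>
  #include <stdio.h>

  int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                    MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  {
    int rank;
    PMPI_Comm_rank(comm, &rank);
    /* One central place to set a breakpoint, or to log which ranks
       reach which reduction and in what order.                       */
    fprintf(stderr, "[%d] MPI_Allreduce(count=%d)\n", rank, count);
    return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
  }

The same wrapper pattern works for any other collective one wants to watch.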
Anyway, thanks again!
Eric