[petsc-dev] Bug in MatShift_MPIAIJ ?
Barry Smith
bsmith at mcs.anl.gov
Wed Oct 21 09:56:46 CDT 2015
I saw the same behavior in the debugger that you reported and was equally mystified.
This is why I called it a nasty bug. Because the nonew flag was set differently on different processes, one process entered the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
  PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
  ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
}
but the other process skipped it. The skipping process therefore reached the MPI_Allreduce() inside PetscValidLogicalCollectiveEnum(), and data was exchanged between two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() has no concept of tags, when one process skips an Allreduce call, its next Allreduce call simply gets matched with whatever Allreduce the other processes are in.
The way I finally found the bug was to put a break point in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
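For concreteness, a session along these lines (commands illustrative, not a verbatim transcript; the way gdb is attached to each rank depends on your MPI launcher) can expose the divergence:

```gdb
# attach one gdb per rank, e.g.: mpiexec -n 2 xterm -e gdb ./app
(gdb) break MPI_Allreduce
(gdb) run
(gdb) backtrace        # note the caller of MPI_Allreduce on each rank
(gdb) continue         # repeat until the two ranks show different callers
```

Once the backtraces on the two ranks name different callers, the mismatched pair of collectives has been found.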
Barry
> On Oct 21, 2015, at 8:04 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>
> Thanks Barry! :)
>
> another question: while trying to understand this, I used "-on_error_attach_debugger ddd", which worked, BUT everything at the line where it broke looked ok to me. I mean, the value of the "op" variable tested at line petsc-3.6.2/src/mat/interface/matrix.c:5264 :
>
> 5256 PetscErrorCode MatSetOption(Mat mat,MatOption op,PetscBool flg)
> 5257 {
> 5258 PetscErrorCode ierr;
> 5259
> 5260 PetscFunctionBegin;
> 5261 PetscValidHeaderSpecific(mat,MAT_CLASSID,1);
> 5262 PetscValidType(mat,1);
> 5263 if (op > 0) {
> 5264 PetscValidLogicalCollectiveEnum(mat,op,2);
> ...
>
> was the same on all (2) processes, but printing the values of variables "op", "b1" and "b2" used in the macro PetscValidLogicalCollectiveEnum gave me:
>
> =====
> process rank 1:
> =====
> (gdb) print b2[0]
> $1 = 5
> (gdb) print b2[1]
> $2 = 17
>
> and for b1:
>
> (gdb) print b1[1]
> $3 = 17
> (gdb) print b1[0]
> $4 = -17
>
> and:
> (gdb) print (int)op
> $7 = 17
>
> =====
> process rank 0:
> =====
> (gdb) print b2[0]
> $1 = 5
> (gdb) print b2[1]
> $2 = 17
> (gdb) print b1[0]
> $3 = -17
> (gdb) print b1[1]
> $4 = 17
> (gdb) print (int)(op)
> $5 = 17
>
> So the local values of "op" and "b1" are all correct, but the "MPI_Allreduce" produced an invalid value ????
>
> I am not quite a PETSc expert, but I would have expected that the debugger started at that point would have given me a chance to understand what was happening... could something be done with that verification to help users like me debug more easily?
>
> Thanks anyway!
>
> Eric
>
> On 20/10/15 10:47 PM, Barry Smith wrote:
>>
>> Eric,
>>
>> Thanks for the test case. I have determined the problem; it is a nasty bug caused by overly convoluted code.
>>
>> The MatSeqAIJSetPreallocation() is there because, if the matrix had been assembled but had no values in it, MatShift_Basic() took forever, since a new malloc was needed for each local row. The problem is that MatSeqAIJSetPreallocation() changed the value of the aij->nonew flag of that sequential object, BUT MatAssemblyEnd_MPIAIJ() assumed that the value of this flag was identical on all processes. In your case, since aij->nz = 0 on your matrix with no local rows, the value of nonew was changed on one process but not on the others, triggering disaster in MatAssemblyEnd_MPIAIJ().
>>
>> This is now fixed in the maint, master and next branches and will be in the next patch release. I have also attached the patch to this email.
>>
>> Barry
>>
>