[petsc-dev] Bug in MatShift_MPIAIJ ?
Barry Smith
bsmith at mcs.anl.gov
Wed Oct 21 09:56:46 CDT 2015
I saw the same behavior in the debugger that you reported and was equally mystified.
This is why I called it a nasty bug. Because the nonew flag was set differently on different processes, one process entered the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
  PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
  ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
}
but the other process skipped it. The skipping process therefore reached the MPI_Allreduce() inside PetscValidLogicalCollectiveEnum(), and data was exchanged between two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() has no concept of tags, when one process skips an Allreduce call, its next Allreduce call simply gets matched with whatever Allreduce the other processes are in.
The way I finally found the bug was to put a break point in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
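For concreteness, a session along these lines (commands illustrative, not a verbatim transcript; the way gdb is attached to each rank depends on your MPI launcher) can expose the divergence:

```gdb
# attach one gdb per rank, e.g.: mpiexec -n 2 xterm -e gdb ./app
(gdb) break MPI_Allreduce
(gdb) run
(gdb) backtrace        # note the caller of MPI_Allreduce on each rank
(gdb) continue         # repeat until the two ranks show different callers
```

Once the backtraces on the two ranks name different callers, the mismatched pair of collectives has been found.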
Barry
> On Oct 21, 2015, at 8:04 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>
> Thanks Barry! :)
>
> another question: while trying to understand this, I used "-on_error_attach_debugger ddd", which worked, BUT everything at the line where it broke looked ok to me. I mean, the value of the "op" variable tested at line petsc-3.6.2/src/mat/interface/matrix.c:5264 :
>
> 5256 PetscErrorCode MatSetOption(Mat mat,MatOption op,PetscBool flg)
> 5257 {
> 5258 PetscErrorCode ierr;
> 5259
> 5260 PetscFunctionBegin;
> 5261 PetscValidHeaderSpecific(mat,MAT_CLASSID,1);
> 5262 PetscValidType(mat,1);
> 5263 if (op > 0) {
> 5264 PetscValidLogicalCollectiveEnum(mat,op,2);
> ...
>
> was the same on all (2) processes, but printing the values of variables "op", "b1" and "b2" used in the macro PetscValidLogicalCollectiveEnum gave me:
>
> =====
> process rank 1:
> =====
> (gdb) print b2[0]
> $1 = 5
> (gdb) print b2[1]
> $2 = 17
>
> and for b1:
>
> (gdb) print b1[1]
> $3 = 17
> (gdb) print b1[0]
> $4 = -17
>
> and:
> (gdb) print (int)op
> $7 = 17
>
> =====
> process rank 0:
> =====
> (gdb) print b2[0]
> $1 = 5
> (gdb) print b2[1]
> $2 = 17
> (gdb) print b1[0]
> $3 = -17
> (gdb) print b1[1]
> $4 = 17
> (gdb) print (int)(op)
> $5 = 17
>
> So the local values of "op" and "b1" are all correct, but the "MPI_Allreduce" produced an invalid value ????
>
> I am not quite a PETSc expert, but I would have expected that the debugger started at that point would have given me a chance to understand what was happening... could something be done with that verification to help users like me debug more easily?
>
> Thanks anyway!
>
> Eric
>
> On 20/10/15 10:47 PM, Barry Smith wrote:
>>
>> Eric,
>>
>> Thanks for the test case. I have determined the problem; it is a nasty bug caused by overly convoluted code.
>>
>> The MatSeqAIJSetPreallocation() is there because, if the matrix had been assembled but had no values in it, MatShift_Basic() took forever, since a new malloc was needed for each local row. The problem is that MatSeqAIJSetPreallocation() changed the value of the aij->nonew flag of that sequential object, BUT MatAssemblyEnd_MPIAIJ() assumed that the value of this flag was identical on all processes. In your case, since aij->nz = 0 on your matrix with no local rows, the value of nonew was changed on one process but not on the others, triggering disaster in MatAssemblyEnd_MPIAIJ().
>>
>> This is now fixed in the maint, master and next branches and will be in the next patch release. I have also attached the patch to this email.
>>
>> Barry
>>
>