[petsc-dev] Bug in MatShift_MPIAIJ ?

Barry Smith bsmith at mcs.anl.gov
Tue Oct 20 21:47:45 CDT 2015


  Eric,

   Thanks for the test case. I have determined the problem; it is a nasty bug caused by overly convoluted code.

The MatSeqAIJSetPreallocation() call is there because, if the matrix had been assembled but had no values in it, MatShift_Basic() took forever since a new malloc needed to be done for each local row. The problem is that MatSeqAIJSetPreallocation() changed the value of the aij->nonew flag of that sequential object, BUT MatAssemblyEnd_MPIAIJ() assumed that the value of this flag was identical on all processes. In your case, since aij->nz = 0 on the process with no local rows, the value of nonew was changed on one process but not on the others, triggering disaster in MatAssemblyEnd_MPIAIJ().
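
To make the mismatch concrete, here is a toy model in plain C with no PETSc or MPI involved. Every name in it (ToyBlock, ToyPreallocate, ToyShiftPrepare, ToyFlagsAgree, ToyDemo) is hypothetical, and it simply assumes that preallocation resets the flag to 0 and that the flag starts at 1 everywhere; the real routines do considerably more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one process's sequential diagonal block.
   This is NOT the real Mat_SeqAIJ layout. */
typedef struct {
  int nz;    /* stored nonzeros in the local block */
  int nonew; /* flag the assembly step expects to match on every process */
} ToyBlock;

/* Assumption for illustration: preallocating resets the flag to 0. */
void ToyPreallocate(ToyBlock *b)
{
  b->nonew = 0;
}

/* Mirrors the "else if (!aij->nz)" branch: only a process whose local
   block is empty re-preallocates, so only that process has its flag changed. */
void ToyShiftPrepare(ToyBlock *b)
{
  if (!b->nz) ToyPreallocate(b);
}

/* Returns 1 when every simulated process holds the same flag value. */
int ToyFlagsAgree(const ToyBlock *blocks, size_t nprocs)
{
  size_t i;
  assert(nprocs > 0);
  for (i = 1; i < nprocs; i++) {
    if (blocks[i].nonew != blocks[0].nonew) return 0;
  }
  return 1;
}

/* Two simulated processes: the first owns rows with values, the second owns none.
   Returns 1 if the flags still agree after the shift-time preparation. */
int ToyDemo(void)
{
  ToyBlock blocks[2] = {{5, 1}, {0, 1}};
  size_t   i;
  for (i = 0; i < 2; i++) ToyShiftPrepare(&blocks[i]);
  return ToyFlagsAgree(blocks, 2);
}
```

In this model ToyDemo() reports a disagreement: only the process with the empty block passes through the preallocation branch. Any later step that decides what to do from the local flag value would then behave differently from one process to the next.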

This is now fixed in the maint, master and next branches and will be in the next patch release. I have also attached the patch to this email.

  Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix-mat-shift-mpi-bug.patch
Type: application/octet-stream
Size: 1830 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20151020/2e51a895/attachment.obj>
-------------- next part --------------


It is not a great fix. We should overhaul the handling of the nonew options to not overload that flag with several distinct meanings.



#undef __FUNCT__
#define __FUNCT__ "MatShift_MPIAIJ"
PetscErrorCode MatShift_MPIAIJ(Mat Y,PetscScalar a)
{
  PetscErrorCode ierr;
  Mat_MPIAIJ     *maij = (Mat_MPIAIJ*)Y->data;
  Mat_SeqAIJ     *aij = (Mat_SeqAIJ*)maij->A->data;

  PetscFunctionBegin;
  if (!Y->preallocated) {
    ierr = MatMPIAIJSetPreallocation(Y,1,NULL,0,NULL);CHKERRQ(ierr);
  } else if (!aij->nz) {
    ierr = MatSeqAIJSetPreallocation(maij->A,1,NULL);CHKERRQ(ierr);
  }
  ierr = MatShift_Basic(Y,a);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
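
One way to keep the flag from drifting would be to save it before the sequential preallocation and restore it afterwards. The sketch below is a standalone toy and is not the contents of the attached patch; ToySeqBlock, ToyResetPreallocation and ToyPreallocateKeepFlag are hypothetical names, and the reset-to-0 side effect is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sequential block; NOT the real Mat_SeqAIJ. */
typedef struct {
  int nz;    /* stored nonzeros in the local block */
  int nonew; /* flag that must stay consistent across processes */
} ToySeqBlock;

/* Assumption: stands in for a preallocation routine that resets the flag
   as a side effect. */
void ToyResetPreallocation(ToySeqBlock *b)
{
  b->nonew = 0;
}

/* Preallocate an empty block, but leave the flag as the caller found it. */
void ToyPreallocateKeepFlag(ToySeqBlock *b)
{
  int saved;
  assert(b != NULL);
  saved = b->nonew;                     /* remember the caller-visible value */
  if (!b->nz) ToyResetPreallocation(b); /* side effect may clobber the flag  */
  b->nonew = saved;                     /* restore it unconditionally        */
}
```

With this pattern a process that owns no rows ends up with the same flag value as one that skipped the branch entirely. It works around the symptom without untangling the several meanings attached to the flag.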


> On Oct 20, 2015, at 2:37 PM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
> 
> Hi,
> 
> I think I have made a simple (valid) example.
> 
> Please have a look at the attachment.
> 
> Compiled as is, the 2nd process builds a matrix with no local lines, copies it, then calls MatShift on it, which shows the bug (with debugging=yes and -on_error_attach_debugger ddd).  You can comment out the line:
> 
> #define SET_2nd_PROC_TO_HAVE_NO_LOCAL_LINES
> 
> to see that it runs fine if the 2nd process has local lines...
> 
> The "magic" happens only if I call MatConvert on the "original" (C) matrix...
> 
> Hope this helps!
> 
> Thanks,
> 
> Eric
> 
> On 08/10/15 04:18 PM, Barry Smith wrote:
>> 
>>> On Oct 8, 2015, at 2:50 PM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>>> 
>>> On 08/10/15 03:30 PM, Satish Balay wrote:
>>>> The commented code is sequential code - and shouldn't make a
>>>> difference.
>>> 
>>> ...but it does!
>>> 
>>>> 
>>>> Perhaps your application has other issues.
>>> 
>>>> 
>>>> Can you verify if your code is valgrind clean?
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> 
>>> Yes it is valgrind clean.  We have run the same regression tests under valgrind every night, and checked the results every morning, since June 27, 2012.  Valgrind runs over ~2000 tests (in our suite) and takes almost 13 hours to complete. It is costly, but believe me, the reports have given us very useful results since we started doing this check.
>>> 
>>> The point is that I do have "0 == aij->nz" because the matrix has 0 lines on one processor.  So why should it pass into  Mat*SetPreallocation?
>>> 
>>> I understand it will speed things up to preallocate the diagonal if you have not preallocated it, but the criterion of (0 ==  aij->nz) is not right in the case where you have no lines on one processor!
>>> 
>>> In other words, after the call to Mat*SetPreallocation, in that case, it must still have (0 == aij->nz) because there are no lines on the processor...
>> 
>>   Sure, but why would that cause a hang or any other problem?  Calling  preallocation on a zero row matrix should be harmless. We need a test case that demonstrates the problem so we can reproduce the problem and determine the fundamental cause.
>> 
>>   Barry
>> 
>>> 
>>> Thanks!
>>> 
>>> Eric
>>> 
> 
> <ex21_only_proc0_owns_rows.c>


