[petsc-dev] Bug in MatShift_MPIAIJ ?

Barry Smith bsmith at mcs.anl.gov
Sat Aug 15 12:34:23 CDT 2015


  I have merged the bug fix for both into master. 

  I have merged the bug fix for MatShift() into maint. Due to Jed's concern about "breaking the ABI" in the release I have not merged the error checker fix for the possible deadlock with ->nonew having different values on different processes into maint.

   Barry

> On Aug 14, 2015, at 6:16 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> 
>   Patrick
> 
>     Thanks for reporting these two bugs.
> 
>> On Aug 12, 2015, at 9:40 AM, Patrick Lacasse <patrick.m.lacasse at gmail.com> wrote:
>> 
>> I have a dead lock in MatAssemblyEnd_MPIAIJ.
>> It uses nonew as collective :
>> if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
>>    ierr = MPI_Allreduce( ...
>> 
>> this is a good optimization for me, however this value is set by
>> MatSetOptions(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
>> and this enum is negative
>> MAT_NEW_NONZERO_ALLOCATION_ERR = -2
>> thus it is not asserted as collective by MatSetOption.
>> 
>> Some of my procs have nonew==0 and one has nonew==-2.
>> I'm not sure why adding new nonzero should be an error only for some procs?
>> I suggest that this error flag should be set collectively, so changing the value of the enum to positive (patch 0001).
> 
> Fixed in branches barry/fix-nonew-notcollective/maint and commit 60bf598 not yet merged into next because I get a conflict with something Jed put in a long time ago that has no branch (Jed see other email).
> 
> 
> MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR  must be collective because they change the value of ->nonew which is
> used to decide if some MPI_Allreduce() are called. Thus with different values the code could hang
> 
> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
> 
>> 
>> My reel problem was caused by MatShift_MPIAIJ and the way it determine if the matrix need to be preallocated :
>> if (!aij->nz && !bij->nz)
>> the results can be true for some procs (with no local lines)
>> and false for other procs.
>> I suggest to use Y->preallocated instead (patch 0002).
> 
> Fixed in branches barry/fix-matshift/maint and next and commit 6f33a89  will merge into maint and master after testing.
> 
> MatShift_MPI/SeqXAIJ() could hang if some processes had no entries on a process while others had entries
> because some processes would attempt a parallel preallocation and the others would not.
> 
> Fixed by first checking if no preallocation was done, and if not doing. Otherwise preallocation is only done
> if approprate by each process on the diagonal block portion of the matrix, thus not requiring all processes
> that share the matrix to call the parallel preallocation routine                                                                                                   
> 
> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
> 
>> 
> 
>> thanks,
>> 
>> Patrick Lacasse
>> 
>> 
>> 
>> 
>> <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>
> 




More information about the petsc-dev mailing list