[petsc-dev] Bug in MatShift_MPIAIJ ?
Barry Smith
bsmith at mcs.anl.gov
Sat Aug 15 12:34:23 CDT 2015
I have merged the bug fix for both into master.
I have merged the bug fix for MatShift() into maint. Due to Jed's concern about "breaking the ABI" in the release I have not merged the error checker fix for the possible deadlock with ->nonew having different values on different processes into maint.
Barry
> On Aug 14, 2015, at 6:16 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>
> Patrick
>
> Thanks for reporting these two bugs.
>
>> On Aug 12, 2015, at 9:40 AM, Patrick Lacasse <patrick.m.lacasse at gmail.com> wrote:
>>
>> I have a dead lock in MatAssemblyEnd_MPIAIJ.
>> It uses nonew as collective :
>> if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
>> ierr = MPI_Allreduce( ...
>>
>> this is a good optimization for me, however this value is set by
>> MatSetOptions(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
>> and this enum is negative
>> MAT_NEW_NONZERO_ALLOCATION_ERR = -2
>> thus it is not asserted as collective by MatSetOption.
>>
>> Some of my procs have nonew==0 and one has nonew==-2.
>> I'm not sure why adding new nonzero should be an error only for some procs?
>> I suggest that this error flag should be set collectively, so changing the value of the enum to positive (patch 0001).
>
> Fixed in branches barry/fix-nonew-notcollective/maint and commit 60bf598 not yet merged into next because I get a conflict with something Jed put in a long time ago that has no branch (Jed see other email).
>
>
> MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR must be collective because they change the value of ->nonew which is
> used to decide if some MPI_Allreduce() are called. Thus with different values the code could hang
>
> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>
>>
>> My reel problem was caused by MatShift_MPIAIJ and the way it determine if the matrix need to be preallocated :
>> if (!aij->nz && !bij->nz)
>> the results can be true for some procs (with no local lines)
>> and false for other procs.
>> I suggest to use Y->preallocated instead (patch 0002).
>
> Fixed in branches barry/fix-matshift/maint and next and commit 6f33a89 will merge into maint and master after testing.
>
> MatShift_MPI/SeqXAIJ() could hang if some processes had no entries on a process while others had entries
> because some processes would attempt a parallel preallocation and the others would not.
>
> Fixed by first checking if no preallocation was done, and if not doing. Otherwise preallocation is only done
> if approprate by each process on the diagonal block portion of the matrix, thus not requiring all processes
> that share the matrix to call the parallel preallocation routine
>
> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>
>>
>
>> thanks,
>>
>> Patrick Lacasse
>>
>>
>>
>>
>> <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>
>
More information about the petsc-dev
mailing list