[petsc-dev] Bug in MatShift_MPIAIJ ?

Barry Smith bsmith at mcs.anl.gov
Fri Aug 14 18:16:39 CDT 2015


   Patrick

     Thanks for reporting these two bugs.

> On Aug 12, 2015, at 9:40 AM, Patrick Lacasse <patrick.m.lacasse at gmail.com> wrote:
> 
> I have a deadlock in MatAssemblyEnd_MPIAIJ.
> It uses nonew as if it were collective:
> if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
>     ierr = MPI_Allreduce( ...
> 
> This is a good optimization for me; however, this value is set by
> MatSetOption(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
> and the enum value is negative,
> MAT_NEW_NONZERO_ALLOCATION_ERR = -2,
> so it is not asserted as collective by MatSetOption.
> 
> Some of my procs have nonew==0 and one has nonew==-2.
> I'm not sure why adding a new nonzero should be an error on only some procs.
> I suggest that this error flag should be set collectively, so I have changed the value of the enum to a positive one (patch 0001).

Fixed in branch barry/fix-nonew-notcollective/maint (commit 60bf598); not yet merged into next because I get a conflict with something Jed put in a long time ago that has no branch (Jed, see the other email).


MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR must be collective because they change the value of ->nonew, which is
used to decide whether certain MPI_Allreduce() calls are made. Thus, with different values on different processes, the code could hang.

Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
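
To make the failure mode concrete, here is a minimal MPI sketch of the hang (this is not the PETSc source; the rank-dependent flag assignment is invented for illustration). The ranks that skip the guarded branch never enter the MPI_Allreduce() the other ranks are waiting in:

#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, nonew, in = 1, out;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Pretend MAT_NEW_NONZERO_ALLOCATION_ERR was set on rank 0 only,
     so rank 0 has nonew = -2 while the others keep the default 0. */
  nonew = (rank == 0) ? -2 : 0;

  if (!nonew) {
    /* Same shape as the guarded reduction in MatAssemblyEnd_MPIAIJ:
       ranks with nonew == 0 enter the collective call, rank 0 never does,
       so on more than one process this hangs here by design. */
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_PROD, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}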

> 
> My real problem was caused by MatShift_MPIAIJ and the way it determines whether the matrix needs to be preallocated:
> if (!aij->nz && !bij->nz)
> The result can be true for some procs (those with no local rows)
> and false for other procs.
> I suggest using Y->preallocated instead (patch 0002).
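
The situation being described looks roughly like the following sketch, built on the public PETSc API (the sizes and layout are invented, error checking is omitted, and it is meant as an illustration rather than a verified reproducer): rank 0 owns all the rows and stores nonzeros, the remaining ranks own none, and MatShift() is then called on every rank.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscMPIInt rank;
  PetscInt    i, n = 10, nlocal;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

  nlocal = (rank == 0) ? n : 0;           /* ranks > 0 own no rows */

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nlocal, nlocal, n, n);
  MatSetType(A, MATMPIAIJ);
  MatSetUp(A);

  for (i = 0; i < nlocal; i++) {          /* only rank 0 stores any nonzeros */
    MatSetValue(A, i, i, 1.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatShift(A, 2.0);   /* every rank must take the same branch inside */

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}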

Fixed in branches barry/fix-matshift/maint and next (commit 6f33a89); will merge into maint and master after testing.

MatShift_MPI/SeqXAIJ() could hang if some processes had no entries while others had entries,
because some processes would attempt a parallel preallocation and the others would not.

Fixed by first checking whether any preallocation was done and, if not, doing it. Otherwise preallocation is done only
if appropriate by each process on the diagonal block portion of the matrix, thus not requiring all processes
that share the matrix to call the parallel preallocation routine.

Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
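
As a rough sketch of the decision logic in the fix (the ToyMat type and helper names below are invented; this is not the actual MatShift_MPIAIJ source): the collective preallocation is guarded only by a flag that every rank computes identically, while any per-rank condition triggers purely local work.

#include <mpi.h>

/* Toy model: "preallocated" is set the same way on every rank because
   preallocation is collective, while "local_nz" can differ per rank. */
typedef struct {
  int preallocated;
  int local_nz;
} ToyMat;

/* Stand-in for the parallel preallocation: collective, every rank must call it. */
static void parallel_preallocate(ToyMat *Y, MPI_Comm comm)
{
  int one = 1, total;
  MPI_Allreduce(&one, &total, 1, MPI_INT, MPI_SUM, comm);
  Y->preallocated = 1;
}

/* Stand-in for preallocating only the local diagonal block: no communication. */
static void local_diag_preallocate(ToyMat *Y)
{
  Y->local_nz = 1;
}

static void shift_sketch(ToyMat *Y, MPI_Comm comm)
{
  if (!Y->preallocated) {
    /* Identical answer on every rank, so all ranks enter the collective call. */
    parallel_preallocate(Y, comm);
  } else if (!Y->local_nz) {
    /* A per-rank condition now triggers only local work; previously a per-rank
       nonzero count decided whether to call the collective routine, which is
       what could hang. */
    local_diag_preallocate(Y);
  }
  /* ... then add the shift to the diagonal entries ... */
}

int main(int argc, char **argv)
{
  ToyMat Y = {0, 0};

  MPI_Init(&argc, &argv);
  shift_sketch(&Y, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}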

> 

> thanks,
> 
> Patrick Lacasse
> 
> 
> 
> 
> <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>



