[petsc-dev] Bug in MatShift_MPIAIJ ?

Wed Oct 7 14:10:18 CDT 2015

Yeah - you can grab the commit 60bf598 mentioned below from:

https://bitbucket.org/petsc/petsc/commits/60bf598

[grab from 'View raw commit' - and apply to petsc-3.6 with 'patch -Np1 < patchfile']

Or use git to obtain petsc.

branch 'maint' would have petsc-3.6
branch 'barry/fix-nonew-notcollective/maint' has the fix.

[You can merge the 2 into a new branch - that you can update with
future maint fixes]

Satish

On Wed, 7 Oct 2015, Eric Chamberland wrote:

> Hi Barry,
> 
> just compiled/tested with 3.6.2.
> 
> On a 2-process example, the non-debug version hangs silently and
> indefinitely, but in debug mode it now aborts when MatShift is called while
> a process has no entries:
> 
> #0  0x00007fffdbc6c065 in raise () from /lib64/libc.so.6
> #1  0x00007fffdbc6d4e8 in abort () from /lib64/libc.so.6
> #2  0x00007fffe2aea789 in PetscTraceBackErrorHandler (comm=0x22d9630,
> line=5264, fun=0x7fffe463b74f <__func__.21315> "MatSetOption",
> file=0x7fffe46384b8
> "/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62,
> p=PETSC_ERROR_INITIAL, mess=0x7ffffffd1100 "Enum value must be same on all
> processes, argument # 2", ctx=0x0) at
> /home/mefpp_ericc/petsc-3.6.2/src/sys/error/errtrace.c:243
> #3  0x00007fffe2ae53ae in PetscError (comm=0x22d9630, line=5264,
> func=0x7fffe463b74f <__func__.21315> "MatSetOption", file=0x7fffe46384b8
> "/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62,
> p=PETSC_ERROR_INITIAL, mess=0x7fffe4639b78 "Enum value must be same on all
> processes, argument # %d") at
> /home/mefpp_ericc/petsc-3.6.2/src/sys/error/err.c:377
> #4  0x00007fffe33eeae3 in MatSetOption (mat=0x2b06d00,
> op=MAT_NO_OFF_PROC_ENTRIES, flg=PETSC_FALSE) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c:5264
> #5  0x00007fffe2e4126e in MatShift_Basic (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/utils/gcreate.c:22
> #6  0x00007fffe330de85 in MatShift_MPIAIJ (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/impls/aij/mpi/mpiaij.c:2614
> #7  0x00007fffe2e3b1b4 in MatShift (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/utils/axpy.c:171
> 
> Is it possible for us to apply only the "missing" patch that you mentioned
> (the one which breaks the ABI...)? It may make petsc usable for us.
> 
> Thanks!
> 
> Eric
> 
> 
> On 15/08/15 01:34 PM, Barry Smith wrote:
> > 
> >    I have merged the bug fix for both into master.
> > 
> >    I have merged the bug fix for MatShift() into maint. Due to Jed's concern
> > about "breaking the ABI" in the release I have not merged the error checker
> > fix for the possible deadlock with ->nonew having different values on
> > different processes into maint.
> > 
> >     Barry
> > 
> > > On Aug 14, 2015, at 6:16 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > > 
> > > 
> > >    Patrick
> > > 
> > >      Thanks for reporting these two bugs.
> > > 
> > > > On Aug 12, 2015, at 9:40 AM, Patrick Lacasse
> > > > <patrick.m.lacasse at gmail.com> wrote:
> > > > 
> > > > > I have a deadlock in MatAssemblyEnd_MPIAIJ.
> > > > > It uses nonew as if it were collective:
> > > > if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
> > > >     ierr = MPI_Allreduce( ...
> > > > 
> > > > > This is a good optimization for me; however, this value is set by
> > > > > MatSetOption(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
> > > > > and this enum is negative:
> > > > > MAT_NEW_NONZERO_ALLOCATION_ERR = -2
> > > > > so MatSetOption does not check that it is set collectively.
> > > > 
> > > > > Some of my procs have nonew==0 and one has nonew==-2.
> > > > > I'm not sure why adding a new nonzero should be an error only for
> > > > > some procs?
> > > > > I suggest that this error flag be set collectively, which means
> > > > > changing the value of the enum to positive (patch 0001).
> > > 
> > > > Fixed in branch barry/fix-nonew-notcollective/maint (commit 60bf598).
> > > > Not yet merged into next because I get a conflict with something Jed
> > > > put in a long time ago that has no branch (Jed see other email).
> > > 
> > > 
> > > > MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR must be
> > > > collective because they change the value of ->nonew, which is
> > > > used to decide whether some MPI_Allreduce() are called. Thus with
> > > > different values on different processes the code could hang.
> > > 
> > > Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
> > > 
> > > > 
> > > > > My real problem was caused by MatShift_MPIAIJ and the way it
> > > > > determines whether the matrix needs to be preallocated:
> > > > > if (!aij->nz && !bij->nz)
> > > > > The result can be true for some procs (those with no local rows)
> > > > > and false for other procs.
> > > > > I suggest using Y->preallocated instead (patch 0002).
> > > 
> > > > Fixed in branches barry/fix-matshift/maint and next (commit 6f33a89);
> > > > will merge into maint and master after testing.
> > > 
> > > > MatShift_MPI/SeqXAIJ() could hang if some processes had no entries
> > > > while others had entries, because some processes would attempt a
> > > > parallel preallocation and the others would not.
> > > > 
> > > > Fixed by first checking whether any preallocation was done and, if
> > > > not, doing it. Otherwise preallocation is only done, if appropriate,
> > > > by each process on the diagonal block portion of the matrix,
> > > > thus not requiring all processes
> > > > that share the matrix to call the parallel preallocation routine.
> > > 
> > > Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
> > > 
> > > > 
> > > 
> > > > thanks,
> > > > 
> > > > Patrick Lacasse
> > > > 
> > > > 
> > > > 
> > > > 
> > > > <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>
> > > 
> 
>
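
Both hangs discussed in this thread share one pattern: a rank-local condition
guards a call that every process must make together. The sketch below is only
an illustration of that pattern in plain C, with no MPI and no PETSc. The
struct RankState and all function names are hypothetical stand-ins invented
for this example; only the three conditions themselves (!aij->nz && !bij->nz,
Y->preallocated, and !nonew) are taken from the messages above. The sample
values assume a 2-process run where the second process owns no rows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-rank snapshot; NOT the real PETSc data structures. */
typedef struct {
  int  diag_nz;      /* stands in for aij->nz (diagonal block)        */
  int  offdiag_nz;   /* stands in for bij->nz (off-diagonal block)    */
  bool preallocated; /* stands in for Y->preallocated                 */
  int  nonew;        /* stands in for the ->nonew field of block B    */
} RankState;

typedef bool (*Predicate)(const RankState *);

/* Old MatShift test quoted in the thread: depends on local entry counts. */
static bool needs_prealloc_local_nz(const RankState *r)
{
  return !r->diag_nz && !r->offdiag_nz;
}

/* Suggested replacement: depends on a flag that is the same on all ranks. */
static bool needs_prealloc_flag(const RankState *r)
{
  return !r->preallocated;
}

/* Guard quoted from MatAssemblyEnd_MPIAIJ: the reduction runs if !nonew. */
static bool enters_allreduce(const RankState *r)
{
  return !r->nonew;
}

/* True when every rank reaches the same decision, so a collective call
 * guarded by the predicate is entered by all ranks or by none. */
static bool collective_safe(const RankState *ranks, size_t n, Predicate p)
{
  assert(n > 0);
  for (size_t i = 1; i < n; i++) {
    if (p(&ranks[i]) != p(&ranks[0])) return false;
  }
  return true;
}

#ifdef DEMO_MAIN
int main(void)
{
  /* Rank 0 owns entries; rank 1 owns none and has nonew == -2. */
  const RankState ranks[2] = { {5, 2, true, 0}, {0, 0, true, -2} };

  printf("local nz test safe:     %d\n",
         collective_safe(ranks, 2, needs_prealloc_local_nz));
  printf("preallocated test safe: %d\n",
         collective_safe(ranks, 2, needs_prealloc_flag));
  printf("!nonew guard safe:      %d\n",
         collective_safe(ranks, 2, enters_allreduce));
  return 0;
}
#endif
```

Compile with -DDEMO_MAIN to run the small demo. A predicate for which
collective_safe() returns false corresponds to the situation reported above:
some processes enter the collective call and wait for the others, which
never arrive.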