[petsc-dev] Bug in MatShift_MPIAIJ ?

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Wed Oct 7 14:12:12 CDT 2015


Just one more piece of information:

In the optimized build of 3.6.2, the processes are waiting at different
points; the process with entries in the matrix is waiting here:

#007: /opt/openmpi-1.10/lib64/libmpi.so.12(PMPI_Allreduce+0x183) [0x7fa4e9081443]
#008: /opt/petsc-3.6.2_opt_openmpi-1.10_mkl_mt/lib/libpetsc.so.3.6(MatAssemblyEnd_MPIAIJ+0x627) [0x7fa4eb90757d]
#009: /opt/petsc-3.6.2_opt_openmpi-1.10_mkl_mt/lib/libpetsc.so.3.6(MatAssemblyEnd+0xf6) [0x7fa4eb97733d]
#010: /opt/petsc-3.6.2_opt_openmpi-1.10_mkl_mt/lib/libpetsc.so.3.6(MatShift_Basic+0x1e5) [0x7fa4eb6de435]
#011: /opt/petsc-3.6.2_opt_openmpi-1.10_mkl_mt/lib/libpetsc.so.3.6(MatShift_MPIAIJ+0xf6) [0x7fa4eb90da45]
#012: /opt/petsc-3.6.2_opt_openmpi-1.10_mkl_mt/lib/libpetsc.so.3.6(MatShift+0x9e) [0x7fa4eb6dbf94]


while the other process continued...
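
For illustration, here is a minimal sketch in plain C (no MPI, no PETSc) of why a rank-local guard in front of a collective call can produce this kind of hang. It assumes the guard is the `!aij->nz && !bij->nz` test quoted further down in this thread; the `RankState` type and all function names are hypothetical and exist only for this sketch. The ranks are simulated as array entries, so this models the decision logic only, not the actual PETSc code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-rank state: stored nonzeros in the diagonal (A)
 * and off-diagonal (B) blocks owned by one simulated rank. */
typedef struct {
    int diag_nz;
    int offdiag_nz;
} RankState;

/* The rank-local guard quoted in the thread: true only when this
 * rank holds no entries at all. */
bool rank_takes_collective_branch(RankState s)
{
    return !s.diag_nz && !s.offdiag_nz;
}

/* Count how many simulated ranks would enter the guarded branch. */
size_t ranks_entering(const RankState *ranks, size_t nranks)
{
    size_t n = 0;
    assert(ranks != NULL || nranks == 0);
    for (size_t i = 0; i < nranks; ++i) {
        if (rank_takes_collective_branch(ranks[i])) {
            ++n;
        }
    }
    return n;
}

/* A collective completes only if every rank enters it or none does.
 * A partial count means the ranks that entered wait forever. */
bool would_deadlock(const RankState *ranks, size_t nranks)
{
    size_t n = ranks_entering(ranks, nranks);
    return n != 0 && n != nranks;
}
```

With two simulated ranks where one holds entries and the other holds none, `would_deadlock` returns true. When the guard evaluates the same way on every rank, it returns false.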

Thanks again!

Eric

On 07/10/15 03:02 PM, Eric Chamberland wrote:
> Hi Barry,
>
> just compiled/tested with 3.6.2.
>
> On a 2-process example, the non-debug version hangs silently and
> indefinitely, but in debug mode it now aborts when calling MatShift when
> a process has no entries:
>
> #0  0x00007fffdbc6c065 in raise () from /lib64/libc.so.6
> #1  0x00007fffdbc6d4e8 in abort () from /lib64/libc.so.6
> #2  0x00007fffe2aea789 in PetscTraceBackErrorHandler (comm=0x22d9630,
> line=5264, fun=0x7fffe463b74f <__func__.21315> "MatSetOption",
> file=0x7fffe46384b8
> "/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62,
> p=PETSC_ERROR_INITIAL, mess=0x7ffffffd1100 "Enum value must be same on
> all processes, argument # 2", ctx=0x0) at
> /home/mefpp_ericc/petsc-3.6.2/src/sys/error/errtrace.c:243
> #3  0x00007fffe2ae53ae in PetscError (comm=0x22d9630, line=5264,
> func=0x7fffe463b74f <__func__.21315> "MatSetOption", file=0x7fffe46384b8
> "/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62,
> p=PETSC_ERROR_INITIAL, mess=0x7fffe4639b78 "Enum value must be same on
> all processes, argument # %d") at
> /home/mefpp_ericc/petsc-3.6.2/src/sys/error/err.c:377
> #4  0x00007fffe33eeae3 in MatSetOption (mat=0x2b06d00,
> op=MAT_NO_OFF_PROC_ENTRIES, flg=PETSC_FALSE) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c:5264
> #5  0x00007fffe2e4126e in MatShift_Basic (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/utils/gcreate.c:22
> #6  0x00007fffe330de85 in MatShift_MPIAIJ (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/impls/aij/mpi/mpiaij.c:2614
> #7  0x00007fffe2e3b1b4 in MatShift (Y=0x2b06d00, a=1) at
> /home/mefpp_ericc/petsc-3.6.2/src/mat/utils/axpy.c:171
>
> Is it possible for us to apply only the "missing" patch that you
> mentioned (the one that breaks the ABI...), which may make PETSc usable
> for us?
>
> Thanks!
>
> Eric
>
>
> On 15/08/15 01:34 PM, Barry Smith wrote:
>>
>>    I have merged the bug fix for both into master.
>>
>>    I have merged the bug fix for MatShift() into maint. Due to Jed's
>> concern about "breaking the ABI" in the release I have not merged the
>> error checker fix for the possible deadlock with ->nonew having
>> different values on different processes into maint.
>>
>>     Barry
>>
>>> On Aug 14, 2015, at 6:16 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>>    Patrick
>>>
>>>      Thanks for reporting these two bugs.
>>>
>>>> On Aug 12, 2015, at 9:40 AM, Patrick Lacasse
>>>> <patrick.m.lacasse at gmail.com> wrote:
>>>>
>>>> I have a deadlock in MatAssemblyEnd_MPIAIJ.
>>>> It uses nonew as if it were collective:
>>>> if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
>>>>     ierr = MPI_Allreduce( ...
>>>>
>>>> This is a good optimization for me; however, this value is set by
>>>> MatSetOption(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
>>>> and this enum value is negative:
>>>> MAT_NEW_NONZERO_ALLOCATION_ERR = -2
>>>> so MatSetOption does not assert that it is collective.
>>>>
>>>> Some of my procs have nonew==0 and one has nonew==-2.
>>>> I'm not sure why adding a new nonzero should be an error only for some
>>>> procs.
>>>> I suggest that this error flag be set collectively, which means
>>>> changing the value of the enum to positive (patch 0001).
>>>
>>> Fixed in branches barry/fix-nonew-notcollective/maint and commit
>>> 60bf598 not yet merged into next because I get a conflict with
>>> something Jed put in a long time ago that has no branch (Jed see
>>> other email).
>>>
>>>
>>> MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR must
>>> be collective because they change the value of ->nonew, which is
>>> used to decide whether some MPI_Allreduce() calls are made. Thus with
>>> different values on different processes the code could hang.
>>>
>>> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>>>
>>>>
>>>> My real problem was caused by MatShift_MPIAIJ and the way it
>>>> determines whether the matrix needs to be preallocated:
>>>> if (!aij->nz && !bij->nz)
>>>> The result can be true for some procs (those with no local rows)
>>>> and false for other procs.
>>>> I suggest using Y->preallocated instead (patch 0002).
>>>
>>> Fixed in branches barry/fix-matshift/maint and next and commit
>>> 6f33a89  will merge into maint and master after testing.
>>>
>>> MatShift_MPI/SeqXAIJ() could hang if some processes had no entries
>>> while others had entries,
>>> because some processes would attempt a parallel preallocation and the
>>> others would not.
>>>
>>> Fixed by first checking whether preallocation was done at all and, if
>>> it was not, doing it. Otherwise preallocation is only done,
>>> if appropriate, by each process on the diagonal block portion of the
>>> matrix, thus not requiring all processes
>>> that share the matrix to call the parallel preallocation routine.
>>> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>>>
>>>>
>>>
>>>> thanks,
>>>>
>>>> Patrick Lacasse
>>>>
>>>>
>>>>
>>>>
>>>> <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>
>>>>
>>>
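
As a rough illustration of the ->nonew problem discussed above, the following plain-C sketch (no MPI, no PETSc) simulates ranks as array entries. It assumes the guard is the `if (!nonew)` test quoted in the thread and uses the values 0 and -2 reported there. The function names and the `SKETCH_ALLOCATION_ERR` constant are hypothetical. The min/max comparison is only a stand-in for what a reduction across ranks could compute; it is not a claim about how PETSc implements its own collective checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value reported in the thread for the ranks that set the option. */
enum { SKETCH_ALLOCATION_ERR = -2 };

/* Guard quoted in the thread: the reduction is called only when
 * this rank's nonew is zero. */
bool rank_calls_allreduce(int nonew)
{
    return !nonew;
}

/* Stand-in for a MIN and MAX reduction over all ranks: the value is
 * the same everywhere exactly when the minimum equals the maximum. */
bool same_on_all_ranks(const int *nonew, size_t nranks)
{
    assert(nonew != NULL && nranks > 0);
    int lo = nonew[0];
    int hi = nonew[0];
    for (size_t i = 1; i < nranks; ++i) {
        if (nonew[i] < lo) lo = nonew[i];
        if (nonew[i] > hi) hi = nonew[i];
    }
    return lo == hi;
}

/* Number of simulated ranks that would reach the reduction. */
size_t ranks_in_allreduce(const int *nonew, size_t nranks)
{
    size_t n = 0;
    for (size_t i = 0; i < nranks; ++i) {
        if (rank_calls_allreduce(nonew[i])) ++n;
    }
    return n;
}
```

For the reported case of several ranks with nonew==0 and one with nonew==-2, `same_on_all_ranks` returns false and only some of the ranks reach the reduction, which is the hang. If every rank holds the same value, either all ranks reach the reduction or none do.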