[petsc-dev] Bug in MatShift_MPIAIJ ?

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Wed Oct 7 14:02:38 CDT 2015


Hi Barry,

just compiled/tested with 3.6.2.

On a 2-process example, the non-debug build hangs silently and 
indefinitely, but in debug mode it now aborts when calling MatShift when 
a process has no entries:

#0  0x00007fffdbc6c065 in raise () from /lib64/libc.so.6
#1  0x00007fffdbc6d4e8 in abort () from /lib64/libc.so.6
#2  0x00007fffe2aea789 in PetscTraceBackErrorHandler (comm=0x22d9630, 
line=5264, fun=0x7fffe463b74f <__func__.21315> "MatSetOption", 
file=0x7fffe46384b8 
"/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62, 
p=PETSC_ERROR_INITIAL, mess=0x7ffffffd1100 "Enum value must be same on 
all processes, argument # 2", ctx=0x0) at 
/home/mefpp_ericc/petsc-3.6.2/src/sys/error/errtrace.c:243
#3  0x00007fffe2ae53ae in PetscError (comm=0x22d9630, line=5264, 
func=0x7fffe463b74f <__func__.21315> "MatSetOption", file=0x7fffe46384b8 
"/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c", n=62, 
p=PETSC_ERROR_INITIAL, mess=0x7fffe4639b78 "Enum value must be same on 
all processes, argument # %d") at 
/home/mefpp_ericc/petsc-3.6.2/src/sys/error/err.c:377
#4  0x00007fffe33eeae3 in MatSetOption (mat=0x2b06d00, 
op=MAT_NO_OFF_PROC_ENTRIES, flg=PETSC_FALSE) at 
/home/mefpp_ericc/petsc-3.6.2/src/mat/interface/matrix.c:5264
#5  0x00007fffe2e4126e in MatShift_Basic (Y=0x2b06d00, a=1) at 
/home/mefpp_ericc/petsc-3.6.2/src/mat/utils/gcreate.c:22
#6  0x00007fffe330de85 in MatShift_MPIAIJ (Y=0x2b06d00, a=1) at 
/home/mefpp_ericc/petsc-3.6.2/src/mat/impls/aij/mpi/mpiaij.c:2614
#7  0x00007fffe2e3b1b4 in MatShift (Y=0x2b06d00, a=1) at 
/home/mefpp_ericc/petsc-3.6.2/src/mat/utils/axpy.c:171

Would it be possible for us to apply only the "missing" patch that you 
mentioned (the one that breaks the ABI)? It might make PETSc usable 
for us.

Thanks!

Eric


On 15/08/15 01:34 PM, Barry Smith wrote:
>
>    I have merged the bug fix for both into master.
>
>    I have merged the bug fix for MatShift() into maint. Due to Jed's concern about "breaking the ABI" in the release I have not merged the error checker fix for the possible deadlock with ->nonew having different values on different processes into maint.
>
>     Barry
>
>> On Aug 14, 2015, at 6:16 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>
>>    Patrick
>>
>>      Thanks for reporting these two bugs.
>>
>>> On Aug 12, 2015, at 9:40 AM, Patrick Lacasse <patrick.m.lacasse at gmail.com> wrote:
>>>
>>> I have a deadlock in MatAssemblyEnd_MPIAIJ.
>>> It treats nonew as collective (the same on all processes):
>>> if (!((Mat_SeqAIJ*)aij->B->data)->nonew) {
>>>     ierr = MPI_Allreduce( ...
>>>
>>> This is a good optimization for me; however, this value is set by
>>> MatSetOption(,MAT_NEW_NONZERO_ALLOCATION_ERR,)
>>> and this enum is negative:
>>> MAT_NEW_NONZERO_ALLOCATION_ERR = -2
>>> so it is not checked as collective by MatSetOption.
>>>
>>> Some of my procs have nonew==0 and one has nonew==-2.
>>> I don't see why adding a new nonzero should be an error on only some procs.
>>> I suggest that this error flag be set collectively, which means changing the enum to a positive value (patch 0001).
>>
>> Fixed in branch barry/fix-nonew-notcollective/maint, commit 60bf598. Not yet merged into next because I get a conflict with something Jed put in a long time ago that has no branch (Jed, see other email).
>>
>>
>> MAT_NEW_NONZERO_LOCATION_ERR and MAT_NEW_NONZERO_ALLOCATION_ERR must be collective because they change the value of ->nonew, which is
>> used to decide whether some MPI_Allreduce() calls are made. Thus, with different values on different processes, the code could hang.
>>
>> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>>
>>>
>>> My real problem was caused by MatShift_MPIAIJ and the way it determines whether the matrix needs to be preallocated:
>>> if (!aij->nz && !bij->nz)
>>> The result can be true for some procs (those with no local rows)
>>> and false for other procs.
>>> I suggest using Y->preallocated instead (patch 0002).
>>
>> Fixed in branches barry/fix-matshift/maint and next, commit 6f33a89. Will merge into maint and master after testing.
>>
>> MatShift_MPI/SeqXAIJ() could hang if some processes had no local entries while others had entries,
>> because some processes would attempt a parallel preallocation and the others would not.
>>
>> Fixed by first checking whether any preallocation was done and, if not, doing it. Otherwise preallocation is only done
>> if appropriate by each process on the diagonal block portion of the matrix, thus not requiring all processes
>> that share the matrix to call the parallel preallocation routine.
>>
>> Reported-by: Patrick Lacasse <patrick.m.lacasse at gmail.com>
>>
>>>
>>
>>> thanks,
>>>
>>> Patrick Lacasse
>>>
>>>
>>>
>>>
>>> <0001-MAT_NEW_NONZERO_-DE-LOCATION_ERR-are-collective.patch><0002-Dead-lock-bug-in-MatShift_MPIAIJ.patch>
>>
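
Both reports in the thread above come down to the same pattern: a collective call guarded by a condition that can differ from rank to rank. The following is only a minimal sketch of that pattern in plain C, with no MPI or PETSc calls. The RankState struct and all three function names are hypothetical, invented for illustration. The sketch also assumes that a flag like Y->preallocated holds the same value on every rank, which is the premise behind the suggestion in patch 0002.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one rank's view of a parallel matrix.
 * This is not a PETSc type; the fields only mirror the names in the thread. */
typedef struct {
    int  diag_nz;      /* nonzeros in the local diagonal block            */
    int  offdiag_nz;   /* nonzeros in the local off-diagonal block        */
    bool preallocated; /* assumed to hold the same value on every rank    */
} RankState;

/* Rank-local test, in the style of `!aij->nz && !bij->nz`:
 * the answer depends on how many entries this rank happens to own. */
static bool enters_collective_local(const RankState *r)
{
    return r->diag_nz == 0 && r->offdiag_nz == 0;
}

/* Test on a flag assumed identical everywhere, in the style of
 * `Y->preallocated`: every rank reaches the same answer. */
static bool enters_collective_global(const RankState *r)
{
    return !r->preallocated;
}

/* Returns true when all ranks make the same decision, so the guarded
 * collective call is entered by everyone or by no one. A false result
 * marks the situation in which a real parallel run could hang. */
static bool decision_is_uniform(const RankState *ranks, size_t n,
                                bool (*decide)(const RankState *))
{
    assert(ranks != NULL && n > 0);
    bool first = decide(&ranks[0]);
    for (size_t i = 1; i < n; ++i) {
        if (decide(&ranks[i]) != first) {
            return false;
        }
    }
    return true;
}
```

For a two-rank case where one rank owns entries and the other owns none, decision_is_uniform reports a mismatch for the rank-local test and agreement for the flag-based test. A real fix still has to make sure the flag itself is set collectively.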



