[petsc-dev] Errors in MatSetValuesBlockedLocal() when using AIJ, but not BAIJ!
Smith, Barry F.
bsmith at mcs.anl.gov
Thu Feb 21 23:41:25 CST 2019
> On Feb 21, 2019, at 8:28 PM, Mills, Richard Tran <rtmills at anl.gov> wrote:
>
> On 2/21/19 1:48 PM, Smith, Barry F. wrote:
>>
>>> On Feb 20, 2019, at 12:50 PM, Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov>
>>> wrote:
>>>
>>> Folks,
>>>
>>> I'm working with some PFLOTRAN examples, and I've been scratching my head for a while at errors that are being generated in MatSetValuesBlockedLocal(), related to new nonzeros requiring a malloc. E.g.,
>>>
>>> [0]PETSC ERROR: Argument out of range
>>> [0]PETSC ERROR: New nonzero at (2,0) caused a malloc
>>>
>>> The strange thing is that all of these examples run fine when I use the PFLOTRAN default BAIJ matrix type (with blocks column-oriented) -- I only see the issue when I switch to using AIJ. If I set the matrix option MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE to allow things to run with AIJ, everything seems to work perfectly. (And once the Jacobian matrix is constructed the first time, the memory usage is not growing according to what I see from -malloc_log.)
>>>
>>> I've spent some time poking around with a debugger in the variants of MatSetValues() and MatXAIJSetPreallocation(), and I haven't been able make much progress in determining why I'm seeing this in AIJ vs. BAIJ -- it's easy to get confused looking at this code, and it's going to take me more time to follow what is going on than I've had to spare so far. So, I've got a few questions. First, should it even be possible to see what I am seeing? That is, if the MatSetValuesBlocked() routine is not causing new allocations when using BAIJ, should this be possible with AIJ?
>>>
>>
>> Sure, because the preallocation for BAIJ matrix is by block, while for AIJ it is by point (even if you provide a block size to AIJ). So long as the new entry is within an allocated block it will not trigger an error with BAIJ but may with AIJ.
>>
> Hmm. It seems that I've misunderstood what MatXAIJSetPreallocation() is supposed to do. I thought that if I pass in a blocksize > 1, MatXAIJSetPreallocation() will turn the block-row preallocation into the equivalent scalar-row stuff. This seems to be what is happening in the code, unless I'm reading this wrong:
>
> [...]
> 304: } else { /* Convert block-row precallocation to scalar-row */
> 305: PetscInt i,m,*sdnnz,*sonnz;
> 306: MatGetLocalSize(A,&m,NULL);
> 307: PetscMalloc2((!!dnnz)*m,&sdnnz,(!!onnz)*m,&sonnz);
> 308: for (i=0; i<m; i++) {
> 309: if (dnnz) sdnnz[i] = dnnz[i/bs] * bs;
> 310: if (onnz) sonnz[i] = onnz[i/bs] * bs;
> 311: }
> 312: MatSeqAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL);
> 313: MatMPIAIJSetPreallocation(A,0,dnnz ? sdnnz : NULL,0,onnz ? sonnz : NULL);
> [...]
I think your understanding above is correct.
>
> Though might something be happening later that reduces the amount of nonzeros allocated, since the "dense" blocks of the matrix actually will have zero entries that never get a value set?
>
> Something does appear to be screwy when I look at the "-snes_view" output for a time-dependent problem where I see these errors. When using BAIJ, at every time step I see
>
> Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes
> type: seqbaij
> rows=144, cols=144, bs=3
> total: nonzeros=2304, allocated nonzeros=2304
> total number of mallocs used during MatSetValues calls =0
> block size is 3
>
> but when I switch to AIJ, I initially see
>
> Mat Object: Mat_0x84000000_0 (flow_) 1 MPI processes
> type: seqaij
> rows=144, cols=144, bs=3
> total: nonzeros=1584, allocated nonzeros=3324
> total number of mallocs used during MatSetValues calls =68
> using I-node routines: found 96 nodes, limit used is 5
>
> but as things progress, the allocated nonzeros and mallocs count keeps growing; by step 43 I'm seeing
>
> total: nonzeros=1584, allocated nonzeros=115524
> total number of mallocs used during MatSetValues calls =7548
>
> If I dump the matrices and look at them in something like Octave, the matrices from the BAIJ and AIJ runs appear identical within round-off error, and we certainly aren't adding any entries at each step. Why should PETSc keep calling malloc() at each MatSetValues() call, then? I suspect a bug somewhere in the AIJ case call sequence of MatSetValuesBlockedLocal() --> MatSetValuesBlocked() --> MatSetValues() --> MatSetValues_SeqAIJ(). I have trouble following all of the logic in matsetvaluesseqaij_(), though, so I guess I'm going to have try harder to understand why MatSeqXAIJReallocateAIJ() is being hit there.
>
> Any educated guesses about what might be going wrong would be appreciated. There is a lot of PETSc code I'm not familiar with involved here.
Suggest a breakpoint in the MatSetValues_SeqAIJ location where the malloc() is done to determine why the mallocs keep getting called.
Barry
>
> Cheers,
> Richard
>
>
>>
>> Barry
>>
>>
>>> (Clearly it *is* possible, but should it be?) I'm still trying to figure out if there is something wrong with what PFLOTRAN is doing, vs. something going wrong somewhere inside PETSc.
>>>
>>> Any hints about what to look at from someone with more familiarity with this code would be appreciated.
>>>
>>> --Richard
>>>
>
More information about the petsc-dev
mailing list