[petsc-users] MatMultTranspose memory usage

Mills, Richard Tran rtmills at anl.gov
Thu Jul 18 14:35:57 CDT 2019

Hi Kun and Karl,

If you are using the AIJMKL matrix types and have a recent version of MKL, the AIJMKL code uses MKL's inspector-executor sparse BLAS routines, which are described in Intel's MKL documentation.


The inspector-executor analysis routines take the AIJ (compressed sparse row) format data from PETSc and then create a copy in an optimized, internal layout used by MKL. We have to keep PETSc's own AIJ representation around, as it is needed for several operations that MKL does not provide. This does, unfortunately, mean that roughly double (or more, depending on what MKL decides to do) the amount of memory is required. The reason you see the memory usage increase right when a MatMult() or MatMultTranspose() operation occurs is that we default to a "lazy" approach that defers calling the analysis routine (mkl_sparse_optimize()) until an operation that uses an MKL-provided kernel is requested. (You can use an "eager" approach that calls mkl_sparse_optimize() during MatAssemblyEnd() by specifying "-mat_aijmkl_eager_inspection" in the PETSc options.)
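
To make the lazy-versus-eager distinction concrete, here is a toy model in plain C. It is only a sketch of the bookkeeping, not PETSc or MKL code: ToyMat, toy_assemble(), toy_inspect(), toy_mult(), toy_bytes(), and toy_destroy() are hypothetical names invented for illustration, and the "optimized" copy is assumed to be a plain duplicate of the CSR values, whereas the real MKL layout and its size are opaque.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an AIJMKL-like matrix: the CSR values plus an optional
 * second copy that plays the role of MKL's internal optimized layout. */
typedef struct {
    size_t  nnz;       /* number of stored nonzeros */
    double *csr_vals;  /* the representation PETSc would keep (AIJ) */
    double *opt_vals;  /* hypothetical optimized copy; NULL until inspected */
} ToyMat;

/* Stand-in for the analysis step: allocates and fills the second copy. */
int toy_inspect(ToyMat *m)
{
    if (m->opt_vals) return 0;  /* already inspected, nothing to do */
    m->opt_vals = malloc(m->nnz * sizeof(double));
    if (!m->opt_vals) return -1;
    memcpy(m->opt_vals, m->csr_vals, m->nnz * sizeof(double));
    return 0;
}

/* Assembly: the eager flag decides whether the copy is made now or later. */
int toy_assemble(ToyMat *m, const double *vals, size_t nnz, int eager)
{
    assert(m && vals && nnz > 0);
    m->nnz = nnz;
    m->opt_vals = NULL;
    m->csr_vals = malloc(nnz * sizeof(double));
    if (!m->csr_vals) return -1;
    memcpy(m->csr_vals, vals, nnz * sizeof(double));
    return eager ? toy_inspect(m) : 0;
}

/* A kernel that wants the optimized layout triggers inspection on first
 * use. The arithmetic itself is omitted from this sketch. */
int toy_mult(ToyMat *m)
{
    return toy_inspect(m);
}

/* Bytes held for matrix values: doubles once the second copy exists. */
size_t toy_bytes(const ToyMat *m)
{
    return m->nnz * sizeof(double) * (m->opt_vals ? 2 : 1);
}

void toy_destroy(ToyMat *m)
{
    free(m->csr_vals);
    free(m->opt_vals);
    m->csr_vals = NULL;
    m->opt_vals = NULL;
    m->nnz = 0;
}
```

With eager set to 0, toy_bytes() reports one copy after toy_assemble() and two after the first toy_mult(), which mirrors the jump in resident set size reported in this thread. With eager set to 1, the second copy appears at assembly time instead; the total is the same either way.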

If memory is at enough of a premium for you that you can't afford the extra copy used by the MKL inspector-executor routines, then I suggest using the usual PETSc AIJ format instead of AIJMKL. AIJ is fairly well optimized for many cases (and even has some hand-optimized kernels using Intel AVX/AVX2/AVX-512 intrinsics) and often outperforms AIJMKL. You should try both AIJ and AIJMKL, anyway, to see which is faster for your combination of problem and computing platform.

Best regards,

On 7/17/19 8:46 PM, Karl Lin via petsc-users wrote:
We also found that if we use MatCreateSeqAIJ, memory no longer increases with matrix-vector multiplication. However, with MatCreateMPIAIJMKL, the memory increase occurs consistently.

On Wed, Jul 17, 2019 at 5:26 PM Karl Lin <karl.linkui at gmail.com> wrote:

Parallel and sequential runs exhibit the same behavior. In fact, we found that calling MatMult also increases memory usage by the size of the matrix.

On Wed, Jul 17, 2019 at 4:55 PM Zhang, Hong <hzhang at mcs.anl.gov> wrote:
What matrix format do you use? Do you run it in parallel or sequentially?

We used /proc/self/stat to track the resident set size during the program run, and we saw the resident set size jump by the size of the matrix right after we called MatMultTranspose.
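
A minimal sketch of this kind of measurement in C follows. It is not the code used in this thread: read_rss_bytes() is a hypothetical helper, and it assumes a Linux system where field 24 of /proc/self/stat holds the resident set size in pages, as documented in proc(5).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns the resident set size in bytes, or -1 if it cannot be read.
 * Linux-specific: field 24 of /proc/self/stat is the RSS in pages. */
long read_rss_bytes(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    assert(n < sizeof(buf));
    buf[n] = '\0';

    /* Field 2 (the command name) is parenthesized and may contain spaces,
     * so start counting after the last ')'. */
    char *p = strrchr(buf, ')');
    if (!p) return -1;
    p++;

    /* The first token after ')' is field 3; walk forward to field 24. */
    for (int field = 3; field <= 24; field++) {
        while (*p == ' ') p++;
        if (*p == '\0') return -1;
        if (field == 24) {
            char *end;
            long rss_pages = strtol(p, &end, 10);
            if (end == p || rss_pages < 0) return -1;
            return rss_pages * sysconf(_SC_PAGESIZE);
        }
        while (*p != '\0' && *p != ' ') p++;  /* skip this token */
    }
    return -1;
}
```

Calling such a helper immediately before and after the first MatMult() or MatMultTranspose() and taking the difference gives the size of the jump. Resident set size counts only pages currently in physical memory, so the difference is an estimate rather than an exact allocation size.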

On Wed, Jul 17, 2019 at 12:04 PM hong--- via petsc-users <petsc-users at mcs.anl.gov> wrote:
How do you know 'MatMultTranspose creates an extra memory copy of matrix'?


I was using MatMultTranspose and MatMult to solve a linear system.

However, we found out that MatMultTranspose creates an extra memory copy of the matrix for its operation. This extra memory copy is not stated anywhere in the PETSc manual.

This basically doubles my memory requirement to solve my system.

I remember MKL's routine can do an in-place matrix-transpose-vector product, without transposing the matrix itself.

Is this always the case? Or is there a way to make PETSc do an in-place matrix-transpose-vector product?

Any help is greatly appreciated.
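
For reference, a transpose-vector product on CSR data does not in principle require forming the transpose. The sketch below shows the generic scatter-style algorithm in plain C. It is an illustration of the technique only, not PETSc's or MKL's actual kernel, and csr_mult_transpose() and its argument names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = A^T * x for an m-by-n matrix A stored in CSR form.
 * x has length m and y has length n. A is read row by row and is never
 * transposed or copied. */
void csr_mult_transpose(size_t m, size_t n,
                        const size_t *row_ptr, const size_t *col_idx,
                        const double *vals, const double *x, double *y)
{
    assert(row_ptr && col_idx && vals && x && y);

    for (size_t j = 0; j < n; j++) y[j] = 0.0;

    for (size_t i = 0; i < m; i++) {
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            /* Entry A(i, col_idx[k]) contributes to y[col_idx[k]]. */
            y[col_idx[k]] += vals[k] * x[i];
        }
    }
}
```

The only extra storage is the output vector. The trade-off is that writes to y are scattered by column index, which is less cache-friendly than the row-wise accumulation of an ordinary product.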


