[petsc-dev] problem with MatSeqAIJCUSPARSEILUAnalysisAndCopyToGPU

Mark Adams mfadams at lbl.gov
Tue Dec 22 15:38:09 CST 2020


I am doing MPI-serial LU solves of a smallish matrix (2D, Q3, 8K equations) on a
Summit node (42 P9 cores, 6 V100 GPUs), using both cuSparse and Kokkos Kernels.
The cuSparse performance is terrible.

I solve the same TS problem in MPI serial on each global process. I run
with either NP=1 or (all) 7 MPI ranks per GPU (one per core).

MatLUFactorNum time (sec), using all 6 GPUs:
NP/GPU   cuSparse   Kokkos Kernels
1        0.12       0.075
7        0.55       0.072    // some noise here

So cuSparse is about 2x slower with one process per GPU and about 8x slower when
using all the cores, which I assume is from memory contention.

I found that the problem is
in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix. Most of this excess time
is in:

      cerr = cudaMallocHost((void**) &AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);

and

      cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);
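
The timer data below comes from custom log events wrapped around these calls,
roughly along the lines of this sketch (the event variable names and the
registration point are illustrative, not the actual PETSc source):

/* Sketch: register per-call timers with PETSc's logging API; names are
 * illustrative only. */
#include <petscmat.h>

static PetscLogEvent ILULower_MallocHost_Event, ILULower_FreeHost_Event;

static PetscErrorCode RegisterILUTimers(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogEventRegister("BuildILULowerTriMatrix: cudaMallocHost", MAT_CLASSID, &ILULower_MallocHost_Event);CHKERRQ(ierr);
  ierr = PetscLogEventRegister("BuildILULowerTriMatrix: cudaFreeHost",   MAT_CLASSID, &ILULower_FreeHost_Event);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* ...and around the calls inside MatSeqAIJCUSPARSEBuildILULowerTriMatrix():
     ierr = PetscLogEventBegin(ILULower_MallocHost_Event,0,0,0,0);CHKERRQ(ierr);
     cerr = cudaMallocHost((void**) &AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);
     ierr = PetscLogEventEnd(ILULower_MallocHost_Event,0,0,0,0);CHKERRQ(ierr);
*/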

nzLower is about 140K. Here is my timer data, collected in a stage that runs
after a "warm up" stage:

   Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix      12 1.0 2.3514e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  23  0  0  0  0     0     0     12 1.34e+01    0 0.00e+00  0
   MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost      12 1.0 1.5448e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  15  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
   MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost      12 1.0 8.3908e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   8  0  0  0  0   0       0      0 0.00e+00    0 0.00e+00  0

This 0.23 sec shows up in the Upper version as well, for a total of ~0.46 sec,
which pretty much matches the difference with Kokkos Kernels.
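
For what it's worth, here is a minimal standalone sketch (not PETSc code; the
file name and build line are just for illustration) of the kind of
microbenchmark that should confirm whether pinned host allocation/free at this
size really accounts for the time:

/* pinned_bench.cu -- time repeated cudaMallocHost/cudaFreeHost for ~140K
 * doubles, about the nzLower reported above.
 * Build (assumed): nvcc -O2 pinned_bench.cu -o pinned_bench */
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double wtime(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9*(double)ts.tv_nsec;
}

int main(void)
{
  const size_t nz   = 140000;  /* ~nzLower from the log above        */
  const int    reps = 12;      /* matches the 12 calls in the log    */
  double      *buf  = NULL;
  double       t0, tmalloc = 0.0, tfree = 0.0;

  /* Warm up: create the CUDA context so it is not charged to the loop. */
  cudaFree(0);

  for (int i = 0; i < reps; i++) {
    t0 = wtime();
    if (cudaMallocHost((void**)&buf, nz*sizeof(double)) != cudaSuccess) return 1;
    tmalloc += wtime() - t0;

    t0 = wtime();
    if (cudaFreeHost(buf) != cudaSuccess) return 1;
    tfree += wtime() - t0;
  }
  printf("cudaMallocHost: %g s total, cudaFreeHost: %g s total (%d reps)\n",
         tmalloc, tfree, reps);
  return 0;
}

Running it with 1 and then 7 ranks sharing a GPU should also show whether the
pinned allocations explain the contention seen in the table above.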

Any ideas?

Thanks,
Mark