[petsc-dev] problem with MatSeqAIJCUSPARSEILUAnalysisAndCopyToGPU
Mark Adams
mfadams at lbl.gov
Tue Dec 22 15:38:09 CST 2020
I am doing MPI-serial LU solves of a smallish matrix (2D, Q3, 8K equations) on a
Summit node (42 P9 cores, 6 V100 GPUs) using cuSparse and the Kokkos kernels.
The cuSparse performance is terrible.
I solve the same TS problem, MPI serial, on each global process, and I run
with either 1 or (all) 7 MPI processes per GPU.
MatLUFactorNum time (sec), using all 6 GPUs:

  NP/GPU   cuSparse   Kokkos kernels
  1        0.12       0.075
  7        0.55       0.072    // some noise here
So cuSparse is about 2x slower with one process per GPU and about 8x slower when
using all the cores, from memory contention I assume.
I found that the problem is
in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix. Most of this excess time
is in:
  cerr = cudaMallocHost((void**)&AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);
and
  cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);
nzLower is about 140K. Here is my timer data, in a stage after a "warm up
stage":
Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix            12 1.0 2.3514e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3 0 0 0 0 23 0 0 0 0   0   0  12 1.34e+01  0 0.00e+00  0
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost  12 1.0 1.5448e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2 0 0 0 0 15 0 0 0 0   0   0   0 0.00e+00  0 0.00e+00  0
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost    12 1.0 8.3908e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1 0 0 0 0  8 0 0 0 0   0   0   0 0.00e+00  0 0.00e+00  0
This 0.23 sec happens in the Upper version as well, for a total of ~0.46 sec, which
pretty much matches the difference with the Kokkos kernels.
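To isolate this, a standalone loop like the one below could time the same pinned
alloc/free pattern by itself (just a sketch, not PETSc code; it assumes a default
double-precision build, so PetscScalar is double, and reuses the ~140K size and
12 calls from the log above):

/* Sketch: time repeated cudaMallocHost/cudaFreeHost of a ~140K-entry buffer,
 * mimicking what MatSeqAIJCUSPARSEBuildILULowerTriMatrix does on every
 * numeric factorization. Sizes/counts are taken from the log above;
 * PetscScalar is assumed to be double (default build). */
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9*(double)ts.tv_nsec;
}

int main(void)
{
  const size_t nzLower = 140000;                  /* ~nzLower from the report */
  const size_t bytes   = nzLower*sizeof(double);
  const int    reps    = 12;                      /* 12 calls in the log      */
  double      *AALo;
  double       t0, t1;

  if (cudaFree(0) != cudaSuccess) return 1;       /* warm up the CUDA context */

  t0 = now();
  for (int i = 0; i < reps; i++) {
    if (cudaMallocHost((void**)&AALo, bytes) != cudaSuccess) return 1;
    if (cudaFreeHost(AALo) != cudaSuccess) return 1;
  }
  t1 = now();
  printf("%d pinned alloc/free pairs of %zu bytes: %g sec\n", reps, bytes, t1 - t0);
  return 0;
}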
Any ideas?
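One direction I can think of (sketched below; the struct and helper names are made
up for illustration, not the actual PETSc data structures) is to cache the pinned
staging buffer across factorizations and only grow it when needed, so the
alloc/free cost is paid once instead of on every MatLUFactorNum:

/* Sketch only: keep one pinned staging buffer alive and reuse it across
 * factorizations, instead of cudaMallocHost/cudaFreeHost on every call.
 * The Staging struct and helper names are hypothetical, not PETSc's. */
#include <stdio.h>
#include <cuda_runtime.h>

typedef struct {
  double *buf;    /* cached pinned host buffer      */
  size_t  bytes;  /* current capacity of the buffer */
} Staging;

/* Hand back a pinned buffer of at least 'need' bytes, growing only when required. */
static cudaError_t StagingGet(Staging *s, size_t need, double **out)
{
  if (need > s->bytes) {
    if (s->buf) {
      cudaError_t e = cudaFreeHost(s->buf);
      if (e != cudaSuccess) return e;
    }
    cudaError_t e = cudaMallocHost((void**)&s->buf, need);
    if (e != cudaSuccess) { s->buf = NULL; s->bytes = 0; return e; }
    s->bytes = need;
  }
  *out = s->buf;
  return cudaSuccess;
}

/* Release the cached buffer once, e.g. when the factor matrix is destroyed. */
static cudaError_t StagingDestroy(Staging *s)
{
  cudaError_t e = s->buf ? cudaFreeHost(s->buf) : cudaSuccess;
  s->buf = NULL; s->bytes = 0;
  return e;
}

int main(void)
{
  Staging s = {NULL, 0};
  double *AALo;

  /* 12 "factorizations": the pinned buffer is allocated once and reused. */
  for (int i = 0; i < 12; i++) {
    if (StagingGet(&s, 140000*sizeof(double), &AALo) != cudaSuccess) return 1;
    /* ... pack the lower-triangular values into AALo and copy to the GPU ... */
  }
  if (StagingDestroy(&s) != cudaSuccess) return 1;
  printf("reused one pinned buffer for 12 factorizations\n");
  return 0;
}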
Thanks,
Mark