[petsc-users] PETSc GPU MatMatMult performance question
Rohan Yadav
rohany at alumni.cmu.edu
Wed Feb 2 23:11:33 CST 2022
Hi All,
I'm trying to use the MatMatMult function with one sparse matrix B and two dense matrices A and C, computing A = B * C.
My code is below:
```
void spmm(Mat B, int warmup, int niter) {
  Mat A, C;
  PetscInt i, j = 32, k;
  MatGetSize(B, &i, &k);
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, i, j, NULL, &A);
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, k, j, NULL, &C);

  // Initialize entries in the output A and the dense input C.
  MatZeroEntries(A);
  setMatToConstant(C, 1.0);

  // Finally, do the computation.
  auto avgTime = benchmarkWithWarmup(warmup, niter, [&]() {
    MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);
  });
  PetscPrintf(PETSC_COMM_WORLD, "Average time: %lf ms.\n", avgTime * 1000);
}
```
where `benchmarkWithWarmup` is a simple wrapper function that runs the lambda several times and returns the average time per iteration.
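For reference, `benchmarkWithWarmup` isn't anything fancy; a minimal sketch of the kind of helper I mean (the exact implementation may differ) is:
```
// Minimal sketch of the timing helper used above (illustrative only).
// It assumes the work launched by the lambda has completed by the time the
// lambda returns; otherwise a device synchronization would be needed here.
#include <chrono>

template <typename F>
double benchmarkWithWarmup(int warmup, int niter, F &&f) {
  for (int i = 0; i < warmup; i++) f();            // untimed warmup iterations
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < niter; i++) f();             // timed iterations
  auto end = std::chrono::steady_clock::now();
  // Average seconds per iteration.
  return std::chrono::duration<double>(end - start).count() / niter;
}
```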
I'm running this function with the arguments `-vec_type cuda -mat_type aijcusparse`, and the performance is relatively slow. I'm wondering whether I'm using the API incorrectly or whether the computation is executing as expected (a small runtime type check is sketched after the profile below). `nvprof` shows that much of the time is spent in device-to-host memcpys:
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   87.32%  11.9978s     33  363.57ms  1.5040us  388.26ms  [CUDA memcpy DtoH]
                    8.71%  1.19611s     30  39.870ms  37.421ms  39.976ms  void cusparse::csrmm_kernel<cusparse::CsrMMPolicy<...>, int, double, double, double>(...)
                    3.87%  531.56ms     14  37.968ms  1.0240us  227.29ms  [CUDA memcpy HtoD]
                    0.07%  9.7452ms      6  1.6242ms  1.0880us  3.2481ms  [CUDA memset]
                    0.02%  2.8727ms      1  2.8727ms  2.8727ms  2.8727ms  void thrust::cuda_cub::core::_kernel_agent<..., __uninitialized_fill::functor<thrust::device_ptr<double>, double>, ...>(...)
                    0.01%  1.4953ms      2  747.67us  56.188us  1.4392ms  void thrust::cuda_cub::core::_kernel_agent<..., __uninitialized_fill::functor<thrust::device_ptr<int>, int>, ...>(...)
```
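As a sanity check on whether I'm using the API incorrectly, a quick runtime query of the operand types (not in the code above, just something I could add; the variable names match my `spmm` function) would look like:
```
// Hypothetical check, not in my benchmark: print the runtime types of the
// operands to confirm that B picked up -mat_type aijcusparse and that the
// dense matrices are the CUDA dense type.
MatType btype, atype, ctype;
MatGetType(B, &btype);
MatGetType(A, &atype);
MatGetType(C, &ctype);
PetscPrintf(PETSC_COMM_WORLD, "B: %s, A: %s, C: %s\n", btype, atype, ctype);
```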
The `-log_view` output is:
```
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 processors, by yadav2 Wed Feb 2 17:23:19 2022
Using Petsc Release Version 3.16.3, unknown
                         Max       Max/Min     Avg       Total
Time (sec):           1.163e+02     1.000   1.163e+02
Objects:              4.800e+01     1.000   4.800e+01
Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
MPI Reductions:       8.100e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01 100.0%  7.267e+07     100.0%  6.300e+01  77.8%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase            %F - percent flop in this phase
      %M - percent messages in this phase        %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSided          2 1.0 4.4400e-01 567.5    0.00e+00 0.0 2.0e+00  4.0e+00 2.0e+00  0  0  3   0  2   0  0  3   0  3     0       0      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF         1 1.0 4.4395e-01 15659.1  0.00e+00 0.0 0.0e+00  0.0e+00 1.0e+00  0  0  0   0  1   0  0  0   0  2     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin      32 1.0 4.4400e-01 7378.9   0.00e+00 0.0 0.0e+00  0.0e+00 1.0e+00  0  0  0   0  1   0  0  0   0  2     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd        32 1.0 1.8511e+00 2.2      0.00e+00 0.0 0.0e+00  0.0e+00 6.0e+00  1  0  0   0  7   1  0  0   0 10     0       0      0 0.00e+00    0 0.00e+00  0
MatZeroEntries         1 1.0 3.3306e-03 1.0      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00  0  0  0   0  0   0  0  0   0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatLoad                1 1.0 1.7220e+01 1.0      0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15  0  9 -10 26  15  0  9 -10 33     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultSym         60 1.0 9.2215e-01 2.6      0.00e+00 0.0 4.0e+00  7.3e+05 3.2e+01  1  0  6   0 40   1  0  6   0 51     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum         30 1.0 4.2967e+01 1.0      6.34e+11 1.1 6.0e+01  9.4e+07 0.0e+00 37 100 86 110  0  37 100 86 110  0 28598  920026    2 6.71e+03   30 8.73e+04 98
MatCUSPARSCopyTo       1 1.0 4.4761e-01 1.0      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00  0  0  0   0  0   0  0  0   0  0     0       0      1 3.80e+03    0 0.00e+00  0
MatDenseCopyTo         1 1.0 2.2742e-01 1.0      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00  0  0  0   0  0   0  0  0   0  0     0       0      1 2.91e+03    0 0.00e+00  0
MatDenseCopyFrom      31 1.0 1.2006e+01 1.0      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00 10  0  0   0  0  10  0  0   0  0     0       0      0 0.00e+00   31 9.02e+04  0
VecSet                 3 1.0 4.1917e-04 1.1      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00  0  0  0   0  0   0  0  0   0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFSetGraph             1 1.0 1.9180e-04 1.1      0.00e+00 0.0 0.0e+00  0.0e+00 0.0e+00  0  0  0   0  0   0  0  0   0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFSetUp                1 1.0 1.3672e-02 1.1      0.00e+00 0.0 4.0e+00  7.3e+05 1.0e+00  0  0  6   0  1   0  0  6   0  2     0       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
              Matrix    37             30     2867511840     0.
              Viewer     2              0              0     0.
              Vector     4              1           1792     0.
           Index Set     2              2        1495248     0.
   Star Forest Graph     3              0              0     0.
========================================================================================================================
Average time to get PetscTime(): 3.83e-08
Average time for MPI_Barrier(): 7.874e-07
Average time for zero size MPI_Send(): 3.4035e-06
#PETSc Option Table entries:
-bench spmm
-enable_gpu
-log_view
-mat_type aijcusparse
-matload_block_size 1
-matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
-n 20
-vec_type cuda
-warmup 10
```
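For context, B is loaded from the file passed via `-matrix`; that code isn't shown above, but it follows the usual MatLoad pattern, roughly like this sketch (where `matrixFile` is a placeholder for the path from `-matrix`):
```
// Rough sketch of how B gets loaded so that -mat_type aijcusparse applies.
// matrixFile stands in for the path passed via -matrix.
Mat B;
PetscViewer viewer;
PetscViewerBinaryOpen(PETSC_COMM_WORLD, matrixFile, FILE_MODE_READ, &viewer);
MatCreate(PETSC_COMM_WORLD, &B);
MatSetFromOptions(B);   // picks up -mat_type aijcusparse from the options database
MatLoad(B, viewer);
PetscViewerDestroy(&viewer);
```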
Thanks,
Rohan Yadav