<div dir="ltr">Hi All,<div><br></div><div>I'm trying to use the MatMatMult function with 1 sparse matrix B and two dense matrices A, C. I'm computing A = B * C.</div><div><br></div><div>My code is below:<br><br>```<br><pre style="color:rgb(0,0,0);white-space:pre-wrap">void spmm(Mat B, int warmup, int niter) {
Mat A, C;
PetscInt i, j = 32, k;
MatGetSize(B, &i, &k);
MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, i, j, NULL, &A);
MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, k, j, NULL, &C);
// Initialize entries in the output.
MatZeroEntries(A);
setMatToConstant(C, 1.0);
// Finally, do the computation.
auto avgTime = benchmarkWithWarmup(warmup, niter, [&]() {
MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);
});
PetscPrintf(PETSC_COMM_WORLD, "Average time: %lf ms.\n", avgTime * 1000);
}</pre><pre style="color:rgb(0,0,0);white-space:pre-wrap">```</pre><pre style="overflow-wrap: break-word;"><font face="Arial, Helvetica, sans-serif"><span style="white-space:normal">where benchmarkWithWarmup is a simple wrapper function that runs a lambda several times.</span></font></pre><pre style="overflow-wrap: break-word;"><font face="Arial, Helvetica, sans-serif"><span style="white-space:normal"><br></span></font></pre><pre style="overflow-wrap: break-word;"><font face="Arial, Helvetica, sans-serif"><span style="white-space:normal">I'm running this function with arguments `-vec_type cuda -mat_type aijcusparse`, and see that the performance is relatively slow. I'm wondering if I'm using the API incorrectly, or the computation is executing as expected. `nvprof` shows that much of the time is spent in a device to host memcpys:</span></font></pre><pre style="overflow-wrap: break-word;"><font face="Arial, Helvetica, sans-serif"><span style="white-space:normal">```<br></span></font> Type Time(%) Time Calls Avg Min Max Name<br> GPU activities: 87.32% 11.9978s 33 363.57ms 1.5040us 388.26ms [CUDA memcpy DtoH]<br> 8.71% 1.19611s 30 39.870ms 37.421ms 39.976ms void cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double>(bool=0, cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double>, cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double>, bool=0 const *, bool=0 const , bool=0, bool=0, int, cusparseOperation_t, cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double> const *, cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double> const , unsigned int=8 const *, unsigned int=8 const , cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double>, unsigned int=16*, cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>, int, double, double, double>)<br> 3.87% 531.56ms 14 37.968ms 1.0240us 227.29ms [CUDA memcpy HtoD]<br> 0.07% 9.7452ms 6 1.6242ms 1.0880us 3.2481ms [CUDA memset]<br> 0.02% 2.8727ms 1 2.8727ms 2.8727ms 2.8727ms void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>, double>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>, double>, unsigned long>(thrust::device_ptr<double>, double)<br> 0.01% 1.4953ms 2 747.67us 56.188us 1.4392ms void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int)
<font face="Arial, Helvetica, sans-serif"><span style="white-space:normal">```</span></font></pre><pre style="overflow-wrap: break-word;">The logview output is:</pre><pre style="overflow-wrap: break-word;">```
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------<br><br>/g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 processors, by yadav2 Wed Feb 2 17:23:19 2022<br>Using Petsc Release Version 3.16.3, unknown<br><br> Max Max/Min Avg Total<br>Time (sec): 1.163e+02 1.000 1.163e+02<br>Objects: 4.800e+01 1.000 4.800e+01<br>Flop: 6.338e+11 1.065 6.144e+11 1.229e+12<br>Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10<br>MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01<br>MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09<br>MPI Reductions: 8.100e+01 1.000<br><br>Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br> e.g., VecAXPY() for real vectors of length N --> 2N flop<br> and VecAXPY() for complex vectors of length N --> 8N flop<br><br>Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --<br> Avg %Total Avg %Total Count %Total Avg %Total Count %Total<br> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0% 7.267e+07 100.0% 6.300e+01 77.8%<br><br>------------------------------------------------------------------------------------------------------------------------<br>See the 'Profiling' chapter of the users' manual for details on interpreting output.<br>Phase summary info:<br> Count: number of times phase was executed<br> Time and Flop: Max - maximum over all processors<br> Ratio - ratio of maximum to minimum over all processors<br> Mess: number of messages sent<br> AvgLen: average message length (bytes)<br> Reduct: number of global reductions<br> Global: entire computation<br> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br> %T - percent time in this phase %F - percent flop in this phase<br> %M - percent messages in this phase %L - percent message lengths in this phase<br> %R - percent reductions in this phase<br> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)<br> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)<br> CpuToGpu Count: total number of CPU to GPU copies per processor<br> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)<br> GpuToCpu Count: total number of GPU to CPU copies per processor<br> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)<br> GPU %F: percent flops on GPU in this event<br>------------------------------------------------------------------------------------------------------------------------<br>Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU<br> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br><br>--- Event Stage 0: Main Stage<br><br>BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 0.00e+00 0<br>BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0<br>MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0<br>MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 0.00e+00 0<br>MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0 0.00e+00 0<br>MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 0.00e+00 0<br>MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 8.73e+04 98<br>MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 0.00e+00 0<br>MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 0.00e+00 0<br>MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 9.02e+04 0<br>VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 0.00e+00 0<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br><br>Memory usage is given in bytes:<br><br>Object Type Creations Destructions Memory Descendants' Mem.<br>Reports information only for process 0.<br><br>--- Event Stage 0: Main Stage<br><br> Matrix 37 30 2867511840 0.<br> Viewer 2 0 0 0.<br> Vector 4 1 1792 0.<br> Index Set 2 2 1495248 0.<br> Star Forest Graph 3 0 0 0.<br>========================================================================================================================<br>Average time to get PetscTime(): 3.83e-08<br>Average time for MPI_Barrier(): 7.874e-07<br>Average time for zero size MPI_Send(): 3.4035e-06<br>#PETSc Option Table entries:<br>-bench spmm<br>-enable_gpu<br>-log_view<br>-mat_type aijcusparse<br>-matload_block_size 1<br>-matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc<br>-n 20<br>-vec_type cuda<br>-warmup 10
```</pre><pre style="overflow-wrap: break-word;"><br></pre><pre style="overflow-wrap: break-word;">Thanks,</pre><pre style="overflow-wrap: break-word;"><br></pre><pre style="overflow-wrap: break-word;">Rohan Yadav</pre><pre style="overflow-wrap: break-word;"><font face="Arial, Helvetica, sans-serif"><span style="white-space:normal"><br></span></font></pre></div></div>