A log_view is attached at the end of this mail.

I am running on a large problem size (639 million nonzeros).

> * I assume you are assembling the matrix on the CPU. The copy of data to the GPU takes time and you really should be creating the matrix on the GPU

How do I do this? I'm loading the matrix in from a file, but I run the computation several times (with a warmup), so I would expect that the data is copied onto the GPU only the first time. My (CPU) code that does this is here: https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68. (A rough sketch of the timing loop, restructured around the log stage suggested below, is at the end of this mail.)

Log view:

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
Using Petsc Release Version 3.16.3, unknown

                         Max       Max/Min     Avg       Total
Time (sec):           1.026e+02     1.000   1.026e+02
Objects:              1.200e+01     1.000   1.200e+01
Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
MPI Reductions:       4.100e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---   -- Message Lengths --   -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total      Count   %Total
 0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%  6.313e+08      100.0%  2.300e+01  56.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided        1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
MatMult             30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
MatAssemblyBegin     1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd       1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0 0.00e+00  0
MatLoad              1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0 0.00e+00  0
VecSet               5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin     30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd       30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyFrom     31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31 2.65e+04  0
SFSetGraph           1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFSetUp              1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0 0.00e+00  0
SFPack              30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFUnpack            30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     3              0            0     0.
              Viewer     2              0            0     0.
              Vector     4              1         1792     0.
           Index Set     2              2    335250404     0.
   Star Forest Graph     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 3.77e-08
Average time for MPI_Barrier(): 8.754e-07
Average time for zero size MPI_Send(): 2.6755e-06
#PETSc Option Table entries:
-log_view
-mat_type aijcusparse
-matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
-n 20
-vec_type cuda
-warmup 10
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc --with-clanguage=C --with-cxx-dialect=C++17 --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ --with-cuda=1 --with-debugging=0 --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran --with-fftw=0 --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= PETSC_ARCH=arch-linux-c-opt
-----------------------------------------
Libraries compiled on 2022-01-14 20:56:04 on lassen99
Machine characteristics: Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
Using PETSc arch:
-----------------------------------------

Using C compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc -g -DNoChange -fPIC "-O3"
Using Fortran compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran -g -fPIC
-----------------------------------------

Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
-----------------------------------------

Using C linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
Using Fortran linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
-L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath -lstdc++ -ldl
-----------------------------------------

On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfadams@lbl.gov> wrote:
> There are a few things:
> * GPUs have higher latencies, so you basically need a large enough problem to get GPU speedup.
> * I assume you are assembling the matrix on the CPU. The copy of data to the GPU takes time and you really should be creating the matrix on the GPU.
> * I agree with Barry: roughly 1M / GPU is around where you start seeing a win, but this depends on a lot of things.
> * There are startup costs, like the CPU-GPU copy. It is best to run one mat-vec, or whatever, push a new stage, and then run the benchmark. The timing for this new stage will be separate in the log view data. Look at that.
>   - You can fake this by running your benchmark many times to amortize any setup costs.
>
> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <rohany@alumni.cmu.edu> wrote:
>> Hi,
>>
>> I'm looking to use PETSc with GPUs to do some linear algebra operations, like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and running with `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown relative to the same code running on the CPU. This is not entirely unexpected, as things like data transfer costs across PCIe might erroneously be included in my timing. Are there some examples of benchmarking GPU computations with PETSc, or just the proper way to write code in PETSc that works for both CPUs and GPUs?
>>
>> Rohan
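P.S. To make sure I understand the staging suggestion, here is a rough sketch of how I would restructure the timed loop: a few warmup MatMults (the first one triggers the CPU-to-GPU copy), then PetscLogStagePush/PetscLogStagePop around the timed MatMults so -log_view reports them as their own stage. This is not my linked benchmark verbatim, just the structure I have in mind; the -matrix, -warmup, and -n option names match the options shown in the log above.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    viewer;
  PetscLogStage  stage;
  char           file[PETSC_MAX_PATH_LEN];
  PetscBool      flg;
  PetscInt       i, nwarmup = 10, niter = 20;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-matrix", file, sizeof(file), &flg); CHKERRQ(ierr);
  if (!flg) SETERRQ(PETSC_COMM_WORLD, PETSC_ERR_USER, "must provide -matrix <path to PETSc binary matrix>");
  ierr = PetscOptionsGetInt(NULL, NULL, "-warmup", &nwarmup, NULL); CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &niter, NULL); CHKERRQ(ierr);

  /* Load the matrix; -mat_type aijcusparse on the command line makes
     MatSetFromOptions select the CUDA sparse format before MatLoad. */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatLoad(A, viewer); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);

  /* Vectors compatible with A (CUDA vectors when A is aijcusparse). */
  ierr = MatCreateVecs(A, &x, &y); CHKERRQ(ierr);
  ierr = VecSet(x, 1.0); CHKERRQ(ierr);

  /* Warmup mat-vecs: the first one pays the one-time CPU-to-GPU copy cost. */
  for (i = 0; i < nwarmup; i++) { ierr = MatMult(A, x, y); CHKERRQ(ierr); }

  /* Timed region in its own log stage so -log_view reports it separately
     from MatLoad and the warmup. */
  ierr = PetscLogStageRegister("Benchmark", &stage); CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage); CHKERRQ(ierr);
  for (i = 0; i < niter; i++) { ierr = MatMult(A, x, y); CHKERRQ(ierr); }
  ierr = PetscLogStagePop(); CHKERRQ(ierr);

  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&y); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If I run that with -matrix <file> -mat_type aijcusparse -vec_type cuda -log_view, the "Benchmark" stage should then exclude MatLoad and the initial copy. Is that the right way to read the per-stage numbers?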