[petsc-dev] Using PETSC with GPUs

Fri Jan 14 15:59:42 CST 2022

A log_view is attached at the end of the mail.

I am running on a large problem size (639 million nonzeros).

> * I assume you are assembling the matrix on the CPU. The copy of data to
the GPU takes time and you really should be creating the matrix on the GPU

How do I do this? I'm loading the matrix in from a file, but I'm running
the computation several times (and with a warmup), so I would expect that
the data is copied onto the GPU the first time. My (cpu) code to do this is
here:
https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
.

Log view:

---------------------------------------------- PETSc Performance Summary:
----------------------------------------------

./bin/benchmark on a  named lassen75 with 2 processors, by yadav2 Fri Jan
14 13:54:09 2022
Using Petsc Release Version 3.16.3, unknown

                         Max       Max/Min     Avg       Total
Time (sec):           1.026e+02     1.000   1.026e+02
Objects:              1.200e+01     1.000   1.200e+01
Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
MPI Reductions:       4.100e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N
--> 2N flop
                            and VecAXPY() for complex vectors of length N
--> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---
 -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total
    Avg         %Total    Count   %Total
 0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%
 6.313e+08      100.0%  2.300e+01  56.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this
phase
      %M - percent messages in this phase     %L - percent message lengths
in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop
     --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   -
GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
 Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00
1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0
0.00e+00  0
MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08
1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31
2.65e+04  0
MatAssemblyBegin       1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
MatAssemblyEnd         1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0
0.00e+00  0
MatLoad                1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08
1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0
0.00e+00  0
VecSet                 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
VecScatterBegin       30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08
1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0
0.00e+00  0
VecScatterEnd         30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
VecCUDACopyFrom       31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31
2.65e+04  0
SFSetGraph             1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
SFSetUp                1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08
1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0
0.00e+00  0
SFPack                30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
SFUnpack              30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     3              0            0     0.
              Viewer     2              0            0     0.
              Vector     4              1         1792     0.
           Index Set     2              2    335250404     0.
   Star Forest Graph     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 3.77e-08
Average time for MPI_Barrier(): 8.754e-07
Average time for zero size MPI_Send(): 2.6755e-06
#PETSc Option Table entries:
-log_view
-mat_type aijcusparse
-matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
-n 20
-vec_type cuda
-warmup 10
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --download-c2html=0 --download-hwloc=0
--download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
--with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
--with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
--with-clanguage=C --with-cxx-dialect=C++17
--with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
--with-cuda=1 --with-debugging=0
--with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
--with-fftw=0
--with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
--with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
--with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
--with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
--with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
--with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
PETSC_ARCH=arch-linux-c-opt
-----------------------------------------
Libraries compiled on 2022-01-14 20:56:04 on lassen99
Machine characteristics:
Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
Using PETSc arch:
-----------------------------------------

Using C compiler:
/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
-g -DNoChange -fPIC "-O3"
Using Fortran compiler:
/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
-g -fPIC
-----------------------------------------

Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
-I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
-I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
-----------------------------------------

Using C linker:
/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
Using Fortran linker:
/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
Using libraries:
-Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
-L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
-Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
-L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
-Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
-L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
-Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
-L/usr/tce/packages/cuda/cuda-11.1.0/lib64
-Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
-L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
-L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
-L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
-L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
-L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
-lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
-lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
-lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
-lpthread -lquadmath -lstdc++ -ldl
-----------------------------------------

On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfadams at lbl.gov> wrote:

> There are a few things:
> * GPU have higher latencies and so you basically need a large
> enough problem to get GPU speedup
> * I assume you are assembling the matrix on the CPU. The copy of data to
> the GPU takes time and you really should be creating the matrix on the GPU
> * I agree with Barry, Roughly 1M / GPU is around where you start seeing a
> win but this depends on a lot of things.
> * There are startup costs, like the CPU-GPU copy. It is best to run one
> mat-vec, or whatever, push a new stage and then run the benchmark. The
> timing for this new stage will be separate in the log view data. Look at
> that.
>  - You can fake this by running your benchmark many times to amortize any
> setup costs.
>
> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
>> Hi,
>>
>> I'm looking to use PETSc with GPUs to do some linear algebra operations,
>> like SpMV, SPMM etc. Building PETSc with `--with-cuda=1` and running with
>> `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown from the
>> same code running on the CPU. This is not entirely unexpected, as things
>> like data transfer costs across the PCIE might erroneously be included in
>> my timing. Are there some examples of benchmarking GPU computations with
>> PETSc, or just the proper way to write code in PETSc that will work for
>> CPUs and GPUs?
>>
>> Rohan
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220114/bf66cfd1/attachment.html>