[petsc-dev] Using PETSC with GPUs
Rohan Yadav
rohany at alumni.cmu.edu
Thu Jan 20 15:07:27 CST 2022
Another small question -- I'm a little confused about timing GPU codes
with PETSc. I have a code that looks like:
```
start = now()
for (int i = 0; i < 10; i++) {
MatMult(A, x, y);
}
end = now()
print((end - start) / 10)
```
If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs
are indeed utilized, but the recorded time is very small (I imagine it is
just capturing the cost of launching the CUDA kernels). However, if I add
`-log_view` to the command-line arguments, then the resulting printed time
matches what is recorded by `nvprof`. What is the correct way to benchmark
PETSc with GPUs without having -log_view turned on?
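My guess is that the loop returns before the kernels have finished, because
CUDA launches are asynchronous, and that `-log_view` adds its own
synchronization -- but I haven't verified this. Is something like the
following the intended pattern? (The explicit cudaDeviceSynchronize() calls
are my guess at the right barrier, untested; needs cuda_runtime.h.)
```
PetscLogDouble start, end;

MatMult(A, x, y);        // warm-up multiply: pays the one-time CPU->GPU copy
cudaDeviceSynchronize(); // drain pending kernels before starting the clock
PetscTime(&start);
for (int i = 0; i < 10; i++) {
  MatMult(A, x, y);
}
cudaDeviceSynchronize(); // wait for the asynchronous kernels to finish
PetscTime(&end);
PetscPrintf(PETSC_COMM_WORLD, "avg MatMult: %g s\n", (end - start) / 10);
```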
Thanks,
Rohan
On Sat, Jan 15, 2022 at 7:37 AM Barry Smith <bsmith at petsc.dev> wrote:
>
> Oh yes, you are correct for this operation, since the handling of
> different nonzero patterns is not trivial to implement well on the GPU.
>
> On Jan 15, 2022, at 1:17 AM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
> Scanning the source code for mpiseqaijcusparse confirms my thoughts --
> when used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
> MatAXPY_SeqAIJ, copying the data back over to the host.
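>
> If the patterns in my use case actually match, I assume calling it with
> SAME_NONZERO_PATTERN would avoid the fallback and keep the work on the
> GPU -- an untested sketch:
>
>     MatAXPY(Y, alpha, X, SAME_NONZERO_PATTERN);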
>
> Rohan
>
> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <rohany at alumni.cmu.edu>
> wrote:
>
>>
>>
>> ---------- Forwarded message ---------
>> From: Rohan Yadav <rohany at alumni.cmu.edu>
>> Date: Fri, Jan 14, 2022 at 10:03 PM
>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>> To: Barry Smith <bsmith at petsc.dev>
>>
>>
>> Ok, I'll try looking with greps like that and see what I find.
>>
>> > My guess as to why your code is not using seqaijcusparse is that you
>> > are not setting the type before you call MatLoad(), hence it loads as
>> > SeqAIJ. -mat_type does not magically change the type once a matrix has
>> > a set type. I agree our documentation on how to make objects be GPU
>> > objects is horrible now.
>>
>> I printed out my matrices with the PetscViewer objects and can confirm
>> that the type is seqaijcusparse. Perhaps for the way I'm using it
>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how
>> to get any more diagnostic info about why the CUDA kernel isn't called...
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>> This changes rapidly and depends on whether the backend is CUDA, HIP,
>>> SYCL, or Kokkos. The only way to find out definitively is with, for example,
>>>
>>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>>
>>>
>>> Because of our unfortunate earlier naming choices, you need to kind
>>> of know what to grep for; for CUDA it may be cuSparse or cuBLAS.
>>>
>>> Not yet merged branches may also have some operations that are still
>>> being developed.
>>>
>>> My guess as to why your code is not using seqaijcusparse is that you are
>>> not setting the type before you call MatLoad(), hence it loads as SeqAIJ.
>>> -mat_type does not magically change the type once a matrix has a set
>>> type. I agree our documentation on how to make objects be GPU objects is
>>> horrible now.
>>>
>>> Barry
>>>
>>>
>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>
>>> I was wondering if there is a definitive list of which operations are
>>> and aren't supported for distributed GPU execution. For some operations,
>>> like `MatMult`, it is clear from the documentation that MPIAIJCUSPARSE
>>> implements them, but for others, such as MatMatMult, it is unclear.
>>> Another scenario is the MatAXPY kernel, which supposedly has a
>>> SeqAIJCUSPARSE implementation, which I take to mean that it can only
>>> execute on a single GPU. However, even if I pass -mat_type
>>> seqaijcusparse, the kernel doesn't seem to utilize the GPU.
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>> Just use 1 MPI rank.
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s  Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>>
>>>> From this it is clear that the matrix never ended up on the GPU, but the
>>>> vector did. For each multiply, it is copying the vector from the GPU to
>>>> the CPU and then doing the MatMult on the CPU. If the MatMult were done
>>>> on the GPU, the final number in the row (GPU %F) would be 100, indicating
>>>> all the flops were done on the GPU, and the fifth-from-the-end value of 0
>>>> would be some large number, namely the flop rate on the GPU.
>>>>
>>>>
>>>>
>>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>>
>>>> The -log_view output is attached at the end of this mail.
>>>>
>>>> I am running on a large problem size (639 million nonzeros).
>>>>
>>>> > * I assume you are assembling the matrix on the CPU. The copy of data
>>>> to the GPU takes time and you really should be creating the matrix on the
>>>> GPU
>>>>
>>>> How do I do this? I'm loading the matrix from a file, but I'm running
>>>> the computation several times (and with a warmup), so I would expect
>>>> the data to be copied onto the GPU the first time. My (CPU) code to do
>>>> this is here:
>>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>>>
>>>> Log view:
>>>>
>>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>>
>>>> ./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
>>>> Using Petsc Release Version 3.16.3, unknown
>>>>
>>>>                          Max       Max/Min     Avg       Total
>>>> Time (sec):           1.026e+02     1.000   1.026e+02
>>>> Objects:              1.200e+01     1.000   1.200e+01
>>>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>>>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>>>> MPI Reductions:       4.100e+01     1.000
>>>>
>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>>                           e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>>                           and VecAXPY() for complex vectors of length N --> 8N flop
>>>>
>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count  %Total
>>>>  0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%  6.313e+08      100.0%  2.300e+01  56.1%
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>> Phase summary info:
>>>>    Count: number of times phase was executed
>>>>    Time and Flop: Max - maximum over all processors
>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>    Mess: number of messages sent
>>>>    AvgLen: average message length (bytes)
>>>>    Reduct: number of global reductions
>>>>    Global: entire computation
>>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>>       %R - percent reductions in this phase
>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>>    GPU %F: percent flops on GPU in this event
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s  Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>> MatAssemblyBegin       1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatAssemblyEnd         1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatLoad                1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecSet                 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecScatterBegin       30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecScatterEnd         30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecCUDACopyFrom       31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31 2.65e+04  0
>>>> SFSetGraph             1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFSetUp                1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFPack                30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFUnpack              30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>>               Matrix     3              0            0     0.
>>>>               Viewer     2              0            0     0.
>>>>               Vector     4              1         1792     0.
>>>>            Index Set     2              2    335250404     0.
>>>>    Star Forest Graph     1              0            0     0.
>>>>
>>>> ========================================================================================================================
>>>> Average time to get PetscTime(): 3.77e-08
>>>> Average time for MPI_Barrier(): 8.754e-07
>>>> Average time for zero size MPI_Send(): 2.6755e-06
>>>> #PETSc Option Table entries:
>>>> -log_view
>>>> -mat_type aijcusparse
>>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>>> -n 20
>>>> -vec_type cuda
>>>> -warmup 10
>>>> #End of PETSc Option Table entries
>>>> Compiled without FORTRAN kernels
>>>> Compiled with full precision matrices (default)
>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>>> Configure options: --download-c2html=0 --download-hwloc=0
>>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
>>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> --with-clanguage=C --with-cxx-dialect=C++17
>>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>>> --with-cuda=1 --with-debugging=0
>>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> --with-fftw=0
>>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>>>> PETSC_ARCH=arch-linux-c-opt
>>>> -----------------------------------------
>>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>>> Machine characteristics:
>>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>>> Using PETSc arch:
>>>> -----------------------------------------
>>>>
>>>> Using C compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> -g -DNoChange -fPIC "-O3"
>>>> Using Fortran compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> -g -fPIC
>>>> -----------------------------------------
>>>>
>>>> Using include paths:
>>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>>> -----------------------------------------
>>>>
>>>> Using C linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> Using Fortran linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> Using libraries:
>>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>>>> -lpthread -lquadmath -lstdc++ -ldl
>>>> -----------------------------------------
>>>>
>>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> There are a few things:
>>>>> * GPUs have higher latencies, so you basically need a large enough
>>>>> problem to get a GPU speedup.
>>>>> * I assume you are assembling the matrix on the CPU. The copy of data
>>>>> to the GPU takes time and you really should be creating the matrix on the
>>>>> GPU
>>>>> * I agree with Barry: roughly 1M / GPU is around where you start
>>>>> seeing a win, but this depends on a lot of things.
>>>>> * There are startup costs, like the CPU-GPU copy. It is best to run
>>>>> one mat-vec, or whatever, push a new stage, and then run the benchmark
>>>>> (see the sketch after this list). The timing for this new stage will be
>>>>> separate in the log view data. Look at that.
>>>>> - You can fake this by running your benchmark many times to amortize
>>>>> any setup costs.
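>>>>>
>>>>> A sketch of the stage idea (the stage name is just a placeholder):
>>>>>
>>>>>     PetscLogStage stage;
>>>>>     PetscLogStageRegister("Benchmark", &stage);
>>>>>     MatMult(A, x, y);             /* warm-up: pays the CpuToGpu copy */
>>>>>     PetscLogStagePush(stage);
>>>>>     for (int i = 0; i < n; i++) MatMult(A, x, y);
>>>>>     PetscLogStagePop();           /* this stage gets its own -log_view section */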
>>>>>
>>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <rohany at alumni.cmu.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm looking to use PETSc with GPUs to do some linear algebra
>>>>>> operations, like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and
>>>>>> running with `-mat_type aijcusparse -vec_type cuda` gives me a large
>>>>>> slowdown compared to the same code running on the CPU. This is not
>>>>>> entirely unexpected, as things like data transfer costs across PCIe
>>>>>> might erroneously be included in my timing. Are there some examples of
>>>>>> benchmarking GPU computations with PETSc, or just the proper way to
>>>>>> write code in PETSc that will work for both CPUs and GPUs?
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>
>>>>
>>>
>