[petsc-dev] Using PETSC with GPUs
Rohan Yadav
rohany at alumni.cmu.edu
Sat Jan 15 00:17:23 CST 2022
Scanning the source code for mpiseqaijcusparse confirms my thoughts -- when
used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
MatAXPY_SeqAIJ, copying the data back over to the host.
Rohan
On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
>
> ---------- Forwarded message ---------
> From: Rohan Yadav <rohany at alumni.cmu.edu>
> Date: Fri, Jan 14, 2022 at 10:03 PM
> Subject: Re: [petsc-dev] Using PETSC with GPUs
> To: Barry Smith <bsmith at petsc.dev>
>
>
> Ok, I'll try looking with greps like and see what I find.
>
> > My guess why your code is not using the seqaijcusparse is that you are
> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
> -mat_type does not magically change a type once a matrix has a set type. I
> agree our documentation on how to make objects be GPU objects is horrible
> now.
>
> I printed out my matrices with the PetscViewer objects and can confirm
> that the type is seqaijcusparse. Perhaps for the way I'm using it
> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to
> get any more diagnostic info about why the cuda kernel isn't called...
>
> Rohan
>
> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> This changes rapidly and depends on if the backend is CUDA, HIP, Sycl,
>> or Kokkos. The only way to find out definitively is with, for example,
>>
>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>
>>
>> Because of our, unfortunately, earlier naming choices you need to kind
>> of know what to grep for, for CUDA it may be cuSparse or cuBLAS
>>
>> Not yet merged branches may also have some operations that are still
>> being developed.
>>
>> My guess why your code is not using the seqaijcusparse is that you are
>> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
>> -mat_type does not magically change a type once a matrix has a set type. I
>> agree our documentation on how to make objects be GPU objects is horrible
>> now.
>>
>> Barry
>>
>>
>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>
>> I was wondering if there is a definitive list for what operations are and
>> aren't supported for distributed GPU execution. For some operations, like
>> `MatMult`, it is clear that MPIAIJCUSPARSE implements MatMult from the
>> documentation, but other operations it is unclear, such as MatMatMult.
>> Another scenario is the MatAXPY kernel, which supposedly has a
>> SeqAIJCUSPARSE implementation, which I take means that it can only execute
>> on a single GPU. However, even if I pass -mat_type seqaijcusparse to the
>> kernel it doesn't seem to utilize the GPU.
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>> Just use 1 MPI rank.
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event Count Time (sec) Flop
>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - -
>>> GpuToCpu - GPU
>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size
>>> Count Size %F
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided 1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00
>>> 4.0e+00 1.0e+00 0 0 3 0 2 0 0 3 0 4 0 0 0
>>> 0.00e+00 0 0.00e+00 0
>>> MatMult 30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08
>>> 1.0e+00 65100 91 93 2 65100 91 93 4 346 0 0 0.00e+00 31
>>> 2.65e+04 0
>>>
>>> From this it is clear the matrix never ended up on the GPU, but the
>>> vector did. For each multiply, it is copying the vector from the GPU to the
>>> CPU and then doing the MatMult on the CPU. If the MatMult was done on the
>>> GPU the file number in the row would be 100% indicating all the flops were
>>> done on the GPU and the fifth from the end value of 0 would be some large
>>> number, being the flop rate on the GPU.
>>>
>>>
>>>
>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>
>>> A log_view is attached at the end of the mail.
>>>
>>> I am running on a large problem size (639 million nonzeros).
>>>
>>> > * I assume you are assembling the matrix on the CPU. The copy of data
>>> to the GPU takes time and you really should be creating the matrix on the
>>> GPU
>>>
>>> How do I do this? I'm loading the matrix in from a file, but I'm running
>>> the computation several times (and with a warmup), so I would expect that
>>> the data is copied onto the GPU the first time. My (cpu) code to do this is
>>> here:
>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>> .
>>>
>>> Log view:
>>>
>>> ---------------------------------------------- PETSc Performance
>>> Summary: ----------------------------------------------
>>>
>>> ./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri
>>> Jan 14 13:54:09 2022
>>> Using Petsc Release Version 3.16.3, unknown
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 1.026e+02 1.000 1.026e+02
>>> Objects: 1.200e+01 1.000 1.200e+01
>>> Flop: 1.156e+10 1.009 1.151e+10 2.303e+10
>>> Flop/sec: 1.127e+08 1.009 1.122e+08 2.245e+08
>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
>>> MPI Message Lengths: 2.210e+10 1.000 6.313e+08 4.419e+10
>>> MPI Reductions: 4.100e+01 1.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N
>>> --> 2N flop
>>> and VecAXPY() for complex vectors of length
>>> N --> 8N flop
>>>
>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages
>>> --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total Count
>>> %Total Avg %Total Count %Total
>>> 0: Main Stage: 1.0257e+02 100.0% 2.3025e+10 100.0% 7.000e+01
>>> 100.0% 6.313e+08 100.0% 2.300e+01 56.1%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flop: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all processors
>>> Mess: number of messages sent
>>> AvgLen: average message length (bytes)
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flop in this
>>> phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>>> over all processors)
>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max
>>> GPU time over all processors)
>>> CpuToGpu Count: total number of CPU to GPU copies per processor
>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
>>> processor)
>>> GpuToCpu Count: total number of GPU to CPU copies per processor
>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
>>> processor)
>>> GPU %F: percent flops on GPU in this event
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event Count Time (sec) Flop
>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - -
>>> GpuToCpu - GPU
>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size
>>> Count Size %F
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided 1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00
>>> 4.0e+00 1.0e+00 0 0 3 0 2 0 0 3 0 4 0 0 0
>>> 0.00e+00 0 0.00e+00 0
>>> MatMult 30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08
>>> 1.0e+00 65100 91 93 2 65100 91 93 4 346 0 0 0.00e+00 31
>>> 2.65e+04 0
>>> MatAssemblyBegin 1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> MatAssemblyEnd 1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 4.0e+00 19 0 0 0 10 19 0 0 0 17 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> MatLoad 1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08
>>> 1.6e+01 35 0 9 7 39 35 0 9 7 70 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> VecSet 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> VecScatterBegin 30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08
>>> 1.0e+00 5 0 91 93 2 5 0 91 93 4 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> VecScatterEnd 30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 6 0 0 0 0 6 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> VecCUDACopyFrom 31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 31
>>> 2.65e+04 0
>>> SFSetGraph 1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> SFSetUp 1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08
>>> 1.0e+00 5 0 6 2 2 5 0 6 2 4 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> SFPack 30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>> SFUnpack 30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>>> 0.00e+00 0
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory Descendants'
>>> Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 3 0 0 0.
>>> Viewer 2 0 0 0.
>>> Vector 4 1 1792 0.
>>> Index Set 2 2 335250404 0.
>>> Star Forest Graph 1 0 0 0.
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 3.77e-08
>>> Average time for MPI_Barrier(): 8.754e-07
>>> Average time for zero size MPI_Send(): 2.6755e-06
>>> #PETSc Option Table entries:
>>> -log_view
>>> -mat_type aijcusparse
>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>> -n 20
>>> -vec_type cuda
>>> -warmup 10
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --download-c2html=0 --download-hwloc=0
>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> --with-clanguage=C --with-cxx-dialect=C++17
>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>> --with-cuda=1 --with-debugging=0
>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> --with-fftw=0
>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>>> PETSC_ARCH=arch-linux-c-opt
>>> -----------------------------------------
>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>> Machine characteristics:
>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>> Using PETSc arch:
>>> -----------------------------------------
>>>
>>> Using C compiler:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> -g -DNoChange -fPIC "-O3"
>>> Using Fortran compiler:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> -g -fPIC
>>> -----------------------------------------
>>>
>>> Using include paths:
>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>> -----------------------------------------
>>>
>>> Using C linker:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> Using Fortran linker:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> Using libraries:
>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>>> -lpthread -lquadmath -lstdc++ -ldl
>>> -----------------------------------------
>>>
>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> There are a few things:
>>>> * GPU have higher latencies and so you basically need a large
>>>> enough problem to get GPU speedup
>>>> * I assume you are assembling the matrix on the CPU. The copy of data
>>>> to the GPU takes time and you really should be creating the matrix on the
>>>> GPU
>>>> * I agree with Barry, Roughly 1M / GPU is around where you start seeing
>>>> a win but this depends on a lot of things.
>>>> * There are startup costs, like the CPU-GPU copy. It is best to run one
>>>> mat-vec, or whatever, push a new stage and then run the benchmark. The
>>>> timing for this new stage will be separate in the log view data. Look at
>>>> that.
>>>> - You can fake this by running your benchmark many times to amortize
>>>> any setup costs.
>>>>
>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <rohany at alumni.cmu.edu>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking to use PETSc with GPUs to do some linear algebra
>>>>> operations, like SpMV, SPMM etc. Building PETSc with `--with-cuda=1` and
>>>>> running with `-mat_type aijcusparse -vec_type cuda` gives me a large
>>>>> slowdown from the same code running on the CPU. This is not entirely
>>>>> unexpected, as things like data transfer costs across the PCIE might
>>>>> erroneously be included in my timing. Are there some examples of
>>>>> benchmarking GPU computations with PETSc, or just the proper way to write
>>>>> code in PETSc that will work for CPUs and GPUs?
>>>>>
>>>>> Rohan
>>>>>
>>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220114/3b514cf8/attachment-0001.html>
More information about the petsc-dev
mailing list