[petsc-dev] Using PETSC with GPUs

Rohan Yadav rohany at alumni.cmu.edu
Sat Jan 15 00:42:20 CST 2022


I'm also running into an error trying to use the MatMatMult cuSPARSE
kernel. The following snippet of code (compiled against PETSc 3.16.3) runs
correctly with no arguments, but fails with the attached error when run
with `-mat_type aijcusparse`.

Sample code:
```
#include "mpi.h"
#include "petscmat.h"
#include "petscvec.h"
#include "petsc.h"
#include "petscsys.h"
#include "petsctime.h"

static const char help[] = "Petsc SpMM bug";

int main(int argc, char** argv) {
  PetscInitialize(&argc, &argv, (char *)0, help);

  int i = 5, j = 32, k = 5;

  /* B is a sparse i x k matrix; -mat_type aijcusparse makes it a GPU matrix. */
  Mat B;
  MatCreate(PETSC_COMM_WORLD, &B);
  MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, i, k);
  MatSetFromOptions(B);
  MatSetUp(B);
  MatSetValue(B, 0, 0, 1.0, INSERT_VALUES);
  MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);

  /* A (i x j) and C (k x j) are dense CUDA matrices. */
  Mat A, C;
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, i, j, NULL, &A);
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, k, j, NULL, &C);

  /* Compute A = B * C, reusing the preallocated dense matrix A. */
  MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);

  PetscFinalize();
  return 0;
}
```
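
I have not checked whether the MAT_INITIAL_MATRIX path hits the same
error; for reference, that variant (a sketch, letting PETSc allocate the
result matrix itself) would look like:
```
Mat A2;
MatMatMult(B, C, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A2);
```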

Error:
```
 ** On entry to cusparseSpMM_bufferSize() dimension mismatch: matA->rows != matC->rows

[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: cuSPARSE error 3 (CUSPARSE_STATUS_INVALID_VALUE) : invalid value
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.16.3, unknown
[0]PETSC ERROR: ./bin/spmm-error on a  named lassen68 by yadav2 Fri Jan 14 22:40:13 2022
[0]PETSC ERROR: Configure options --download-c2html=0 --download-hwloc=0
--download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
--with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
--with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
--with-clanguage=C --with-cxx-dialect=C++17
--with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
--with-cuda=1 --with-debugging=0
--with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
--with-fftw=0
--with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
--with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
--with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
--with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
--with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
--with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
PETSC_ARCH=arch-linux-c-opt
[0]PETSC ERROR: #1 MatProductNumeric_SeqAIJCUSPARSE_SeqDENSECUDA() at /g/g15/yadav2/taco/petsc/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:2115
[0]PETSC ERROR: #2 MatProductNumeric() at /g/g15/yadav2/taco/petsc/petsc/src/mat/interface/matproduct.c:695
[0]PETSC ERROR: #3 MatProduct_Private() at /g/g15/yadav2/taco/petsc/petsc/src/mat/interface/matrix.c:9743
[0]PETSC ERROR: #4 MatMatMult() at /g/g15/yadav2/taco/petsc/petsc/src/mat/interface/matrix.c:9795
```

Rohan

On Fri, Jan 14, 2022 at 10:17 PM Rohan Yadav <rohany at alumni.cmu.edu> wrote:

> Scanning the seqaijcusparse source code confirms my suspicion --
> when used with DIFFERENT_NONZERO_PATTERN, MatAXPY falls back to calling
> MatAXPY_SeqAIJ, copying the data back over to the host.
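>
> For reference, the call shape in question (an illustrative sketch, not a
> snippet from my code) is:
>
>     MatAXPY(Y, alpha, X, DIFFERENT_NONZERO_PATTERN); /* Y = Y + alpha*X */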
>
> Rohan
>
> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <rohany at alumni.cmu.edu>
> wrote:
>
>>
>>
>> ---------- Forwarded message ---------
>> From: Rohan Yadav <rohany at alumni.cmu.edu>
>> Date: Fri, Jan 14, 2022 at 10:03 PM
>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>> To: Barry Smith <bsmith at petsc.dev>
>>
>>
>> Ok, I'll try looking with greps like that and see what I find.
>>
>> >  My guess as to why your code is not using seqaijcusparse is that you
>> are not setting the type before you call MatLoad(), hence it loads as
>> SeqAIJ. -mat_type does not magically change the type once a matrix's type
>> has been set. I agree our documentation on how to make objects GPU objects
>> is horrible right now.
>>
>> I printed out my matrices with the PetscViewer objects and can confirm
>> that the type is seqaijcusparse. Perhaps for the way I'm using it
>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to
>> get any more diagnostic info about why the CUDA kernel isn't called...
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   This changes rapidly and depends on whether the backend is CUDA, HIP,
>>> SYCL, or Kokkos. The only way to find out definitively is with, for example,
>>>
>>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>>
>>>
>>>   Because of our unfortunate earlier naming choices, you need to roughly
>>> know what to grep for; for CUDA it may be cusparse or cublas.
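>>>
>>>   For the MatAXPY case, a similar (untested) search would be
>>>
>>> git grep MatAXPY_ | egrep -i "(cusparse|cublas|cuda)"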
>>>
>>>   Not yet merged branches may also have some operations that are still
>>> being developed.
>>>
>>>   My guess as to why your code is not using seqaijcusparse is that you
>>> are not setting the type before you call MatLoad(), hence it loads as
>>> SeqAIJ. -mat_type does not magically change the type once a matrix's type
>>> has been set. I agree our documentation on how to make objects GPU objects
>>> is horrible right now.
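>>>
>>>   For example (an untested sketch; "viewer" here stands for whatever
>>> PetscViewer you load the file with), setting the type first keeps the
>>> loaded matrix on the GPU:
>>>
>>>     Mat A;
>>>     MatCreate(PETSC_COMM_WORLD, &A);
>>>     MatSetType(A, MATAIJCUSPARSE); /* or MatSetFromOptions(A) to honor -mat_type */
>>>     MatLoad(A, viewer);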
>>>
>>>   Barry
>>>
>>>
>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>
>>> I was wondering if there is a definitive list of which operations are
>>> and aren't supported for distributed GPU execution. For some operations,
>>> like `MatMult`, it is clear from the documentation that MPIAIJCUSPARSE
>>> implements them, but for others, such as MatMatMult, it is unclear.
>>> Another scenario is the MatAXPY kernel, which supposedly has a
>>> SeqAIJCUSPARSE implementation, which I take to mean that it can only
>>> execute on a single GPU. However, even if I pass -mat_type seqaijcusparse
>>> to the kernel, it doesn't seem to utilize the GPU.
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>   Just use 1 MPI rank.
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size  Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>>
>>>> From this it is clear the matrix never ended up on the GPU, but the
>>>> vector did. For each multiply, it is copying the vector from the GPU to the
>>>> CPU and then doing the MatMult on the CPU. If the MatMult were done on the
>>>> GPU, the final number in the row (the GPU %F column) would be 100, indicating
>>>> all the flops were done on the GPU, and the GPU Mflop/s value of 0 near the
>>>> end of the row would be some large number, the flop rate on the GPU.
>>>>
>>>>
>>>>
>>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>>>>
>>>> A log_view is attached at the end of the mail.
>>>>
>>>> I am running on a large problem size (639 million nonzeros).
>>>>
>>>> > * I assume you are assembling the matrix on the CPU. The copy of data
>>>> to the GPU takes time, and you really should be creating the matrix on
>>>> the GPU.
>>>>
>>>> How do I do this? I'm loading the matrix in from a file, but I'm
>>>> running the computation several times (and with a warmup), so I would
>>>> expect that the data is copied onto the GPU the first time. My (CPU) code
>>>> to do this is here:
>>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>>> .
>>>>
>>>> Log view:
>>>>
>>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>>
>>>> ./bin/benchmark on a  named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
>>>> Using Petsc Release Version 3.16.3, unknown
>>>>
>>>>                          Max       Max/Min     Avg       Total
>>>> Time (sec):           1.026e+02     1.000   1.026e+02
>>>> Objects:              1.200e+01     1.000   1.200e+01
>>>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>>>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>>>> MPI Reductions:       4.100e+01     1.000
>>>>
>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>>
>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>>  0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%  6.313e+08      100.0%  2.300e+01  56.1%
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>> Phase summary info:
>>>>    Count: number of times phase was executed
>>>>    Time and Flop: Max - maximum over all processors
>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>    Mess: number of messages sent
>>>>    AvgLen: average message length (bytes)
>>>>    Reduct: number of global reductions
>>>>    Global: entire computation
>>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>>       %R - percent reductions in this phase
>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>>    GPU %F: percent flops on GPU in this event
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size  Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>> MatAssemblyBegin       1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatAssemblyEnd         1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatLoad                1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecSet                 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecScatterBegin       30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecScatterEnd         30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> VecCUDACopyFrom       31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31 2.65e+04  0
>>>> SFSetGraph             1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFSetUp                1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFPack                30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>> SFUnpack              30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>>               Matrix     3              0            0     0.
>>>>               Viewer     2              0            0     0.
>>>>               Vector     4              1         1792     0.
>>>>            Index Set     2              2    335250404     0.
>>>>    Star Forest Graph     1              0            0     0.
>>>>
>>>> ========================================================================================================================
>>>> Average time to get PetscTime(): 3.77e-08
>>>> Average time for MPI_Barrier(): 8.754e-07
>>>> Average time for zero size MPI_Send(): 2.6755e-06
>>>> #PETSc Option Table entries:
>>>> -log_view
>>>> -mat_type aijcusparse
>>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>>> -n 20
>>>> -vec_type cuda
>>>> -warmup 10
>>>> #End of PETSc Option Table entries
>>>> Compiled without FORTRAN kernels
>>>> Compiled with full precision matrices (default)
>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>>> Configure options: --download-c2html=0 --download-hwloc=0
>>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
>>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> --with-clanguage=C --with-cxx-dialect=C++17
>>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>>> --with-cuda=1 --with-debugging=0
>>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> --with-fftw=0
>>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>>>> PETSC_ARCH=arch-linux-c-opt
>>>> -----------------------------------------
>>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>>> Machine characteristics:
>>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>>> Using PETSc arch:
>>>> -----------------------------------------
>>>>
>>>> Using C compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> -g -DNoChange -fPIC "-O3"
>>>> Using Fortran compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> -g -fPIC
>>>> -----------------------------------------
>>>>
>>>> Using include paths:
>>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>>> -----------------------------------------
>>>>
>>>> Using C linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> Using Fortran linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> Using libraries:
>>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>>>> -lpthread -lquadmath -lstdc++ -ldl
>>>> -----------------------------------------
>>>>
>>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> There are a few things:
>>>>> * GPUs have higher latencies, so you basically need a large enough
>>>>> problem to get a GPU speedup.
>>>>> * I assume you are assembling the matrix on the CPU. The copy of data
>>>>> to the GPU takes time, and you really should be creating the matrix on
>>>>> the GPU.
>>>>> * I agree with Barry: roughly 1M / GPU is around where you start
>>>>> seeing a win, but this depends on a lot of things.
>>>>> * There are startup costs, like the CPU-GPU copy. It is best to run
>>>>> one mat-vec, or whatever, push a new stage, and then run the benchmark
>>>>> (see the sketch after this list). The timing for this new stage will be
>>>>> separate in the log view data. Look at that.
>>>>>  - You can fake this by running your benchmark many times to amortize
>>>>> any setup costs.
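>>>>>
>>>>> A minimal sketch of that staged-timing pattern (names like "bench" and
>>>>> "niter" are illustrative, not from your benchmark):
>>>>>
>>>>>   PetscLogStage bench;
>>>>>   PetscLogStageRegister("Benchmark", &bench);
>>>>>   MatMult(A, x, y);               /* warm-up; triggers the CpuToGpu copy */
>>>>>   PetscLogStagePush(bench);
>>>>>   for (PetscInt it = 0; it < niter; it++) MatMult(A, x, y);
>>>>>   PetscLogStagePop();             /* -log_view reports this stage separately */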
>>>>>
>>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <rohany at alumni.cmu.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm looking to use PETSc with GPUs to do some linear algebra
>>>>>> operations, like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and
>>>>>> running with `-mat_type aijcusparse -vec_type cuda` gives me a large
>>>>>> slowdown relative to the same code running on the CPU. This is not
>>>>>> entirely unexpected, as things like data transfer costs across PCIe might
>>>>>> erroneously be included in my timing. Are there some examples of
>>>>>> benchmarking GPU computations with PETSc, or guidance on the proper way to
>>>>>> write PETSc code that will work for both CPUs and GPUs?
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>
>>>>
>>>