[petsc-users] PETSc GPU MatMatMult performance question
Barry Smith
bsmith at petsc.dev
Thu Feb 3 16:01:48 CST 2022
> On Feb 3, 2022, at 4:28 PM, Rohan Yadav <rohany at alumni.cmu.edu> wrote:
>
> To be concrete, the first matrix was https://sparse.tamu.edu/LAW/arabic-2005 <https://sparse.tamu.edu/LAW/arabic-2005> and the second was https://sparse.tamu.edu/Schenk/nlpkkt200 <https://sparse.tamu.edu/Schenk/nlpkkt200> (which looks like it does come from the PDE domain?).
You are correct; but the matrix bandwidth is so huge (see how far that off diagonal that second band of nonzeros is is) for two ranks this means the each MPI rank ends up needing the entire right hand side matrix to do the computation. Plus the 0 in the lower diagonal block means that on the second rank there is essentially no work to be done on the GPU at all.
>
> Regardless of the non-zero structure, there is still a significant hit when moving from 1 gpu to multiple GPUs that causes a large number of device to host copies to be performed. If this is a result of the PETSc implementation thats fine -- but if there's something I can do to work around that it would be great.
I don't understand enough about the code that Stefeno pointed to know how easy the performance problem would be to fix in PETSc for sparse times dense matrix product It would still be a problem for the nlpkkt200 <https://sparse.tamu.edu/Schenk/nlpkkt200> as partitioned on two ranks no matter what, but in theory the problem can be fixed by improving the PETSc code.
I recommend finding a different sparse matrix test case which has an appropriate nonzero data structuring and partitioning that one can expect good performance on.
>
> Rohan
>
> On Thu, Feb 3, 2022 at 1:25 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>
> I suspect the new matrix has a very different parallel nonzero structure that results in MOST of the calculations taking place on the CPU (since the "off-diagonal" part of the matrix dominates the non-zero pattern). PETSc is not designed for this type of nonzero structure and will give a bad performance (CPU or GPU); it is not a "PDE-ish" type of nonzero structure.
>
>
>
>> On Feb 3, 2022, at 2:59 PM, Rohan Yadav <rohany at alumni.cmu.edu <mailto:rohany at alumni.cmu.edu>> wrote:
>>
>> I'm sorry, I did a little switch here. The original log view I sent for 2 runs was on a different input matrix. Based on Barry's request I switched to a different matrix as the original one did not fit on 1 GPU.
>>
>> > In the previously sent runs it was about 98% on GPU.
>>
>> Re 98% on the GPU though, my first email had a similar ratio in the log though:
>> ```
>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 8.73e+04 98
>> ```
>> The follow up log might be slightly different as well because I pushed a new log stage as requested by Stefano.
>>
>> Rohan
>>
>> On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
>>
>> Mark,
>>
>> Good eye. Something is definitely very different between this run and the previous (options, code change?). In the previously sent runs it was about 98% on GPU.
>>
>> Barry
>>
>>
>>> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <rohany at alumni.cmu.edu <mailto:rohany at alumni.cmu.edu>> wrote:
>>>
>>> > Please send the code that builds the sparse B matrix and the setMatToConstant() routine.
>>>
>>> Setting to a constant:
>>> ```
>>> void setMatToConstant(Mat mat, PetscScalar c) {
>>> PetscInt rStart, rEnd, m, n;
>>> MatGetSize(mat, &m, &n);
>>> MatGetOwnershipRange(mat, &rStart, &rEnd);
>>> for (int i = rStart; i < rEnd; i++) {
>>> for (int j = 0; j < n; j++) {
>>> MatSetValue(mat, i, j, c, INSERT_VALUES);
>>> }
>>> }
>>> MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>>> MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
>>> }
>>> ```
>>>
>>> Loading sparse matrix from disk:
>>> ```
>>> int loadMatrixFromFile(Mat* A, char* filename) {
>>> auto ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>>> MatSetFromOptions(*A);
>>> PetscViewer viewer;
>>> PetscViewerCreate(PETSC_COMM_WORLD, &viewer);
>>> PetscViewerSetType(viewer, PETSCVIEWERBINARY);
>>> PetscViewerFileSetMode(viewer, FILE_MODE_READ);
>>> PetscViewerFileSetName(viewer, filename);
>>> MatLoad(*A, viewer);
>>> return 0;
>>> }
>>> ```
>>> These are only called once and should not affect the computation in a loop though.
>>> > But first please verify that if you run with one MPI rank the "on GPU" and the overall flop rates for the MatMatMult() are almost the same and there is no copy from the GPU for each multiply?
>>>
>>> Yes, with 1 mpi rank / GPU there are no extra copies done. As soon as I move to 2 ranks I see this behavior.
>>>
>>> Here are updated logs with a new stage for 2 ranks. I've staged the logs into "MyComputation".
>>>
>>> ```
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2 processors, by yadav2 Thu Feb 3 09:27:30 2022
>>> Using Petsc Release Version 3.16.3, unknown
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 2.091e+02 1.001 2.090e+02
>>> Objects: 4.800e+01 1.000 4.800e+01
>>> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11
>>> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09
>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
>>> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11
>>> MPI Reductions: 8.100e+01 1.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N --> 2N flop
>>> and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total Count %Total Avg %Total Count %Total
>>> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 42.9% 1.466e+09 34.8% 4.300e+01 53.1%
>>> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 57.1% 2.058e+09 65.2% 2.000e+01 24.7%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flop: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all processors
>>> Mess: number of messages sent
>>> AvgLen: average message length (bytes)
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flop in this phase
>>> %M - percent messages in this phase %L - percent message lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>> CpuToGpu Count: total number of CPU to GPU copies per processor
>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>> GpuToCpu Count: total number of GPU to CPU copies per processor
>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>> GPU %F: percent flops on GPU in this event
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
>>> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0 0.00e+00 0
>>> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10 2.08e+04 5
>>> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0 0.00e+00 0
>>> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0 0.00e+00 0
>>> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11 2.29e+04 0
>>> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> --- Event Stage 1: MyComputation
>>>
>>> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01 3 0 0 0 25 7 0 0 0100 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20 4.16e+04 5
>>> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20 4.16e+04 0
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 17 10 20381695840 0.
>>> Viewer 2 0 0 0.
>>> Vector 4 1 1792 0.
>>> Index Set 2 2 31848152 0.
>>> Star Forest Graph 3 0 0 0.
>>>
>>> --- Event Stage 1: MyComputation
>>>
>>> Matrix 20 20 40763391680 0.
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 3.96e-08
>>> Average time for MPI_Barrier(): 8.184e-07
>>> Average time for zero size MPI_Send(): 2.8165e-06
>>> #PETSc Option Table entries:
>>> -bench spmm
>>> -enable_gpu
>>> -log_view
>>> -mat_type aijcusparse
>>> -matload_block_size 1
>>> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc
>>> -n 20
>>> -vec_type cuda
>>> -warmup 10
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc --with-clanguage=C --with-cxx-dialect=C++17 --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ --with-cuda=1 --with-debugging=0 --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran --with-fftw=0 --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= PETSC_ARCH=arch-linux-c-opt
>>> -----------------------------------------
>>> Libraries compiled on 2022-01-21 06:41:50 on lassen111
>>> Machine characteristics: Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>> Using PETSc arch:
>>> -----------------------------------------
>>>
>>> Using C compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc -g -DNoChange -fPIC "-O3"
>>> Using Fortran compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran -g -fPIC
>>> -----------------------------------------
>>>
>>> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>> -----------------------------------------
>>>
>>> Using C linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> Using Fortran linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath -lstdc++ -ldl
>>> -----------------------------------------
>>> ```
>>>
>>> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <stefano.zampini at gmail.com <mailto:stefano.zampini at gmail.com>> wrote:
>>>
>>>
>>> 1) It uses MatMPIDenseScatter() to move to the other ranks their needed rows of the C matrix. That function has the call MatDenseGetArrayRead() normally would trigger a copy of C up to the CPU each time. But since C is not changing in your test run I guess it only triggers one copy.
>>>
>>> 2) If uses MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr); to do the off diagonal part of the product but this triggers for each multiply a copy of the result matrix from the CPU to the GPU (hugely expensive)
>>>
>>> For performance there needs to be a new routine MatMatMultNumeric_MPIAIJCUSPRSE_MPICUDADense() that is smarter about the needed MPI communication so it only moves exactly what it needs to the other ranks and it does the off-diagonal part of the product on the GPU so it does not need to copy the result up to the CPU.
>>>
>>>
>>> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND
>>>
>>> Rohan
>>> I would suggest to add PetscLogStage around your performance loop (do a warmup outside of it) and send the relevant portion of the log
>>>
>>> Barry
>>>
>>>
>>>
>>>
>>>
>>>
>>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>>
>>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 processors, by yadav2 Wed Feb 2 17:23:19 2022
>>>> Using Petsc Release Version 3.16.3, unknown
>>>>
>>>> Max Max/Min Avg Total
>>>> Time (sec): 1.163e+02 1.000 1.163e+02
>>>> Objects: 4.800e+01 1.000 4.800e+01
>>>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12
>>>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10
>>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
>>>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09
>>>> MPI Reductions: 8.100e+01 1.000
>>>>
>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>> e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>> and VecAXPY() for complex vectors of length N --> 8N flop
>>>>
>>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
>>>> Avg %Total Avg %Total Count %Total Avg %Total Count %Total
>>>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0% 7.267e+07 100.0% 6.300e+01 77.8%
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>> Phase summary info:
>>>> Count: number of times phase was executed
>>>> Time and Flop: Max - maximum over all processors
>>>> Ratio - ratio of maximum to minimum over all processors
>>>> Mess: number of messages sent
>>>> AvgLen: average message length (bytes)
>>>> Reduct: number of global reductions
>>>> Global: entire computation
>>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>> %T - percent time in this phase %F - percent flop in this phase
>>>> %M - percent messages in this phase %L - percent message lengths in this phase
>>>> %R - percent reductions in this phase
>>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>> CpuToGpu Count: total number of CPU to GPU copies per processor
>>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>> GpuToCpu Count: total number of GPU to CPU copies per processor
>>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>> GPU %F: percent flops on GPU in this event
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
>>>> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 0.00e+00 0
>>>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 0.00e+00 0
>>>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 8.73e+04 98
>>>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 0.00e+00 0
>>>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 0.00e+00 0
>>>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 9.02e+04 0
>>>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type Creations Destructions Memory Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> Matrix 37 30 2867511840 0.
>>>> Viewer 2 0 0 0.
>>>> Vector 4 1 1792 0.
>>>> Index Set 2 2 1495248 0.
>>>> Star Forest Graph 3 0 0 0.
>>>> ========================================================================================================================
>>>> Average time to get PetscTime(): 3.83e-08
>>>> Average time for MPI_Barrier(): 7.874e-07
>>>> Average time for zero size MPI_Send(): 3.4035e-06
>>>> #PETSc Option Table entries:
>>>> -bench spmm
>>>> -enable_gpu
>>>> -log_view
>>>> -mat_type aijcusparse
>>>> -matload_block_size 1
>>>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
>>>> -n 20
>>>> -vec_type cuda
>>>> -warmup 10
>>>> ```
>>>>
>>>> Thanks,
>>>>
>>>> Rohan Yadav
>>>>
>>>
>>>
>>>
>>> --
>>> Stefano
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20220203/9f3090ec/attachment-0001.html>
More information about the petsc-users
mailing list