[petsc-users] PETSc GPU MatMatMult performance question

Stefano Zampini stefano.zampini at gmail.com
Thu Feb 3 01:59:40 CST 2022


1) It uses MatMPIDenseScatter() to move the rows of the C matrix that the
> other ranks need. That function calls MatDenseGetArrayRead(), which would
> normally trigger a copy of C up to the CPU each time. But since C is not
> changing in your test run, I guess it only triggers one copy.
>
> 2) It uses
> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
> to do the off-diagonal part of the product, but this triggers, for each
> multiply, a copy of the result matrix from the CPU to the GPU (hugely
> expensive).
>
> For performance there needs to be a new routine, MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense(),
> that is smarter about the needed MPI communication, so it only moves exactly
> what it needs to the other ranks, and does the off-diagonal part of the
> product on the GPU so it does not need to copy the result up to the CPU.
>
>
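
As a hedged aside on the copies described above: the sketch below is not the actual MatMPIDenseScatter() internals, just an illustration of why going through the host accessor of a GPU-resident dense matrix can force a device-to-host transfer, while the CUDA accessor hands back the device pointer directly. The function name AccessDenseTwoWays and the matrix B are placeholders.

#include <petscmat.h>

/* Sketch only: B is assumed to be a MATDENSECUDA matrix whose up-to-date
   values currently live on the GPU. */
static PetscErrorCode AccessDenseTwoWays(Mat B)
{
  const PetscScalar *host,*dev;
  PetscErrorCode     ierr;

  PetscFunctionBegin;
  /* Host accessor: may trigger a device-to-host copy of the whole matrix
     if the host values are out of date. */
  ierr = MatDenseGetArrayRead(B,&host);CHKERRQ(ierr);
  ierr = MatDenseRestoreArrayRead(B,&host);CHKERRQ(ierr);

  /* CUDA accessor: returns the device pointer; no host transfer. */
  ierr = MatDenseCUDAGetArrayRead(B,&dev);CHKERRQ(ierr);
  ierr = MatDenseCUDARestoreArrayRead(B,&dev);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
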
MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND

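For context, here is a rough sketch of the public MatProduct sequence for C = A*B that ends up in that backend routine; the wrapper name ProductAB is made up, but MatProductCreate(), MatProductSetType(), MatProductSetFromOptions(), MatProductSymbolic() and MatProductNumeric() are the standard PETSc calls, and the backend-specific implementation is selected inside MatProductSetFromOptions().

#include <petscmat.h>

/* Sketch of the MatProduct flow for C = A*B, with A of type MATMPIAIJCUSPARSE
   and B MPI dense; the dispatch to MatProductSetFromOptions_MPIAIJBACKEND
   happens inside MatProductSetFromOptions(). */
static PetscErrorCode ProductAB(Mat A,Mat B,Mat *C)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatProductCreate(A,B,NULL,C);CHKERRQ(ierr);
  ierr = MatProductSetType(*C,MATPRODUCT_AB);CHKERRQ(ierr);
  ierr = MatProductSetFromOptions(*C);CHKERRQ(ierr);
  ierr = MatProductSymbolic(*C);CHKERRQ(ierr);
  ierr = MatProductNumeric(*C);CHKERRQ(ierr); /* call again after the values of A or B change */
  PetscFunctionReturn(0);
}
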
Rohan,
I would suggest adding a PetscLogStage around your performance loop (do a
warmup outside of it) and sending the relevant portion of the log.
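
Something along these lines (a minimal sketch: the matrices A, B, C, the counts nwarmup/niter, and the stage label are placeholders, and C is assumed to come from an earlier MatMatMult() call with MAT_INITIAL_MATRIX):

#include <petscmat.h>

/* Sketch: time only the measured loop in its own log stage. */
static PetscErrorCode BenchSpMM(Mat A,Mat B,Mat C,PetscInt nwarmup,PetscInt niter)
{
  PetscLogStage  stage;
  PetscInt       i;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogStageRegister("SpMM bench",&stage);CHKERRQ(ierr);
  for (i = 0; i < nwarmup; i++) {       /* warmup, outside the logged stage */
    ierr = MatMatMult(A,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < niter; i++) {         /* measured loop */
    ierr = MatMatMult(A,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With that in place, -log_view reports the measured loop as its own stage, separate from MatLoad and the setup costs that currently sit in the Main Stage.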


> Barry
>
>
>
>
>
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a  named lassen457 with 2 processors, by yadav2 Wed Feb  2 17:23:19 2022
> Using Petsc Release Version 3.16.3, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           1.163e+02     1.000   1.163e+02
> Objects:              4.800e+01     1.000   4.800e+01
> Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
> Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
> MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
> MPI Reductions:       8.100e+01     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01 100.0%  7.267e+07      100.0%  6.300e+01  77.8%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided          2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00  0  0  3  0  2   0  0  3  0  3     0       0      0 0.00e+00    0 0.00e+00  0
> BuildTwoSidedF         1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyBegin      32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyEnd        32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  1  0  0  0  7   1  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
> MatZeroEntries         1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatLoad                1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15  0  9-10 26  15  0  9-10 33     0       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultSym         60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01  1  0  6  0 40   1  0  6  0 51     0       0      0 0.00e+00    0 0.00e+00  0
> MatMatMultNum         30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110  0  37100 86110  0 28598   920026      2 6.71e+03   30 8.73e+04 98
> MatCUSPARSCopyTo       1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 3.80e+03    0 0.00e+00  0
> MatDenseCopyTo         1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 2.91e+03    0 0.00e+00  0
> MatDenseCopyFrom      31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0       0      0 0.00e+00   31 9.02e+04  0
> VecSet                 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFSetGraph             1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFSetUp                1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 1.0e+00  0  0  6  0  1   0  0  6  0  2     0       0      0 0.00e+00    0 0.00e+00  0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix    37             30   2867511840     0.
>               Viewer     2              0            0     0.
>               Vector     4              1         1792     0.
>            Index Set     2              2      1495248     0.
>    Star Forest Graph     3              0            0     0.
> ========================================================================================================================
> Average time to get PetscTime(): 3.83e-08
> Average time for MPI_Barrier(): 7.874e-07
> Average time for zero size MPI_Send(): 3.4035e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
> -n 20
> -vec_type cuda
> -warmup 10
>
>
> Thanks,
>
>
> Rohan Yadav
>
>
>
>

-- 
Stefano