[petsc-dev] Kokkos/Crusher performance

Jed Brown jed at jedbrown.org
Sat Jan 22 09:25:30 CST 2022


Mark Adams <mfadams at lbl.gov> writes:

> On Fri, Jan 21, 2022 at 9:55 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> Interesting, Is this with all native Kokkos kernels or do some kokkos
>> kernels use rocm?
>>
>
> Ah, good question. I often run with tpl=0 but I did not specify here on
> Crusher. In looking at the log files I see
> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/externalpackages/git.kokkos-kernels/src/impl/tpls
>
> Here is a run with tpls turned off. These tpl includes are gone.
>
> It looks pretty much the same. A little slower but that could be noise.

> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************

We should update that banner to say 160 chars, since that's the width we use now.

> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a arch-olcf-crusher named crusher001 with 64 processors, by adams Fri Jan 21 23:48:31 2022
> Using Petsc Development GIT revision: v3.16.3-665-g1012189b9a  GIT Date: 2022-01-21 16:28:20 +0000
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           7.919e+01     1.000   7.918e+01
> Objects:              2.088e+03     1.164   1.852e+03
> Flop:                 2.448e+10     1.074   2.393e+10  1.532e+12
> Flop/sec:             3.091e+08     1.074   3.023e+08  1.935e+10
> MPI Messages:         1.651e+04     3.673   9.388e+03  6.009e+05
> MPI Message Lengths:  2.278e+08     2.093   1.788e+04  1.074e+10
> MPI Reductions:       1.988e+03     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 7.4289e+01  93.8%  6.0889e+11  39.8%  2.265e+05  37.7%  2.175e+04       45.8%  7.630e+02  38.4%
>  1:         PCSetUp: 3.1604e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>  2:  KSP Solve only: 4.8576e+00   6.1%  9.2287e+11  60.2%  3.744e+05  62.3%  1.554e+04       54.2%  1.206e+03  60.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> PetscBarrier           5 1.0 2.0665e-01 1.1 0.00e+00 0.0 1.1e+04 8.0e+02 1.8e+01  0  0  2  0  1   0  0  5  0  2     0       0      0 0.00e+00    0 0.00e+00  0
> BuildTwoSided         40 1.0 2.6017e+0010.5 0.00e+00 0.0 9.9e+03 4.0e+00 4.0e+01  3  0  2  0  2   3  0  4  0  5     0       0      0 0.00e+00    0 0.00e+00  0
> BuildTwoSidedF         6 1.0 2.5318e+0010.9 0.00e+00 0.0 2.2e+03 4.0e+05 6.0e+00  3  0  0  8  0   3  0  1 18  1     0       0      0 0.00e+00    0 0.00e+00  0
> MatMult            1210960.2 1.2055e+00 2.1 6.71e+09 1.1 1.9e+05 1.5e+04 2.0e+00  1 27 32 27  0   1 69 85 59  0 346972       0      1 1.14e-01    0 0.00e+00 100
> MatAssemblyBegin      43 1.0 2.6856e+00 6.9 0.00e+00 0.0 2.2e+03 4.0e+05 6.0e+00  3  0  0  8  0   3  0  1 18  1     0       0      0 0.00e+00    0 0.00e+00  0
> MatAssemblyEnd        43 1.0 4.6070e-01 2.5 1.18e+06 0.0 0.0e+00 0.0e+00 9.0e+00  0  0  0  0  0   1  0  0  0  1   120       0      0 0.00e+00    0 0.00e+00  0
> MatZeroEntries         3 1.0 5.4884e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> MatView                1 1.0 2.5364e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> KSPSetUp               1 1.0 2.4612e-03 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> KSPSolve               1 1.0 3.1074e+00 1.0 7.39e+09 1.1 1.9e+05 1.5e+04 6.0e+02  4 30 31 27 30   4 76 83 59 79 148494   1022941      1 1.14e-01    0 0.00e+00 100
> SNESSolve              1 1.0 2.7026e+01 1.0 8.56e+09 1.1 1.9e+05 1.8e+04 6.1e+02 34 35 32 31 31  36 88 84 69 80 19853   1022661      3 2.36e+00    2 3.62e+00 86
> SNESSetUp              1 1.0 7.6240e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.8e+01 10  0  1  9  1  10  0  2 21  2     0       0      0 0.00e+00    0 0.00e+00  0
> SNESFunctionEval       2 1.0 6.2213e+00 1.1 7.96e+08 1.0 1.7e+03 1.3e+04 3.0e+00  8  3  0  0  0   8  8  1  0  0  8149   21036      3 4.32e+00    2 3.62e+00  0
> SNESJacobianEval       2 1.0 5.7439e+01 1.0 1.52e+09 1.0 1.7e+03 5.4e+05 2.0e+00 72  6  0  8  0  77 16  1 18  0  1683       0      0 0.00e+00    2 3.62e+00  0
> DMCreateInterp         1 1.0 1.0837e-02 1.0 8.29e+04 1.0 1.1e+03 8.0e+02 1.6e+01  0  0  0  0  1   0  0  0  0  2   490       0      0 0.00e+00    0 0.00e+00  0
> DMCreateMat            1 1.0 7.6222e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.8e+01 10  0  1  9  1  10  0  2 21  2     0       0      0 0.00e+00    0 0.00e+00  0
> Mesh Partition         1 1.0 2.5208e-02 1.0 0.00e+00 0.0 3.2e+02 1.1e+02 8.0e+00  0  0  0  0  0   0  0  0  0  1     0       0      0 0.00e+00    0 0.00e+00  0
> Mesh Migration         1 1.0 9.2974e-03 1.0 0.00e+00 0.0 1.8e+03 8.3e+01 2.9e+01  0  0  0  0  1   0  0  1  0  4     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPartSelf         1 1.0 8.4227e-0493.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPartLblInv       1 1.0 1.0979e-03 4.5 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPartLblSF        1 1.0 4.5747e-03 1.7 0.00e+00 0.0 1.3e+02 5.6e+01 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPartStrtSF       1 1.0 1.8253e-02 1.7 0.00e+00 0.0 6.3e+01 2.2e+02 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPointSF          1 1.0 1.9011e-03 1.1 0.00e+00 0.0 1.3e+02 2.7e+02 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexInterp          19 1.0 1.0434e-03 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexDistribute       1 1.0 3.6410e-02 1.0 0.00e+00 0.0 2.2e+03 9.7e+01 3.7e+01  0  0  0  0  2   0  0  1  0  5     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexDistCones        1 1.0 1.1016e-03 1.2 0.00e+00 0.0 3.8e+02 1.4e+02 2.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexDistLabels       1 1.0 1.5538e-03 1.0 0.00e+00 0.0 9.0e+02 6.6e+01 2.4e+01  0  0  0  0  1   0  0  0  0  3     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexDistField        1 1.0 6.3540e-03 1.0 0.00e+00 0.0 4.4e+02 5.9e+01 2.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexStratify        33 1.0 1.4687e-02 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  0   0  0  0  0  1     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexSymmetrize      33 1.0 1.9498e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexPrealloc         1 1.0 7.6108e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.6e+01 10  0  1  9  1  10  0  2 21  2     0       0      0 0.00e+00    0 0.00e+00  0
> DMPlexResidualFE       2 1.0 3.7908e+00 1.1 7.87e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  3  0  0  0   5  8  0  0  0 13285       0      0 0.00e+00    0 0.00e+00  0
> DMPlexJacobianFE       2 1.0 5.7067e+01 1.0 1.51e+09 1.0 1.1e+03 8.0e+05 2.0e+00 72  6  0  8  0  77 16  0 18  0  1689       0      0 0.00e+00    0 0.00e+00  0
> DMPlexInterpFE         1 1.0 1.0649e-02 1.0 8.29e+04 1.0 1.1e+03 8.0e+02 1.6e+01  0  0  0  0  1   0  0  0  0  2   498       0      0 0.00e+00    0 0.00e+00  0
> SFSetGraph            43 1.0 1.0816e-03 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFSetUp               34 1.0 1.5032e-01 1.5 0.00e+00 0.0 1.8e+04 2.1e+04 3.4e+01  0  0  3  3  2   0  0  8  7  4     0       0      0 0.00e+00    0 0.00e+00  0
> SFBcastBegin          65 1.0 2.2730e+00145.4 0.00e+00 0.0 1.3e+04 1.3e+04 0.0e+00  2  0  2  2  0   2  0  6  3  0     0       0      1 1.68e-01    4 7.24e+00  0
> SFBcastEnd            65 1.0 1.7421e+0062.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 6.24e-03    0 0.00e+00  0
> SFReduceBegin         16 1.0 1.9556e-0184.2 5.24e+05 1.0 4.2e+03 8.5e+04 0.0e+00  0  0  1  3  0   0  0  2  7  0   170       0      2 4.15e+00    0 0.00e+00 100
> SFReduceEnd           16 1.0 9.7152e-0132.7 2.50e+04 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     1       0      0 0.00e+00    0 0.00e+00 100
> SFFetchOpBegin         2 1.0 3.1814e-03104.2 0.00e+00 0.0 5.6e+02 2.0e+05 0.0e+00  0  0  0  1  0   0  0  0  2  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFFetchOpEnd           2 1.0 2.8296e-02 3.6 0.00e+00 0.0 5.6e+02 2.0e+05 0.0e+00  0  0  0  1  0   0  0  0  2  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFCreateEmbed          8 1.0 1.0733e-0172.8 0.00e+00 0.0 2.0e+03 7.0e+02 0.0e+00  0  0  0  0  0   0  0  1  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFDistSection          9 1.0 1.0892e-02 2.3 0.00e+00 0.0 4.1e+03 5.9e+03 1.1e+01  0  0  1  0  1   0  0  2  0  1     0       0      0 0.00e+00    0 0.00e+00  0
> SFSectionSF           16 1.0 5.2589e-02 2.2 0.00e+00 0.0 5.8e+03 2.0e+04 1.6e+01  0  0  1  1  1   0  0  3  2  2     0       0      0 0.00e+00    0 0.00e+00  0
> SFRemoteOff            7 1.0 1.2178e-0124.0 0.00e+00 0.0 6.1e+03 1.3e+03 4.0e+00  0  0  1  0  0   0  0  3  0  1     0       0      0 0.00e+00    0 0.00e+00  0
> SFPack               290 1.0 7.5146e-01155.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      2 1.51e-01    0 0.00e+00  0
> SFUnpack             292 1.0 1.9789e-0158.9 5.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   174       0      0 6.24e-03    0 0.00e+00 100
> VecTDot              401 1.0 1.4788e+00 1.8 2.10e+08 1.0 0.0e+00 0.0e+00 4.0e+02  2  1  0  0 20   2  2  0  0 53  8992   109803      0 0.00e+00    0 0.00e+00 100
> VecNorm              201 1.0 7.4026e-01 2.4 1.05e+08 1.0 0.0e+00 0.0e+00 2.0e+02  0  0  0  0 10   0  1  0  0 26  9004   127483      0 0.00e+00    0 0.00e+00 100
> VecCopy                2 1.0 1.4854e-0310.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecSet                54 1.0 8.7686e-03 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAXPY              400 1.0 3.9120e-0120.9 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  2  0  0  0 33909   73190      0 0.00e+00    0 0.00e+00 100
> VecAYPX              199 1.0 1.3597e-01 6.9 1.04e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  1  0  0  0 48535   138139      0 0.00e+00    0 0.00e+00 100
> VecPointwiseMult     201 1.0 1.4152e-0110.2 5.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  1  0  0  0 23550   69371      0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      201 1.0 6.5846e-0117.0 0.00e+00 0.0 1.9e+05 1.5e+04 2.0e+00  0  0 32 27  0   0  0 85 59  0     0       0      1 1.14e-01    0 0.00e+00  0
> VecScatterEnd        201 1.0 6.6968e-01 9.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> DualSpaceSetUp         2 1.0 5.2698e-03 1.2 1.80e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    22       0      0 0.00e+00    0 0.00e+00  0
> FESetUp                2 1.0 3.3009e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> PCSetUp                1 1.0 9.6290e-06 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> PCApply              201 1.0 1.9920e-01 2.9 5.27e+07 1.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  1  0  0  0 16731   47897      0 0.00e+00    0 0.00e+00 100
>
> --- Event Stage 1: PCSetUp
>
> PCSetUp                1 1.0 3.6638e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0 100  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>
> --- Event Stage 2: KSP Solve only
>
> MatMult              400 1.0 1.3375e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  24 91100100  0 625440       0      0 0.00e+00    0 0.00e+00 100

So this is about 3.3 ms per iteration (1.34 s over 400 MatMult calls). Each iteration involves a few kernels with a sync through the host, but on other GPUs it's when you get under 1 ms (or perhaps 500 µs) per iteration that performance really suffers.

625 GF/s at roughly 6 bytes/flop is about 3750 GB/s. What is STREAM here? If each chiplet (GCD) of the 4 dual-chiplet devices realizes at least MI100-class bandwidth (~1.2 TB/s), the node aggregate is about 9.6 TB/s, so this is less than 50% of it. I wonder what others have seen.
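If it would help to pin down what's achievable, a quick Kokkos triad along these lines gives a ballpark device STREAM number to compare against (untested sketch; the array size and repeat count are arbitrary, and it measures whatever single device the default execution space lands on):

  #include <Kokkos_Core.hpp>
  #include <chrono>
  #include <cstdio>

  int main(int argc, char **argv) {
    Kokkos::initialize(argc, argv);
    {
      const int64_t N = 1 << 26;   // ~0.5 GB per array of doubles
      Kokkos::View<double *> a("a", N), b("b", N), c("c", N);
      Kokkos::deep_copy(b, 1.0);
      Kokkos::deep_copy(c, 2.0);
      const double alpha = 3.0;
      const int    nrep  = 20;
      Kokkos::fence();
      auto t0 = std::chrono::steady_clock::now();
      for (int r = 0; r < nrep; ++r) {
        Kokkos::parallel_for("triad", N, KOKKOS_LAMBDA(const int64_t i) { a(i) = b(i) + alpha * c(i); });
      }
      Kokkos::fence();
      double t = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
      // read b, read c, write a: 3 doubles (24 bytes) streamed per entry per repetition
      printf("triad: %.0f GB/s\n", 24.0 * (double)N * nrep / t / 1e9);
    }
    Kokkos::finalize();
    return 0;
  }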

> MatView                2 1.0 4.3457e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> KSPSolve               2 1.0 4.9810e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  6 60 62 54 61 100100100100100 185277   1102535      0 0.00e+00    0 0.00e+00 100
> SFPack               400 1.0 2.6830e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFUnpack             400 1.0 2.2198e-04 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecTDot              802 1.0 2.0418e+00 1.2 4.20e+08 1.0 0.0e+00 0.0e+00 8.0e+02  2  2  0  0 40  38  3  0  0 67 13026   112538      0 0.00e+00    0 0.00e+00 100
> VecNorm              402 1.0 1.4270e+00 2.4 2.11e+08 1.0 0.0e+00 0.0e+00 4.0e+02  1  1  0  0 20  14  1  0  0 33  9343   134367      0 0.00e+00    0 0.00e+00 100
> VecCopy                4 1.0 5.9396e-0324.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecSet                 4 1.0 3.7188e-0313.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAXPY              800 1.0 7.4812e-0121.6 4.19e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0  14  3  0  0  0 35463   73999      0 0.00e+00    0 0.00e+00 100
> VecAYPX              398 1.0 2.5369e-01 6.5 2.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   4  1  0  0  0 52028   142028      0 0.00e+00    0 0.00e+00 100

Still weird that these are so different. Going off the total times, 400 µs is a lot for just one kernel launch, and yet the bandwidth is abysmal (using any of the numbers).
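Rough arithmetic from the table: 4.19e8 flops over 800 VecAXPY calls is about 2.6e5 vector entries per rank per call, i.e., roughly 6 MB of traffic per call, which even at 1 TB/s would take well under 10 µs. So nearly all of the measured per-call time has to be launch and synchronization overhead rather than memory traffic.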

> VecPointwiseMult     402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22515   70608      0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 0.0e+00  0  0 62 54  0   2  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> PCApply              402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0 22490   70608      0 0.00e+00    0 0.00e+00 100

Most of the MatMult time is attributed to VecScatterEnd here. Can you share a run of the same total problem size on 8 ranks (one rank per GPU)? 
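For example (treat this as a sketch; the exact binding flags and problem options should match however the 64-rank case was launched), something like

  srun -N1 -n8 -c8 --gpus-per-node=8 --gpu-bind=closest ./ex13 <same options as the 64-rank run, with refinement adjusted to keep the total size fixed> -log_view

so that there is exactly one rank per GCD.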

From the other log file (10x bigger problem):

> --- Event Stage 2: KSP Solve only
>
> MatMult              400 1.0 9.4001e+00 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00  2 55 62 54  0  65 91100100  0 721451       0      0 0.00e+00    0 0.00e+00 100

Similar bandwidth
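(721 GF/s at the same ~6 bytes/flop is about 4.3 TB/s, versus ~3.75 TB/s for the smaller problem.)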

> MatView                2 1.0 4.4729e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> KSPSolve               2 1.0 1.3945e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03  2 60 62 54 60 100100100100100 536128   2881308      0 0.00e+00    0 0.00e+00 100

I think the GPU Mflop/s number here is nonsense, but if it were accurate (at about 6 bytes/flop average), this would be 2.88 Tflop/s * 6 = 17.28 TB/s. The marketing material says 3.2 TB/s per (dual-chiplet) device, or 12.8 TB/s for the node. Those numbers are always nonsense, but the bandwidth implied by our GPU Mflop/s exceeds even them.

> SFPack               400 1.0 2.4445e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> SFUnpack             400 1.0 1.2255e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecTDot              802 1.0 2.7256e+00 1.9 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02  0  2  0  0 40  14  3  0  0 67 78523   335526      0 0.00e+00    0 0.00e+00 100
> VecNorm              402 1.0 1.9145e+00 3.7 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   6  1  0  0 33 56035   533339      0 0.00e+00    0 0.00e+00 100
> VecCopy                4 1.0 6.3156e-03 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecSet                 4 1.0 3.8228e-0315.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecAXPY              800 1.0 9.0587e-0111.1 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   6  3  0  0  0 235676   444654      0 0.00e+00    0 0.00e+00 100
> VecAYPX              398 1.0 1.9393e+0029.6 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   6  1  0  0  0 54767   65448      0 0.00e+00    0 0.00e+00 100

So if we look at GPU flops, this is around 3.5 TB/s for VecAXPY, which still seems low. Weird that it's now faster than VecAYPX (I don't trust these timings).
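For reference, the back-of-the-envelope here: VecAXPY does 2 flops per entry and streams three 8-byte values (read x, read y, write y), so about 12 bytes/flop; multiplying the reported flop rates by 12 puts the implied bandwidth somewhere in the 3-5 TB/s range depending on which column you believe.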

> VecPointwiseMult     402 1.0 3.5580e-01 6.2 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1  0  0  0 150758   318605      0 0.00e+00    0 0.00e+00 100
> VecScatterBegin      400 1.0 1.3900e+0028.9 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00  0  0 62 54  0   7  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        400 1.0 5.8686e+00 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

Still, more than half of the MatMult time is attributed to VecScatterEnd (about 5.9 s of the 9.4 s).

> PCApply              402 1.0 3.5612e-01 6.1 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1  0  0  0 150622   318605      0 0.00e+00    0 0.00e+00 100

