[petsc-dev] Kokkos/Crusher performance
Jed Brown
jed at jedbrown.org
Sat Jan 22 09:25:30 CST 2022
Mark Adams <mfadams at lbl.gov> writes:
> On Fri, Jan 21, 2022 at 9:55 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> Interesting, Is this with all native Kokkos kernels or do some kokkos
>> kernels use rocm?
>>
>
> Ah, good question. I often run with tpl=0 but I did not specify here on
> Crusher. In looking at the log files I see
> -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-crusher/externalpackages/git.kokkos-kernels/src/impl/tpls
>
> Here is a run with tpls turned off. These tpl includes are gone.
>
> It looks pretty much the same. A little slower but that could be noise.
> ************************************************************************************************************************
> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
> ************************************************************************************************************************
That banner should say 160 chars now, because that's the width we use.
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a arch-olcf-crusher named crusher001 with 64 processors, by adams Fri Jan 21 23:48:31 2022
> Using Petsc Development GIT revision: v3.16.3-665-g1012189b9a GIT Date: 2022-01-21 16:28:20 +0000
>
> Max Max/Min Avg Total
> Time (sec): 7.919e+01 1.000 7.918e+01
> Objects: 2.088e+03 1.164 1.852e+03
> Flop: 2.448e+10 1.074 2.393e+10 1.532e+12
> Flop/sec: 3.091e+08 1.074 3.023e+08 1.935e+10
> MPI Messages: 1.651e+04 3.673 9.388e+03 6.009e+05
> MPI Message Lengths: 2.278e+08 2.093 1.788e+04 1.074e+10
> MPI Reductions: 1.988e+03 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 2N flop
> and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total Avg %Total Count %Total
> 0: Main Stage: 7.4289e+01 93.8% 6.0889e+11 39.8% 2.265e+05 37.7% 2.175e+04 45.8% 7.630e+02 38.4%
> 1: PCSetUp: 3.1604e-02 0.0% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
> 2: KSP Solve only: 4.8576e+00 6.1% 9.2287e+11 60.2% 3.744e+05 62.3% 1.554e+04 54.2% 1.206e+03 60.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
> CpuToGpu Count: total number of CPU to GPU copies per processor
> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
> GpuToCpu Count: total number of GPU to CPU copies per processor
> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
> GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> PetscBarrier 5 1.0 2.0665e-01 1.1 0.00e+00 0.0 1.1e+04 8.0e+02 1.8e+01 0 0 2 0 1 0 0 5 0 2 0 0 0 0.00e+00 0 0.00e+00 0
> BuildTwoSided 40 1.0 2.6017e+0010.5 0.00e+00 0.0 9.9e+03 4.0e+00 4.0e+01 3 0 2 0 2 3 0 4 0 5 0 0 0 0.00e+00 0 0.00e+00 0
> BuildTwoSidedF 6 1.0 2.5318e+0010.9 0.00e+00 0.0 2.2e+03 4.0e+05 6.0e+00 3 0 0 8 0 3 0 1 18 1 0 0 0 0.00e+00 0 0.00e+00 0
> MatMult 1210960.2 1.2055e+00 2.1 6.71e+09 1.1 1.9e+05 1.5e+04 2.0e+00 1 27 32 27 0 1 69 85 59 0 346972 0 1 1.14e-01 0 0.00e+00 100
> MatAssemblyBegin 43 1.0 2.6856e+00 6.9 0.00e+00 0.0 2.2e+03 4.0e+05 6.0e+00 3 0 0 8 0 3 0 1 18 1 0 0 0 0.00e+00 0 0.00e+00 0
> MatAssemblyEnd 43 1.0 4.6070e-01 2.5 1.18e+06 0.0 0.0e+00 0.0e+00 9.0e+00 0 0 0 0 0 1 0 0 0 1 120 0 0 0.00e+00 0 0.00e+00 0
> MatZeroEntries 3 1.0 5.4884e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatView 1 1.0 2.5364e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSetUp 1 1.0 2.4612e-03 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSolve 1 1.0 3.1074e+00 1.0 7.39e+09 1.1 1.9e+05 1.5e+04 6.0e+02 4 30 31 27 30 4 76 83 59 79 148494 1022941 1 1.14e-01 0 0.00e+00 100
> SNESSolve 1 1.0 2.7026e+01 1.0 8.56e+09 1.1 1.9e+05 1.8e+04 6.1e+02 34 35 32 31 31 36 88 84 69 80 19853 1022661 3 2.36e+00 2 3.62e+00 86
> SNESSetUp 1 1.0 7.6240e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.8e+01 10 0 1 9 1 10 0 2 21 2 0 0 0 0.00e+00 0 0.00e+00 0
> SNESFunctionEval 2 1.0 6.2213e+00 1.1 7.96e+08 1.0 1.7e+03 1.3e+04 3.0e+00 8 3 0 0 0 8 8 1 0 0 8149 21036 3 4.32e+00 2 3.62e+00 0
> SNESJacobianEval 2 1.0 5.7439e+01 1.0 1.52e+09 1.0 1.7e+03 5.4e+05 2.0e+00 72 6 0 8 0 77 16 1 18 0 1683 0 0 0.00e+00 2 3.62e+00 0
> DMCreateInterp 1 1.0 1.0837e-02 1.0 8.29e+04 1.0 1.1e+03 8.0e+02 1.6e+01 0 0 0 0 1 0 0 0 0 2 490 0 0 0.00e+00 0 0.00e+00 0
> DMCreateMat 1 1.0 7.6222e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.8e+01 10 0 1 9 1 10 0 2 21 2 0 0 0 0.00e+00 0 0.00e+00 0
> Mesh Partition 1 1.0 2.5208e-02 1.0 0.00e+00 0.0 3.2e+02 1.1e+02 8.0e+00 0 0 0 0 0 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0
> Mesh Migration 1 1.0 9.2974e-03 1.0 0.00e+00 0.0 1.8e+03 8.3e+01 2.9e+01 0 0 0 0 1 0 0 1 0 4 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPartSelf 1 1.0 8.4227e-0493.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPartLblInv 1 1.0 1.0979e-03 4.5 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPartLblSF 1 1.0 4.5747e-03 1.7 0.00e+00 0.0 1.3e+02 5.6e+01 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPartStrtSF 1 1.0 1.8253e-02 1.7 0.00e+00 0.0 6.3e+01 2.2e+02 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPointSF 1 1.0 1.9011e-03 1.1 0.00e+00 0.0 1.3e+02 2.7e+02 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexInterp 19 1.0 1.0434e-03 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexDistribute 1 1.0 3.6410e-02 1.0 0.00e+00 0.0 2.2e+03 9.7e+01 3.7e+01 0 0 0 0 2 0 0 1 0 5 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexDistCones 1 1.0 1.1016e-03 1.2 0.00e+00 0.0 3.8e+02 1.4e+02 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexDistLabels 1 1.0 1.5538e-03 1.0 0.00e+00 0.0 9.0e+02 6.6e+01 2.4e+01 0 0 0 0 1 0 0 0 0 3 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexDistField 1 1.0 6.3540e-03 1.0 0.00e+00 0.0 4.4e+02 5.9e+01 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexStratify 33 1.0 1.4687e-02 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00 0 0 0 0 0 0 0 0 0 1 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexSymmetrize 33 1.0 1.9498e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexPrealloc 1 1.0 7.6108e+00 1.0 0.00e+00 0.0 5.3e+03 1.9e+05 1.6e+01 10 0 1 9 1 10 0 2 21 2 0 0 0 0.00e+00 0 0.00e+00 0
> DMPlexResidualFE 2 1.0 3.7908e+00 1.1 7.87e+08 1.0 0.0e+00 0.0e+00 0.0e+00 5 3 0 0 0 5 8 0 0 0 13285 0 0 0.00e+00 0 0.00e+00 0
> DMPlexJacobianFE 2 1.0 5.7067e+01 1.0 1.51e+09 1.0 1.1e+03 8.0e+05 2.0e+00 72 6 0 8 0 77 16 0 18 0 1689 0 0 0.00e+00 0 0.00e+00 0
> DMPlexInterpFE 1 1.0 1.0649e-02 1.0 8.29e+04 1.0 1.1e+03 8.0e+02 1.6e+01 0 0 0 0 1 0 0 0 0 2 498 0 0 0.00e+00 0 0.00e+00 0
> SFSetGraph 43 1.0 1.0816e-03 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFSetUp 34 1.0 1.5032e-01 1.5 0.00e+00 0.0 1.8e+04 2.1e+04 3.4e+01 0 0 3 3 2 0 0 8 7 4 0 0 0 0.00e+00 0 0.00e+00 0
> SFBcastBegin 65 1.0 2.2730e+00145.4 0.00e+00 0.0 1.3e+04 1.3e+04 0.0e+00 2 0 2 2 0 2 0 6 3 0 0 0 1 1.68e-01 4 7.24e+00 0
> SFBcastEnd 65 1.0 1.7421e+0062.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 6.24e-03 0 0.00e+00 0
> SFReduceBegin 16 1.0 1.9556e-0184.2 5.24e+05 1.0 4.2e+03 8.5e+04 0.0e+00 0 0 1 3 0 0 0 2 7 0 170 0 2 4.15e+00 0 0.00e+00 100
> SFReduceEnd 16 1.0 9.7152e-0132.7 2.50e+04 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1 0 0 0.00e+00 0 0.00e+00 100
> SFFetchOpBegin 2 1.0 3.1814e-03104.2 0.00e+00 0.0 5.6e+02 2.0e+05 0.0e+00 0 0 0 1 0 0 0 0 2 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFFetchOpEnd 2 1.0 2.8296e-02 3.6 0.00e+00 0.0 5.6e+02 2.0e+05 0.0e+00 0 0 0 1 0 0 0 0 2 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFCreateEmbed 8 1.0 1.0733e-0172.8 0.00e+00 0.0 2.0e+03 7.0e+02 0.0e+00 0 0 0 0 0 0 0 1 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFDistSection 9 1.0 1.0892e-02 2.3 0.00e+00 0.0 4.1e+03 5.9e+03 1.1e+01 0 0 1 0 1 0 0 2 0 1 0 0 0 0.00e+00 0 0.00e+00 0
> SFSectionSF 16 1.0 5.2589e-02 2.2 0.00e+00 0.0 5.8e+03 2.0e+04 1.6e+01 0 0 1 1 1 0 0 3 2 2 0 0 0 0.00e+00 0 0.00e+00 0
> SFRemoteOff 7 1.0 1.2178e-0124.0 0.00e+00 0.0 6.1e+03 1.3e+03 4.0e+00 0 0 1 0 0 0 0 3 0 1 0 0 0 0.00e+00 0 0.00e+00 0
> SFPack 290 1.0 7.5146e-01155.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 2 1.51e-01 0 0.00e+00 0
> SFUnpack 292 1.0 1.9789e-0158.9 5.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 174 0 0 6.24e-03 0 0.00e+00 100
> VecTDot 401 1.0 1.4788e+00 1.8 2.10e+08 1.0 0.0e+00 0.0e+00 4.0e+02 2 1 0 0 20 2 2 0 0 53 8992 109803 0 0.00e+00 0 0.00e+00 100
> VecNorm 201 1.0 7.4026e-01 2.4 1.05e+08 1.0 0.0e+00 0.0e+00 2.0e+02 0 0 0 0 10 0 1 0 0 26 9004 127483 0 0.00e+00 0 0.00e+00 100
> VecCopy 2 1.0 1.4854e-0310.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecSet 54 1.0 8.7686e-03 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 400 1.0 3.9120e-0120.9 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 2 0 0 0 33909 73190 0 0.00e+00 0 0.00e+00 100
> VecAYPX 199 1.0 1.3597e-01 6.9 1.04e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 0 0 48535 138139 0 0.00e+00 0 0.00e+00 100
> VecPointwiseMult 201 1.0 1.4152e-0110.2 5.27e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 0 0 23550 69371 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin 201 1.0 6.5846e-0117.0 0.00e+00 0.0 1.9e+05 1.5e+04 2.0e+00 0 0 32 27 0 0 0 85 59 0 0 0 1 1.14e-01 0 0.00e+00 0
> VecScatterEnd 201 1.0 6.6968e-01 9.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> DualSpaceSetUp 2 1.0 5.2698e-03 1.2 1.80e+03 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 0 0 0.00e+00 0 0.00e+00 0
> FESetUp 2 1.0 3.3009e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> PCSetUp 1 1.0 9.6290e-06 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> PCApply 201 1.0 1.9920e-01 2.9 5.27e+07 1.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 1 0 0 0 16731 47897 0 0.00e+00 0 0.00e+00 100
>
> --- Event Stage 1: PCSetUp
>
> PCSetUp 1 1.0 3.6638e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 100 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> --- Event Stage 2: KSP Solve only
>
> MatMult 400 1.0 1.3375e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00 1 55 62 54 0 24 91100100 0 625440 0 0 0.00e+00 0 0.00e+00 100
So this is about 3.3 ms per iteration. Each iteration involves a few kernels that synchronize through the host, but on other GPUs it's only when you get under 1 ms (or perhaps 500 µs) per iteration that performance really suffers.
625 GF works out to about 3750 GB/s at roughly 6 bytes/flop for the SpMV. What is STREAM here? If each chiplet of the 4 dual-chiplet devices realizes at least MI100 bandwidth, then this is less than 50% of it. I wonder what others have seen.
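For concreteness, here is that back-of-envelope arithmetic as a small Python sketch. The 6 bytes/flop SpMV figure, the layout of 64 ranks sharing one node's 8 GCDs, and the MI100-class ~1.2 TB/s per-GCD STREAM number are my assumptions, not measurements:

# Numbers copied from the "KSP Solve only" MatMult line above; the per-GCD
# STREAM bandwidth is an assumed MI100-class figure, not a Crusher measurement.
matmult_time_s = 1.3375      # max time over ranks for 400 MatMult calls
matmult_calls  = 400
matmult_gflops = 625.440     # "Total Mflop/s" column / 1000

bytes_per_flop = 6.0         # ~12 bytes per nonzero / 2 flops for CSR SpMV
gcds           = 8           # one node: 4 dual-chiplet MI250X devices
stream_per_gcd = 1200.0      # GB/s per GCD, assumed

per_call_ms  = 1e3 * matmult_time_s / matmult_calls    # ~3.3 ms per MatMult
achieved_gbs = matmult_gflops * bytes_per_flop         # ~3750 GB/s aggregate
fraction     = achieved_gbs / (gcds * stream_per_gcd)  # ~0.39, i.e. under 50%

print(f"{per_call_ms:.1f} ms/MatMult, ~{achieved_gbs:.0f} GB/s, {100*fraction:.0f}% of assumed STREAM")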
> MatView 2 1.0 4.3457e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSolve 2 1.0 4.9810e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03 6 60 62 54 61 100100100100100 185277 1102535 0 0.00e+00 0 0.00e+00 100
> SFPack 400 1.0 2.6830e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFUnpack 400 1.0 2.2198e-04 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecTDot 802 1.0 2.0418e+00 1.2 4.20e+08 1.0 0.0e+00 0.0e+00 8.0e+02 2 2 0 0 40 38 3 0 0 67 13026 112538 0 0.00e+00 0 0.00e+00 100
> VecNorm 402 1.0 1.4270e+00 2.4 2.11e+08 1.0 0.0e+00 0.0e+00 4.0e+02 1 1 0 0 20 14 1 0 0 33 9343 134367 0 0.00e+00 0 0.00e+00 100
> VecCopy 4 1.0 5.9396e-0324.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecSet 4 1.0 3.7188e-0313.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 800 1.0 7.4812e-0121.6 4.19e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 14 3 0 0 0 35463 73999 0 0.00e+00 0 0.00e+00 100
> VecAYPX 398 1.0 2.5369e-01 6.5 2.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 4 1 0 0 0 52028 142028 0 0.00e+00 0 0.00e+00 100
Still weird that VecAXPY and VecAYPX are so different. Going off the total times, 400 µs is a lot for just one kernel launch, and yet the bandwidth is abysmal (using any of the numbers).
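To put numbers on "abysmal": both kernels stream three vectors per two flops, so roughly 12 bytes/flop by my accounting, and the rates in the log translate to something like:

# VecAXPY (y = a*x + y) and VecAYPX (y = x + b*y) each read two vectors and
# write one per two flops, so ~12 bytes/flop; their rates should be comparable.
# Mflop/s values are the "Total" and "GPU" columns from the log above.
for name, total_gflops, gpu_gflops in [("VecAXPY", 35.463, 73.999),
                                       ("VecAYPX", 52.028, 142.028)]:
    print(f"{name}: ~{12 * total_gflops:.0f} GB/s (host timer), ~{12 * gpu_gflops:.0f} GB/s (GPU timer)")

All of these are far below what a node should be able to stream.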
> VecPointwiseMult 402 1.0 2.9605e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 22515 70608 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin 400 1.0 1.6791e-01 6.0 0.00e+00 0.0 3.7e+05 1.6e+04 0.0e+00 0 0 62 54 0 2 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd 400 1.0 1.0057e+00 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> PCApply 402 1.0 2.9638e-01 3.6 1.05e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 22490 70608 0 0.00e+00 0 0.00e+00 100
Most of the MatMult time is attributed to VecScatterEnd here. Can you share a run of the same total problem size on 8 ranks (one rank per GPU)?
From the other log file (10x bigger problem):
> --- Event Stage 2: KSP Solve only
>
> MatMult 400 1.0 9.4001e+00 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00 2 55 62 54 0 65 91100100 0 721451 0 0 0.00e+00 0 0.00e+00 100
Similar bandwidth
> MatView 2 1.0 4.4729e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSolve 2 1.0 1.3945e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03 2 60 62 54 60 100100100100100 536128 2881308 0 0.00e+00 0 0.00e+00 100
I think the GPU Mflop/s number here is nonsense, but if it were accurate (and at about 6 bytes/flop average), this would be 2.88*6 = 17.28 TB/s. The marketing material says 3.2 TB/s per (dual-chiplet) device, or 12.8 TB/s for the node. Those marketing numbers are always nonsense, but the bandwidth implied by our GPU Mflop/s exceeds even that.
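Spelled out (the 6 bytes/flop average is my rough guess for this CG/Jacobi solve; 4 MI250X per node at the advertised 3.2 TB/s each):

# Sanity check on the KSPSolve "GPU Mflop/s" column for the 10x problem.
gpu_rate_tflops = 2.881308    # 2881308 Mflop/s from the log
bytes_per_flop  = 6.0         # rough average for this CG + Jacobi solve (assumed)
implied_tbs     = gpu_rate_tflops * bytes_per_flop   # ~17.3 TB/s

advertised_tbs  = 4 * 3.2     # 4 MI250X per node at the advertised 3.2 TB/s each

print(f"implied {implied_tbs:.2f} TB/s vs {advertised_tbs:.1f} TB/s advertised peak")
# Exceeding the theoretical peak is a sign the GPU-timed rate is not reliable.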
> SFPack 400 1.0 2.4445e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFUnpack 400 1.0 1.2255e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecTDot 802 1.0 2.7256e+00 1.9 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02 0 2 0 0 40 14 3 0 0 67 78523 335526 0 0.00e+00 0 0.00e+00 100
> VecNorm 402 1.0 1.9145e+00 3.7 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 6 1 0 0 33 56035 533339 0 0.00e+00 0 0.00e+00 100
> VecCopy 4 1.0 6.3156e-03 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecSet 4 1.0 3.8228e-0315.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 800 1.0 9.0587e-0111.1 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 6 3 0 0 0 235676 444654 0 0.00e+00 0 0.00e+00 100
> VecAYPX 398 1.0 1.9393e+0029.6 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 6 1 0 0 0 54767 65448 0 0.00e+00 0 0.00e+00 100
So if we look at GPU flops, this is around 3.5 TB/s for VecAXPY, which still seems low. Weird that it's now faster than VecAYPX (I don't trust these timings).
> VecPointwiseMult 402 1.0 3.5580e-01 6.2 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 150758 318605 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin 400 1.0 1.3900e+0028.9 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00 0 0 62 54 0 7 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd 400 1.0 5.8686e+00 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
Still, more than half of the MatMult time is attributed to VecScatterEnd.
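Roughly, using the max times from the two "KSP Solve only" stages (these are maxima over ranks, so the attribution is approximate):

# Share of MatMult max time attributed to VecScatterEnd in each run.
for label, scatter_end_s, matmult_s in [("smaller problem", 1.0057, 1.3375),
                                        ("10x problem",     5.8686, 9.4001)]:
    print(f"{label}: {100 * scatter_end_s / matmult_s:.0f}% of MatMult time in VecScatterEnd")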
> PCApply 402 1.0 3.5612e-01 6.1 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 150622 318605 0 0.00e+00 0 0.00e+00 100