[petsc-dev] Kokkos/Crusher performance
Jed Brown
jed at jedbrown.org
Tue Jan 25 08:44:00 CST 2022
Mark Adams <mfadams at lbl.gov> writes:
> adding Suyash,
>
> I found the/a problem. Using ex56, which has a crappy decomposition, one
> MPI process per GPU is much faster than using 8 per GPU (64 total). (I am
> looking at ex13 to see how much of this is due to the decomposition.)
> If you use only 8 processes, it seems that all 8 are put on the first GPU,
> but adding -c8 seems to fix this.
> Now the numbers are looking reasonable.
Hah, we need -log_view to report the bus ID for each GPU so we don't spend another day of mailing list traffic identifying which GPU each rank is actually using.
This looks to be 2-3x the performance of Spock.
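Until -log_view reports it, a standalone check along these lines makes the binding visible. This is only a sketch (plain MPI plus the HIP runtime, nothing PETSc-specific): each rank prints how many GCDs it can see and the PCI bus ID of its current device, so if every rank prints the same bus ID they are all sharing one GCD.

/* Sketch, not PETSc: each rank reports how many GCDs it can see and the
   PCI bus ID of its current device.  If every rank prints the same bus ID,
   they are all landing on the same GCD. */
#include <stdio.h>
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv)
{
  int  rank, ndev, dev;
  char busid[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  hipGetDeviceCount(&ndev);   /* GCDs visible to this rank    */
  hipGetDevice(&dev);         /* current (default) device     */
  hipDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);
  printf("rank %d: %d visible device(s), current device %d, bus ID %s\n",
         rank, ndev, dev, busid);
  MPI_Finalize();
  return 0;
}

(Compile with the system MPI compiler wrapper against the HIP runtime; the exact modules and flags on Crusher are omitted here.)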
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
> --- Event Stage 2: Solve
>
> BuildTwoSided 1 1.0 9.1706e-05 1.6 0.00e+00 0.0 5.6e+01 4.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatMult 200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 1.0e+00 9 92 99 79 0 71 92100100 0 579635 1014212 1 2.04e-04 0 0.00e+00 100
GPU compute bandwidth of around 6 TB/s is okay, but it's disappointing that communication is so expensive.
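(For the arithmetic: the GPU rate above is about 1.0e6 Mflop/s, i.e. roughly 1 Tflop/s aggregate, and AIJ SpMV moves on the order of 6 bytes per flop (8-byte value plus 4-byte column index per nonzero, 2 flops per nonzero, ignoring vector traffic), which works out to about 6 TB/s across the 8 GCDs.)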
> MatView 1 1.0 7.8531e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSolve 1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 6.0e+02 12100 99 79 94 100100100100100 449667 893741 1 2.04e-04 0 0.00e+00 100
> PCApply 201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 2.0e+00 2 1 0 0 0 18 1 0 0 0 14558 163941 0 0.00e+00 0 0.00e+00 100
> VecTDot 401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 4.0e+02 1 2 0 0 62 5 2 0 0 66 183716 353914 0 0.00e+00 0 0.00e+00 100
> VecNorm 201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 2.0e+02 0 1 0 0 31 2 1 0 0 33 222325 303155 0 0.00e+00 0 0.00e+00 100
> VecCopy 2 1.0 2.3551e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecSet 1 1.0 9.8740e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 2 2 0 0 0 427091 514744 0 0.00e+00 0 0.00e+00 100
> VecAYPX 199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 1 0 0 0 432323 532889 0 0.00e+00 0 0.00e+00 100
These two are finally about the same speed, but these numbers imply kernel overhead of about 57 µs (these operations do nothing besides run a single kernel, so the per-call time is essentially overhead).
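(That is 2.3017e-02 s / 400 calls, about 58 µs for VecAXPY, and 1.1312e-02 s / 199 calls, about 57 µs for VecAYPX.)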
> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 1 0 0 0 235882 290088 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin 200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 1.0e+00 2 0 99 79 0 19 0100100 0 0 0 1 2.04e-04 0 0.00e+00 0
> VecScatterEnd 200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
I'm curious how these change with problem size. (To what extent are we latency vs bandwidth limited?)
> SFSetUp 1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04 1.0e+00 0 0 1 0 0 0 0 1 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFPack 200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 18 0 0 0 0 0 0 1 2.04e-04 0 0.00e+00 0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0