[petsc-dev] Kokkos/Crusher performance
Jed Brown
jed at jedbrown.org
Tue Jan 25 08:44:00 CST 2022
Mark Adams <mfadams at lbl.gov> writes:
> adding Suyash,
>
> I found the/a problem. Using ex56, which has a crappy decomposition, one
> MPI process per GPU is much faster than using 8 per GPU (64 total). (I am
> looking at ex13 to see how much of this is due to the decomposition.)
> If you use only 8 processes, it seems that all 8 are put on the first GPU,
> but adding -c8 seems to fix this.
> Now the numbers are looking reasonable.
Hah, we need -log_view to report the bus ID for each GPU so we don't spend another day of mailing list traffic identifying which GPU each rank is actually using.
This looks to be 2-3x the performance of Spock.
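Until -log_view reports it, a standalone check along these lines makes the binding visible. This is only a sketch (plain MPI plus the HIP runtime, nothing PETSc-specific): each rank prints how many GCDs it can see and the PCI bus ID of its current device, so if every rank prints the same bus ID they are all sharing one GCD.

/* Sketch, not PETSc: each rank reports how many GCDs it can see and the
   PCI bus ID of its current device.  If every rank prints the same bus ID,
   they are all landing on the same GCD. */
#include <stdio.h>
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv)
{
  int  rank, ndev, dev;
  char busid[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  hipGetDeviceCount(&ndev);   /* GCDs visible to this rank    */
  hipGetDevice(&dev);         /* current (default) device     */
  hipDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);
  printf("rank %d: %d visible device(s), current device %d, bus ID %s\n",
         rank, ndev, dev, busid);
  MPI_Finalize();
  return 0;
}

(Compile with the system MPI compiler wrapper against the HIP runtime; the exact modules and flags on Crusher are omitted here.)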
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
> --- Event Stage 2: Solve
>
> BuildTwoSided 1 1.0 9.1706e-05 1.6 0.00e+00 0.0 5.6e+01 4.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatMult 200 1.0 6.7831e-01 1.0 4.91e+10 1.0 1.1e+04 6.6e+04 1.0e+00 9 92 99 79 0 71 92100100 0 579635 1014212 1 2.04e-04 0 0.00e+00 100
GPU compute bandwidth of around 6 TB/s is okay, but it's disappointing that communication is so expensive.
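(For the arithmetic: the GPU rate above is about 1.0e6 Mflop/s, i.e. roughly 1 Tflop/s aggregate, and AIJ SpMV moves on the order of 6 bytes per flop (8-byte value plus 4-byte column index per nonzero, 2 flops per nonzero, ignoring vector traffic), which works out to about 6 TB/s across the 8 GCDs.)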
> MatView 1 1.0 7.8531e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> KSPSolve 1 1.0 9.4550e-01 1.0 5.31e+10 1.0 1.1e+04 6.6e+04 6.0e+02 12100 99 79 94 100100100100100 449667 893741 1 2.04e-04 0 0.00e+00 100
> PCApply 201 1.0 1.6966e-01 1.0 3.09e+08 1.0 0.0e+00 0.0e+00 2.0e+00 2 1 0 0 0 18 1 0 0 0 14558 163941 0 0.00e+00 0 0.00e+00 100
> VecTDot 401 1.0 5.3642e-02 1.3 1.23e+09 1.0 0.0e+00 0.0e+00 4.0e+02 1 2 0 0 62 5 2 0 0 66 183716 353914 0 0.00e+00 0 0.00e+00 100
> VecNorm 201 1.0 2.2219e-02 1.1 6.17e+08 1.0 0.0e+00 0.0e+00 2.0e+02 0 1 0 0 31 2 1 0 0 33 222325 303155 0 0.00e+00 0 0.00e+00 100
> VecCopy 2 1.0 2.3551e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecSet 1 1.0 9.8740e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecAXPY 400 1.0 2.3017e-02 1.1 1.23e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 2 2 0 0 0 427091 514744 0 0.00e+00 0 0.00e+00 100
> VecAYPX 199 1.0 1.1312e-02 1.1 6.11e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 1 0 0 0 432323 532889 0 0.00e+00 0 0.00e+00 100
These two are finally about the same speed, but these numbers imply kernel overhead of about 57 µs (these operations do nothing besides run a single kernel, so the per-call time is essentially overhead).
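(That is 2.3017e-02 s / 400 calls, about 58 µs for VecAXPY, and 1.1312e-02 s / 199 calls, about 57 µs for VecAYPX.)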
> VecPointwiseMult 201 1.0 1.0471e-02 1.1 3.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 1 0 0 0 235882 290088 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin 200 1.0 1.8458e-01 1.1 0.00e+00 0.0 1.1e+04 6.6e+04 1.0e+00 2 0 99 79 0 19 0100100 0 0 0 1 2.04e-04 0 0.00e+00 0
> VecScatterEnd 200 1.0 1.9007e-02 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
I'm curious how these change with problem size. (To what extent are we latency vs bandwidth limited?)
> SFSetUp 1 1.0 1.3015e-03 1.3 0.00e+00 0.0 1.1e+02 1.7e+04 1.0e+00 0 0 1 0 0 0 0 1 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFPack 200 1.0 1.7309e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 18 0 0 0 0 0 0 1 2.04e-04 0 0.00e+00 0
> SFUnpack 200 1.0 2.3165e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0