[petsc-dev] Kokkos/Crusher performance

Matthew Knepley knepley at gmail.com
Fri Jan 21 18:58:22 CST 2022


On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfadams at lbl.gov> wrote:

> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian
> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
> MI200?).
> This is a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI
> perform similarly (mat-vec is a little faster without GPU-aware MPI, the
> total is about the same; call it noise).
>
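> Roughly, the runs look like this (from memory, so the exact mesh and
> refinement flags may be off):
>
>   mpiexec -n 64 ./ex13 -dm_plex_box_faces 4,4,4 -dm_refine 5 \
>     -petscspace_degree 2 -dm_mat_type aijkokkos -dm_vec_type kokkos \
>     -ksp_type cg -pc_type jacobi -log_view
>
> with -use_gpu_aware_mpi 0 for the non-GPU-aware case, and -n 8 (one rank
> per GPU) for the 1 core/GPU comparison below.
>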
> I found that MatMult was about 3x faster using 8 cores/GPU (that is, all 64
> cores on the node) than when using 1 core/GPU, with the same size problem
> of course.
> I was thinking MatMult should be faster with just one MPI process. Oh
> well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the
> Landau TS/SNES/GPU solver on the V100/A100s, is that the vector operations
> are expensive or crazy expensive.
> You can see from the attached log, and from the times here, that the solve
> is dominated by the not-mat-vec operations:
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405      0 0.00e+00    0 0.00e+00 100
>
> Notes about the flop counters here:
> * MatMult flops are not logged as GPU flops, but something is logged
> nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are
> at < 1% of peak.
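>
> (As a sanity check on that last number: the GPU rate logged for KSPSolve
> above is 1094405 Mflop/s, i.e., about 1.1 Tflop/s, and 1.1/200 is roughly
> 0.5%.)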
>

This looks complicated, so just a single remark:

My understanding of the benchmarking of vector ops led by Hannah was that
you needed to be much bigger than 16M to hit peak. I need to get the tech
report, but on 8 GPUs I would think you would be at 10% of peak or
something right off the bat at these sizes. Barry, is that right?
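
A rough estimate, assuming ~1.6 TB/s of HBM bandwidth per GCD (a number
worth checking): a VecAXPY on 16M/8 = 2M doubles per GPU moves about
3 * 2M * 8 bytes = 48 MB, or ~30 microseconds at full bandwidth. That is
the same order as a kernel launch, so at this size launch latency could
dominate before bandwidth is ever the limit.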

  Thanks,

     Matt


> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/