[petsc-dev] Kokkos/Crusher performance

Barry Smith bsmith at petsc.dev
Fri Jan 21 20:08:36 CST 2022


  Junchao, Mark,

     Some of the logging information is nonsensical: MatMult says all flops are done on the GPU (last column), but the GPU flop rate is zero.

     It looks like MatMult_SeqAIJKokkos() is missing PetscLogGpuTimeBegin()/End(); in fact, all the operations in aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0 GPU flop rate. Can this be fixed ASAP?
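     For reference, this is essentially the pattern the CUDA backend already uses: wrap the kernel launch in the GPU timer and log the flops explicitly. A minimal sketch with the public logging API (MatMultLogged_Sketch(), DeviceSpMV(), and nz are placeholder names, not the actual aijkok.kokkos.cxx code):

         #include <petscsys.h>

         /* sketch: GPU-side time and flop logging around a sparse mat-vec */
         static PetscErrorCode MatMultLogged_Sketch(PetscLogDouble nz)
         {
           PetscErrorCode ierr;

           PetscFunctionBegin;
           ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
           /* DeviceSpMV(...);  <- the Kokkos spmv launch would go here */
           ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
           ierr = PetscLogGpuFlops(2.0*nz);CHKERRQ(ierr); /* 2 flops per stored nonzero */
           PetscFunctionReturn(0);
         }

     With the Begin/End pair in place the GPU Mflop/s column has a nonzero time to divide by, so the rate becomes meaningful.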

     Regarding VecOps, it sure looks like the kernel launches are killing performance.
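     A quick sanity check from the KSPSolve and MatMult lines quoted below:

         KSPSolve:  4.4173 s  (2 solves)
         MatMult:   1.2507 s  (400 calls)
         remainder: ~3.17 s, i.e. roughly 70% of the solve time is spent in
                    vector operations and reductions, not in the mat-vec.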

           But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance there a great deal. It would be good to see a single-MPI-rank job for comparison, to see the performance without the MPI overhead.
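           For intuition, here is the shape of a dot product on distributed GPU vectors, schematically in plain C/MPI (DeviceDot() is a hypothetical stand-in for the device kernel, not a PETSc or Kokkos function):

               #include <mpi.h>

               /* hypothetical device kernel: reduces x'y on the GPU and
                  copies the single scalar result back to the host */
               extern double DeviceDot(const double *d_x, const double *d_y, int n);

               /* schematic of VecTDot on distributed GPU vectors: every call
                  pays a kernel launch, a device sync for the scalar, and a
                  blocking all-reduce across ranks */
               double DistributedDot(const double *d_x, const double *d_y,
                                     int n, MPI_Comm comm)
               {
                 double local, global;
                 local = DeviceDot(d_x, d_y, n);
                 MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
                 return global;
               }

           CG pays for a couple of these (plus a norm) every iteration, so on a single rank the MPI_Allreduce cost disappears and the launch/sync latency can be measured by itself.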

> On Jan 21, 2022, at 6:41 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it; the total is about the same, call it noise).
> 
> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
> 
> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive.
> You can see from the attached output, and from the times here, that the solve is dominated by the non-mat-vec operations:
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405      0 0.00e+00    0 0.00e+00 100
> 
> Notes about the flop counters here:
> * The MatMult flops are not logged as GPU flops, but something is logged nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so at the ~1.1 Tflop/s GPU rate reported for KSPSolve we are at < 1%.
> 
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
> 
> Mark
> 
> 
> <jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt>
