[petsc-dev] Kokkos/Crusher performance

Junchao Zhang junchao.zhang at gmail.com
Sat Jan 22 10:10:45 CST 2022


On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <mfadams at lbl.gov> wrote:

> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End(), right?
>
No, PetscLogGpuTime() does not know the flops of the caller.
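
For context, a minimal sketch of the pattern under discussion, as I
understand it: PetscLogGpuTimeBegin()/End() only time the device work,
and the flops must be reported separately with PetscLogGpuFlops(). The
wrapper function and the 2*n flop count below are made up for
illustration.

  #include <petscsys.h>

  /* Hypothetical kernel wrapper: the GPU timer brackets the launch, and
     the flops are credited explicitly so they appear in the GPU Mflop/s
     column of -log_view. */
  static PetscErrorCode MyGpuKernelWrapper(PetscInt n)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
    /* ... launch a device kernel that does 2*n flops ... */
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
    ierr = PetscLogGpuFlops(2.0*n);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }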


>
> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Mark,
>>
>>   Fix the logging before you run more. It will help with seeing the true
>> disparity between the MatMult and the vector ops.
>>
>>
>> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Here is one with 2M / GPU. Getting better.
>>
>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>    Matt is correct, vectors are way too small.
>>>
>>>    BTW: Now would be a good time to run some of the Report I benchmarks
>>> on Crusher to get a feel for the kernel launch times and performance on
>>> VecOps.
>>>
>>>    Also Report 2.
>>>
>>>   Barry
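
A rough sketch of the kind of VecOp timing loop such a benchmark involves
(this is not the actual Report benchmark code; the -n option, iteration
count, and kokkos vector type below are illustrative assumptions): at
small sizes the time per iteration is dominated by kernel-launch latency
rather than memory bandwidth.

  #include <petscvec.h>
  #include <petsctime.h>

  int main(int argc, char **argv)
  {
    Vec            x, y;
    PetscInt       n = 1000000, i, its = 100;
    PetscScalar    dot;
    PetscLogDouble t0, t1;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
    ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
    ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
    ierr = VecSetFromOptions(x);CHKERRQ(ierr); /* e.g. -vec_type kokkos for the GPU */
    ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    ierr = VecSet(y, 2.0);CHKERRQ(ierr);
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (i = 0; i < its; i++) {
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);
      ierr = VecDot(x, y, &dot);CHKERRQ(ierr); /* the dot syncs host and device */
    }
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "n=%D: %g s/iteration\n", n, (double)((t1 - t0)/its));CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }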
>>>
>>>
>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>>>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>>>> MI200?).
>>>> This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware
>>>> MPI are similar (mat-vec is a little faster without it; the total is
>>>> about the same, call it noise).
>>>>
>>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all
>>>> 64 cores on the node, than when using 1 core/GPU, with the same size
>>>> problem of course.
>>>> I was thinking MatMult should be faster with just one MPI process. Oh
>>>> well, worry about that later.
>>>>
>>>> The bigger problem, and I have observed this to some extent with the
>>>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations
>>>> are expensive or crazy expensive.
>>>> You can see from the attached output, and from the times here, that the
>>>> solve is dominated by the not-mat-vec operations:
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
>>>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
>>>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405      0 0.00e+00    0 0.00e+00 100
>>>>
>>>> Notes about the flop counters here (a quick arithmetic check follows
>>>> the list):
>>>> * MatMult flops are not logged as GPU flops, but something is logged
>>>> nonetheless.
>>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>>>> are at < 1% of peak.
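
A quick check of the last two bullets, using only the numbers from the
two log lines above and the quoted 200 Tflop/s peak figure:

  #include <stdio.h>

  int main(void)
  {
    double total_rate = 208923.0;  /* KSPSolve total Mflop/s            */
    double gpu_rate   = 1094405.0; /* KSPSolve GPU Mflop/s              */
    double peak       = 200.0e6;   /* 200 Tflop/s node peak, in Mflop/s */

    printf("GPU / total rate: %.1fx\n", gpu_rate / total_rate);      /* ~5.2x  */
    printf("fraction of peak: %.2f%%\n", 100.0 * total_rate / peak); /* ~0.10% */
    return 0;
  }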
>>>>
>>>
>>> This looks complicated, so just a single remark:
>>>
>>> My understanding of the benchmarking of vector ops led by Hannah was
>>> that you needed to be much
>>> bigger than 16M to hit peak. I need to get the tech report, but on 8
>>> GPUs I would think you would be
>>> at 10% of peak or something right off the bat at these sizes. Barry, is
>>> that right?
>>>
>>>   Thanks,
>>>
>>>      Matt
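
For scale on the size argument: a back-of-envelope footprint for the run
above (16M total equations across 8 GPUs, FP64 assumed) works out to only
about 16 MB per vector per GPU:

  #include <stdio.h>

  int main(void)
  {
    double dofs_per_gpu  = 16.0e6 / 8.0;       /* ~2M dofs per GPU       */
    double bytes_per_vec = dofs_per_gpu * 8.0; /* 8 bytes per FP64 value */

    printf("%.0f MB per vector per GPU\n", bytes_per_vec / 1.0e6); /* ~16 MB */
    return 0;
  }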
>>>
>>>
>>>> Anyway, I am not sure how to proceed, but I thought I would share.
>>>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>>>
>>>> Mark
>>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>>
>>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
>>
>>
>>