[petsc-dev] Kokkos/Crusher perforance

Mark Adams mfadams at lbl.gov
Sat Jan 22 10:03:57 CST 2022


Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End(), right?
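
For context, here is a minimal sketch of the pattern I mean (the kernel and
the 2n flop count are made up for illustration; PetscLogGpuTimeBegin/End and
PetscLogGpuFlops are the actual calls):

#include <petscsys.h>

static PetscErrorCode MyGpuVecOp(PetscInt n)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);  /* start the GPU timer */
  /* ... launch the device kernel for the operation here ... */
  ierr = PetscLogGpuFlops(2.0*n);CHKERRQ(ierr); /* count 2n GPU flops, inside the timed region */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);    /* stop the GPU timer */
  PetscFunctionReturn(0);
}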

On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Mark,
>
>   Fix the logging before you run more. It will help show the true
> disparity between the MatMult and the vector ops.
>
>
> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> Here is one with 2M / GPU. Getting better.
>
> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>    Matt is correct, vectors are way too small.
>>
>>    BTW: Now would be a good time to run some of the Report I benchmarks
>> on Crusher to get a feel for the kernel launch times and performance on
>> VecOps.
>>
>>    Also Report 2.
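>>
>>    Something like this quick VecAXPY timing loop would give a first look
>> (a sketch of the idea, not the actual Report benchmark; run it with
>> -vec_type kokkos and sweep -n):
>>
>> #include <petscvec.h>
>> #include <petsctime.h>
>>
>> int main(int argc, char **argv)
>> {
>>   Vec            x, y;
>>   PetscInt       i, n = 1048576;             /* local size per rank; sweep this */
>>   PetscReal      nrm;
>>   PetscLogDouble t0, t1;
>>   PetscErrorCode ierr;
>>
>>   ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>   ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
>>   ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
>>   ierr = VecSetSizes(x, n, PETSC_DECIDE);CHKERRQ(ierr);
>>   ierr = VecSetFromOptions(x);CHKERRQ(ierr); /* -vec_type kokkos selects the GPU back end */
>>   ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
>>   ierr = VecSet(x, 1.0);CHKERRQ(ierr);
>>   ierr = VecSet(y, 2.0);CHKERRQ(ierr);
>>   ierr = VecAXPY(y, 3.0, x);CHKERRQ(ierr);   /* warm-up; the first call pays one-time setup costs */
>>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>>   for (i = 0; i < 100; i++) { ierr = VecAXPY(y, 3.0, x);CHKERRQ(ierr); }
>>   ierr = VecNorm(y, NORM_1, &nrm);CHKERRQ(ierr); /* host needs the result, so this drains queued kernels */
>>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>>   ierr = PetscPrintf(PETSC_COMM_WORLD, "n=%d: %g sec/VecAXPY\n", (int)n, (double)((t1 - t0)/100.));CHKERRQ(ierr);
>>   ierr = VecDestroy(&x);CHKERRQ(ierr);
>>   ierr = VecDestroy(&y);CHKERRQ(ierr);
>>   return PetscFinalize();
>> }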
>>
>>   Barry
>>
>>
>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X, or is it
>>> MI200?).
>>> This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI
>>> are similar (mat-vec is a little faster without it; the total is about
>>> the same, call it noise).
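>>>
>>> (For reference, the run is roughly the following; the exact mesh and
>>> refinement options are from memory and approximate:
>>> srun -n 64 ./ex13 -dm_plex_dim 3 -petscspace_degree 2 -dm_vec_type kokkos
>>> -dm_mat_type aijkokkos -ksp_type cg -pc_type jacobi -log_view)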
>>>
>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all
>>> 64 cores on the node, than when using 1 core/GPU. With the same size
>>> problem, of course.
>>> I was thinking MatMult should be faster with just one MPI process. Oh
>>> well, worry about that later.
>>>
>>> The bigger problem, and I have observed this to some extent with the
>>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations
>>> are expensive or crazy expensive.
>>> You can see from the attached output and the times here that the solve is
>>> dominated by the not-mat-vec operations:
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*
>>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 668874       0      0 0.00e+00    0 0.00e+00 100
>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*
>>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 208923   1094405      0 0.00e+00    0 0.00e+00 100
>>>
>>> Notes about the flop counters here:
>>> * MatMult flops are not logged as GPU flops, but something is logged
>>> nonetheless.
>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we
>>> are at < 1%.
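>>>
>>> (Back-of-envelope on that 5x, if I am reading the columns right: the
>>> total rate uses the whole solve time, 208923 Mflop/s * 4.4173 s ~ 9.2e+11
>>> flops summed over the 64 ranks, while the GPU rate uses only the GPU
>>> time, 9.2e+11 / 1.09e+12 flop/s ~ 0.84 s. So the GPUs are busy for only
>>> about 19% of the solve; the rest is presumably launch latency, host-side
>>> work, and communication.)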
>>>
>>
>> This looks complicated, so just a single remark:
>>
>> My understanding of the benchmarking of vector ops led by Hannah was that
>> you needed to be much
>> bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs
>> I would think you would be
>> at 10% of peak or something right off the bat at these sizes. Barry, is
>> that right?
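>>
>> A rough worked example of why (assuming ~1.6 TB/s of HBM bandwidth per
>> GCD, which I have not checked): a VecAXPY on 2M local doubles moves about
>> 3 * 2e6 * 8 bytes = 48 MB, or ~30 microseconds at full bandwidth, so
>> kernel launch overheads of roughly 10 microseconds are already a large
>> fraction of each operation, and a dot product adds a reduction and host
>> synchronization on top. Much larger local sizes are needed before those
>> costs amortize.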
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Anyway, not sure how to proceed, but I thought I would share.
>>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>>
>>> Mark
>>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>>
>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
>
>
>