[petsc-dev] Kokkos/Crusher perforance
Junchao Zhang
junchao.zhang at gmail.com
Sun Jan 23 21:01:15 CST 2022
On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:
>
>
>
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>> The GPU flop rate (when 100 percent flops on the GPU) should always be
>> higher than the overall flop rate (the previous column). For large problems
>> they should be similar, for small problems the GPU one may be much higher.
>>
>> If the CPU one is higher (when 100 percent flops on the GPU) something
>> must be wrong with the logging. I looked at the code for the two cases and
>> didn't see anything obvious.
>>
>> Junchao and Jacob,
>> I think some of the timing code in the Kokkos interface is wrong.
>>
>> * The PetscLogGpuTimeBegin/End should be inside the viewer access
>> code not outside it. (The GPU time is an attempt to best time the kernels,
>> not other processing around the use of the kernels, that other stuff is
>> captured in the general LogEventBegin/End.
>>
> What about potential host to device memory copy before calling a kernel?
Should we count it in the kernel time?
Good point
>
>> * The use of WaitForKokkos() is confusing and seems inconsistent.
>>
> I need to have a look. Until now, I have not paid much attention to kokkos
> profiling.
>
>> -For example it is used in VecTDot_SeqKokkos() which I would
>> think has a barrier anyways because it puts a scalar result into update?
>> -Plus PetscLogGpuTimeBegin/End is suppose to already have
>> suitable system (that Hong added) to ensure the kernel is complete; reading
>> the manual page and looking at Jacobs cupmcontext.hpp it seems to be there
>> so I don't think WaitForKokkos() is needed in most places (or is Kokkos
>> asynchronous and needs this for correctness?)
>> But these won't explain the strange result of overall flop rate being
>> higher than GPU flop rate.
>>
>> Barry
>>
>>
>>
>>
>>
>> On Jan 22, 2022, at 11:44 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> I am getting some funny timings and I'm trying to figure it out.
>> I figure the gPU flop rates are bit higher because the timers are inside
>> of the CPU timers, but *some are a lot bigger or inverted*
>>
>> --- Event Stage 2: KSP Solve only
>>
>> MatMult 400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04
>> 0.0e+00 2 55 62 54 0 68 91100100 0 671849 857147 0 0.00e+00 0
>> 0.00e+00 100
>> MatView 2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> KSPSolve 2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04
>> 1.2e+03 2 60 62 54 60 100100100100100 512399 804048 0 0.00e+00 0
>> 0.00e+00 100
>> SFPack 400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> VecTDot 802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00
>> 8.0e+02 0 2 0 0 40 13 3 0 0 67 *69996 488328* 0 0.00e+00
>> 0 0.00e+00 100
>> VecNorm 402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00
>> 4.0e+02 0 1 0 0 20 6 1 0 0 33 54744 571507 0 0.00e+00 0
>> 0.00e+00 100
>> VecCopy 4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> VecAXPY 800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 2 0 0 0 6 3 0 0 0 *247787 448304* 0 0.00e+00
>> 0 0.00e+00 100
>> VecAYPX 398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 1 0 0 0 5 1 0 0 0 63107 77030 0 0.00e+00 0
>> 0.00e+00 100
>> VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 2 1 0 0 0 138502 262413 0 0.00e+00 0
>> 0.00e+00 100
>> VecScatterBegin 400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04
>> 0.0e+00 0 0 62 54 0 5 0100100 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> VecScatterEnd 400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> PCApply 402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 2 1 0 0 0 138396 262413 0 0.00e+00 0
>> 0.00e+00 100
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>
>> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End()
>>>> right?
>>>>
>>> No, PetscLogGpuTime() does not know the flops of the caller.
>>>
>>>
>>>>
>>>> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Mark,
>>>>>
>>>>> Fix the logging before you run more. It will help with seeing the
>>>>> true disparity between the MatMult and the vector ops.
>>>>>
>>>>>
>>>>> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> Here is one with 2M / GPU. Getting better.
>>>>>
>>>>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>> Matt is correct, vectors are way too small.
>>>>>>
>>>>>> BTW: Now would be a good time to run some of the Report I
>>>>>> benchmarks on Crusher to get a feel for the kernel launch times and
>>>>>> performance on VecOps.
>>>>>>
>>>>>> Also Report 2.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knepley at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>>> I am looking at performance of a CG/Jacobi solve on a 3D Q2
>>>>>>> Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is
>>>>>>> it MI200?).
>>>>>>> This is with a 16M equation problem. GPU-aware MPI and non GPU-aware
>>>>>>> MPI are similar (mat-vec is a little faster w/o, the total is about the
>>>>>>> same, call it noise)
>>>>>>>
>>>>>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is
>>>>>>> all 64 cores on the node, then when using 1 core/GPU. With the same size
>>>>>>> problem of course.
>>>>>>> I was thinking MatMult should be faster with just one MPI process.
>>>>>>> Oh well, worry about that later.
>>>>>>>
>>>>>>> The bigger problem, and I have observed this to some extent with the
>>>>>>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are
>>>>>>> expensive or crazy expensive.
>>>>>>> You can see (attached) and the times here that the solve is
>>>>>>> dominated by not-mat-vec:
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>>> Event Count Time (sec) Flop
>>>>>>> --- Global --- --- Stage ---- *Total GPU * -
>>>>>>> CpuToGpu - - GpuToCpu - GPU
>>>>>>> Max Ratio Max Ratio Max Ratio Mess
>>>>>>> AvgLen Reduct %T %F %M %L %R %T %F %M %L %R *Mflop/s Mflop/s*
>>>>>>> Count Size Count Size %F
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>> 17:15 main=
>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult
>>>>>>> 400" jac_out_00*5_8_gpuawaremp*
>>>>>>> MatMult 400 1.0 *1.2507e+00* 1.3 1.34e+10 1.1 3.7e+05
>>>>>>> 1.6e+04 0.0e+00 1 55 62 54 0 27 91100100 0 *668874 0*
>>>>>>> 0 0.00e+00 0 0.00e+00 100
>>>>>>> 17:15 main=
>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve
>>>>>>> 2" jac_out_001*_5_8_gpuawaremp*
>>>>>>> KSPSolve 2 1.0 *4.4173e+00* 1.0 1.48e+10 1.1 3.7e+05
>>>>>>> 1.6e+04 1.2e+03 4 60 62 54 61 100100100100100 *208923 1094405*
>>>>>>> 0 0.00e+00 0 0.00e+00 100
>>>>>>>
>>>>>>> Notes about flop counters here,
>>>>>>> * that MatMult flops are not logged as GPU flops but something is
>>>>>>> logged nonetheless.
>>>>>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>>>>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so
>>>>>>> we are at < 1%.
>>>>>>>
>>>>>>
>>>>>> This looks complicated, so just a single remark:
>>>>>>
>>>>>> My understanding of the benchmarking of vector ops led by Hannah was
>>>>>> that you needed to be much
>>>>>> bigger than 16M to hit peak. I need to get the tech report, but on 8
>>>>>> GPUs I would think you would be
>>>>>> at 10% of peak or something right off the bat at these sizes. Barry,
>>>>>> is that right?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>> Anway, not sure how to proceed but I thought I would share.
>>>>>>> Maybe ask the Kokkos guys if the have looked at Crusher.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>> experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>
>>>>>>
>>>>>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
>>>>>
>>>>>
>>>>> <jac_out_001_kokkos_Crusher_5_8_notpl.txt>
>> <jac_out_001_kokkos_Crusher_6_8_notpl.txt>
>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220123/2e24c6f3/attachment-0001.html>
More information about the petsc-dev
mailing list