[petsc-dev] Kokkos/Crusher performance

Mon Jan 24 12:55:05 CST 2022

On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> Mark, I think you can benchmark individual vector operations, and once we
> get reasonable profiling results, we can move to solvers etc.
>

Can you suggest a code to run, or are you suggesting that we write a vector
benchmark code?
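
Whichever code is used, the per-operation lines from -log_view are what get compared, so it may help to split one into named columns. Below is a minimal sketch using the VecNorm line quoted further down. It assumes the usual GPU-enabled -log_view column order (count, time, flop, messages, global and stage percentages, total and GPU Mflop/s, transfer counts and sizes, GPU %F). The field names and the helper parse_event_line are hypothetical labels, not part of PETSc.

```python
# Hypothetical column labels, assuming the GPU-enabled -log_view layout:
# count, time, flop, messages, global %, stage %, flop rates, transfers, GPU %F.
FIELDS = [
    "count_max", "count_ratio",
    "time_max", "time_ratio",
    "flop_max", "flop_ratio",
    "mess", "avg_len", "reduct",
    "g_pT", "g_pF", "g_pM", "g_pL", "g_pR",   # global percentages
    "s_pT", "s_pF", "s_pM", "s_pL", "s_pR",   # stage percentages
    "total_mflops", "gpu_mflops",
    "c2g_count", "c2g_size", "g2c_count", "g2c_size",
    "gpu_pF",
]


def parse_event_line(line):
    """Split one event line into (event name, {field label: float})."""
    tokens = line.split()
    name, values = tokens[0], [float(t) for t in tokens[1:]]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return name, dict(zip(FIELDS, values))


# The VecNorm line from the quoted log, with the mail quoting removed.
LINE = (
    "VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 "
    "4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 "
    "0.00e+00 100"
)

name, f = parse_event_line(LINE)

# How much faster the GPU-only rate is than the rate over the full event time.
rate_ratio = f["gpu_mflops"] / f["total_mflops"]

# Assumption: total Mflop/s = (flops summed over ranks) / (max time), so this
# estimates the number of ranks when the flop count is balanced (ratio 1.0).
implied_ranks = f["total_mflops"] * 1e6 * f["time_max"] / f["flop_max"]

print(f"{name}: GPU/total rate = {rate_ratio:.2f}, "
      f"time max/min = {f['time_ratio']}, implied ranks = {implied_ranks:.1f}")
```

Under those assumptions, the GPU-only rate is roughly 7.5 times the rate over the full event time, and the time ratio of 6.1 indicates a large spread between the slowest and fastest rank, which would be consistent with time spent waiting in the reduction rather than in the kernel.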

>
> --Junchao Zhang
>
>
> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   Here, except for VecNorm, the GPU is used effectively, in that most of
>>> the time is spent doing real work on the GPU.
>>>
>>> VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00
>>> 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0
>>> 0.00e+00 100
>>>
>>> Even the dots are very effective; only the VecNorm flop rate over the
>>> full time is much lower than that of VecDot. Is that somehow due to the
>>> use of GPU or CPU MPI in the allreduce?
>>>
>>
>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
>> about the same as the other vec ops. I don't know what to make of that.
>>
>> But Crusher is clearly not crushing it.
>>
>> Junchao: Perhaps we should ask Kokkos if they have any experience with
>> Crusher that they can share. They could very well find some low-level magic.
>>
>>
>>
>>>
>>>
>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>
>>>
>>>> Mark, can we compare with Spock?
>>>>
>>>
>>>  Looks much better. This puts two processes per GPU because there are only 4 GPUs.
>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>
>>>
>>>