[petsc-dev] Kokkos/Crusher performance

Mon Jan 24 13:46:23 CST 2022

Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
run this on one processor to get cleaner numbers.
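
As a rough aid for reading rows like the VecNorm line quoted further down, here is a minimal Python sketch that pulls a few rates out of one event row. The column positions are an assumption based on the usual -log_view layout with GPU columns (count, time max/ratio, flop max/ratio, ..., total Mflop/s, GPU Mflop/s); the helper names parse_row and summarize are made up for illustration. The "implied ranks" figure additionally assumes that the total rate is the flops summed over ranks divided by the max time.

```python
# Sketch only: column positions are assumed from the usual -log_view layout.
ROW = ("VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 "
       "0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100")


def parse_row(row):
    """Split one event row into the fields used below (hypothetical helper)."""
    f = row.split()
    return {
        "event": f[0],
        "count": int(f[1]),
        "time_max_s": float(f[3]),     # max time over ranks
        "time_ratio": float(f[4]),     # assumed to be max/min time over ranks
        "flop_max": float(f[5]),       # max flops on any one rank
        "total_mflops": float(f[20]),  # rate over the full event time
        "gpu_mflops": float(f[21]),    # rate over GPU kernel time only
    }


def summarize(r):
    """Derive a few sanity-check numbers from a parsed row."""
    per_rank = r["flop_max"] / r["time_max_s"] / 1e6
    return {
        "per_rank_mflops": per_rank,
        # Assumes total rate = (sum of flops over ranks) / (max time).
        "implied_ranks": r["total_mflops"] / per_rank,
        # How much faster the kernel alone runs than the whole event.
        "gpu_over_total": r["gpu_mflops"] / r["total_mflops"],
        # Fastest rank's time, if the ratio column is max/min.
        "time_min_s": r["time_max_s"] / r["time_ratio"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_row(ROW)).items():
        print(f"{key:>16}: {value:.4g}")
```

A large gap between the GPU rate and the total rate, together with a time ratio well above 1, would point at time spent outside the kernel (for example waiting in the reduction) rather than at the kernel itself.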

Is there a designated ECP technical support contact?

On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   I think you should contact the Crusher ECP technical support team and
> tell them you are getting dismal performance and ask if you should expect
> better. Don't waste time flogging a dead horse.
>
> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>>
>>
>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>>
>>>
>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>> Mark, I think you can benchmark individual vector operations, and once
>>>> we get reasonable profiling results, we can move to solvers etc.
>>>>
>>>
>>> Can you suggest a code to run or are you suggesting making a vector
>>> benchmark code?
>>>
>> Make a vector benchmark code, testing vector operations that would be
>> used in your solver.
>> Also, we can run MatMult() to see if the profiling result is reasonable.
>> Only once we get some solid results on basic operations is it useful to
>> run big codes.
>>
>
> So we have to make another throw-away code? Why not just look at the
> vector ops in Mark's actual code?
>
>    Matt
>
>
>>
>>>
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>>   Here, except for VecNorm, the GPU is used effectively in that most of
>>>>>> the time is spent doing real work on the GPU
>>>>>>
>>>>>> VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 0.00e+00 100
>>>>>>
>>>>>> Even the dots are very effective; only the VecNorm flop rate over the
>>>>>> full time is much, much lower than that of VecDot. Is that somehow due
>>>>>> to the use of GPU or CPU MPI in the allreduce?
>>>>>>
>>>>>
>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate is
>>>>> about the same as the other vec ops. I don't know what to make of that.
>>>>>
>>>>> But Crusher is clearly not crushing it.
>>>>>
>>>>> Junchao: Perhaps we should ask Kokkos if they have any experience with
>>>>> Crusher that they can share. They could very well find some low-level magic.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Mark, can we compare with Spock?
>>>>>>>
>>>>>>
>>>>>>  Looks much better. This puts two processes/GPU because there are
>>>>>> only 4.
>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>>
>>>>>>
>>>>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
>
>