[petsc-dev] Kokkos/Crusher performance

Mon Jan 24 14:06:39 CST 2022

Also, do you guys have an OLCF liaison? That's actually your better bet if
you do.

Performance issues with ROCm/Kokkos are pretty common in apps besides just
PETSc. We have several teams actively working on rectifying this. However,
I think performance issues would be quicker to identify if we had a more
"official" and reproducible PETSc GPU benchmark. I've already said this to
some folks in this thread, and others have already commented on the
difficulty of such a task. Hopefully I will have more time soon to
illustrate what I am thinking.

On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychang48 at gmail.com> wrote:

> My name has been called.
>
> Mark, if you're having issues with Crusher, please contact Veronica
> Vergara (vergaravg at ornl.gov). You can cc me (justin.chang at amd.com) in
> those emails
>
> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>> run this on one processor to get cleaner numbers.
>>
>> Is there a designated ECP technical support contact?
>>
>>
>>    Mark, you've forgotten you work for DOE. There isn't a non-ECP
>> technical support contact.
>>
>>    But if this is an AMD machine then maybe contact Matt's student Justin
>> Chang?
>>
>>
>>
>>
>>
>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   I think you should contact the Crusher ECP technical support team and
>>> tell them you are getting dismal performance and ask if you should expect
>>> better. Don't waste time flogging a dead horse.
>>>
>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Mark, I think you can benchmark individual vector operations, and
>>>>>> once we get reasonable profiling results, we can move to solvers etc.
>>>>>>
>>>>>
>>>>> Can you suggest a code to run or are you suggesting making a vector
>>>>> benchmark code?
>>>>>
>>>> Make a vector benchmark code, testing vector operations that would be
>>>> used in your solver.
>>>> Also, we can run MatMult() to see if the profiling result is reasonable.
>>>> Only once we get some solid results on basic operations is it useful
>>>> to run big codes.
>>>>
>>>
>>> So we have to make another throw-away code? Why not just look at the
>>> vector ops in Mark's actual code?
>>>
>>>    Matt
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsmith at petsc.dev>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>   Here, except for VecNorm, the GPU is used effectively in that most
>>>>>>>> of the time is spent doing real work on the GPU.
>>>>>>>>
>>>>>>>> VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00
>>>>>>>> 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0
>>>>>>>> 0.00e+00    0 0.00e+00 100
>>>>>>>>
>>>>>>>> Even the dots are very effective; only the VecNorm flop rate over
>>>>>>>> the full time is much, much lower than the VecDot. Is that somehow due to
>>>>>>>> the use of the GPU or CPU MPI in the allreduce?
>>>>>>>>
>>>>>>>
>>>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate
>>>>>>> is about the same as the other vec ops. I don't know what to make of that.
>>>>>>>
>>>>>>> But Crusher is clearly not crushing it.
>>>>>>>
>>>>>>> Junchao: Perhaps we should ask Kokkos if they have any experience
>>>>>>> with Crusher that they can share. They could very well find some low level
>>>>>>> magic.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Mark, can we compare with Spock?
>>>>>>>>>
>>>>>>>>
>>>>>>>>  Looks much better. This puts two processes/GPU because there are
>>>>>>>> only 4.
>>>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
>>>
>>>
>>>
>>
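
For readers following the VecNorm discussion quoted above, here is a minimal
sketch of how that -log_view row could be unpacked to compare the flop rate
over the full event time with the rate on the GPU. The column names in FIELDS
are hypothetical labels of my own, and the column order is an assumption based
on the usual layout of a GPU-enabled PETSc -log_view table; please check it
against the header of your own log before relying on it.

```python
# Hypothetical labels for the 26 numeric columns of a GPU-enabled
# -log_view event row (assumed order; verify against your log header).
FIELDS = [
    "count_max", "count_ratio",
    "time_max", "time_ratio",
    "flop_max", "flop_ratio",
    "mess", "avg_len", "reduct",
    "global_T", "global_F", "global_M", "global_L", "global_R",
    "stage_T", "stage_F", "stage_M", "stage_L", "stage_R",
    "total_mflops", "gpu_mflops",
    "cpu_to_gpu_count", "cpu_to_gpu_size",
    "gpu_to_cpu_count", "gpu_to_cpu_size",
    "gpu_flop_percent",
]


def parse_log_view_row(row: str) -> dict:
    """Split one event row into the event name plus named float columns."""
    tokens = row.split()
    name, values = tokens[0], tokens[1:]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    parsed = {"event": name}
    parsed.update({field: float(v) for field, v in zip(FIELDS, values)})
    return parsed


# The VecNorm row from the thread, with the email line wrapping undone.
ROW = (
    "VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 "
    "0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 "
    "0.00e+00 0 0.00e+00 100"
)

row = parse_log_view_row(ROW)

# Ratio of the rate over the whole event to the rate on the GPU.
gpu_fraction = row["total_mflops"] / row["gpu_mflops"]

print(f"{row['event']}: time ratio (max/min over ranks) = {row['time_ratio']}")
print(f"total / GPU Mflop/s = {gpu_fraction:.3f}")
```

If the column assumption holds, the total rate is only a small fraction of the
GPU rate for this event, and the time ratio across ranks is large. That would
be consistent with most of the VecNorm time going to something other than the
kernel itself, such as waiting in the reduction, but the row alone cannot say
which.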