[petsc-users] Unexpected performance losses switching to COO interface

Junchao Zhang junchao.zhang at gmail.com
Thu Oct 5 16:29:30 CDT 2023


Wait a moment, it seems this is because we do not have a GPU implementation
of MatShift...
Let me see how to add it.
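
For reference, the operation in question is just the following (a minimal
sketch; J and alpha stand in for the Jacobian matrix and the shift value, not
the exact call site in TS/xolotl):

    PetscCall(MatShift(J, alpha)); /* J <- J + alpha*I; with no GPU
                                      implementation, this goes through the
                                      host for aijkokkos matrices */
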
--Junchao Zhang


On Thu, Oct 5, 2023 at 10:58 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> Hi, Philip,
>   I looked at the hpcdb-NE_3-cuda file. It seems you used MatSetValues()
> instead of the COO interface?  MatSetValues() needs to copy the data from
> device to host and thus is expensive.
>   Do you have profiling results with COO enabled?
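>
>   For reference, a minimal sketch of the COO path (the names J, ncoo, coo_i,
> coo_j, and d_vals are illustrative, not taken from your code):
>
>     PetscCall(MatSetPreallocationCOO(J, ncoo, coo_i, coo_j)); /* once, with the COO index arrays */
>     PetscCall(MatSetValuesCOO(J, d_vals, INSERT_VALUES));     /* each assembly; d_vals may be a device pointer */
>
>   With aijkokkos, MatSetValuesCOO() can take the values directly from device
> memory, whereas MatSetValues() stages them through the host.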
>
> [image: Screenshot 2023-10-05 at 10.55.29 AM.png]
>
>
> --Junchao Zhang
>
>
> On Mon, Oct 2, 2023 at 9:52 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Hi, Philip,
>>   I will look into the tarballs and get back to you.
>>    Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Oct 2, 2023 at 9:41 AM Fackler, Philip via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>> We finally have xolotl ported to use the new COO interface and the
>>> aijkokkos implementation for Mat (and kokkos for Vec). Comparing this port
>>> to our previous version (using MatSetValuesStencil and the default Mat and
>>> Vec implementations), we expected to see an improvement in performance for
>>> both the "serial" and "cuda" builds (here I'm referring to the kokkos
>>> configuration).
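>>>
>>> (As a concrete illustration of that configuration: the Kokkos-backed types
>>> can be selected at runtime with options along the lines of "-dm_mat_type
>>> aijkokkos -dm_vec_type kokkos", assuming the Mat and Vec are created through
>>> a DM; this is a sketch of typical options, not necessarily our exact input.)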
>>>
>>> Attached are two plots that show timings for three different cases. All
>>> of these were run on Ascent (the Summit-like training system) with 6 MPI
>>> tasks (on a single node). The CUDA cases were given one GPU per task (and
>>> used CUDA-aware MPI). The labels on the blue bars indicate speedup. In all
>>> cases we used "-fieldsplit_0_pc_type jacobi" to keep the comparison as
>>> consistent as possible.
>>>
>>> The performance of RHSJacobian (where the bulk of the computation happens in
>>> xolotl) was basically as expected (better than expected in the serial build).
>>> The NE_3 case on CUDA was the only one that performed worse, which is not
>>> surprising since its GPU workload is much smaller. We still have more
>>> optimization to do there.
>>>
>>> The real surprise was how much worse the overall solve times were. This
>>> seems to be due simply to switching to the kokkos-based implementation. I'm
>>> wondering if there are any changes we can make in configuration or runtime
>>> arguments to help with PETSc's performance here. Any help looking into this
>>> would be appreciated.
>>>
>>> The tarballs linked here
>>> <https://drive.google.com/file/d/19X_L3SVkGBM9YUzXnRR_kVWFG0JFwqZ3/view?usp=drive_link>
>>> and here
>>> <https://drive.google.com/file/d/15yDBN7-YlO1g6RJNPYNImzr611i1Ffhv/view?usp=drive_link>
>>> are profiling databases which, once extracted, can be viewed with
>>> hpcviewer. I don't know how helpful that will be, but hopefully it can give
>>> you some direction.
>>>
>>> Thanks for your help,
>>>
>>>
>>> *Philip Fackler *
>>> Research Software Engineer, Application Engineering Group
>>> Advanced Computing Systems Research Section
>>> Computer Science and Mathematics Division
>>> *Oak Ridge National Laboratory*
>>>
>>
