[petsc-users] Unexpected performance losses switching to COO interface

Thu Oct 5 10:58:50 CDT 2023

Hi, Philip,
  I looked at the hpcdb-NE_3-cuda file. It seems you used MatSetValues()
instead of the COO interface? MatSetValues() has to copy the data from
device to host, which makes it expensive.
  Do you have profiling results with COO enabled?
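
The COO interface splits assembly into two phases. A one-time pattern analysis happens in PETSc's MatSetPreallocationCOO(), and a per-evaluation value update happens in MatSetValuesCOO(). Repeated Jacobian assembly therefore only hands over one array of values instead of making many per-entry MatSetValues() calls. The standalone C sketch below imitates that idea on the host with a hypothetical CooMat type. It is an illustration of the technique under those assumptions, not PETSc's implementation: PETSc sorts columns and can keep the values on the GPU, while this sketch leaves columns unsorted and runs on the CPU only. Duplicate (i, j) entries are summed, and each update replaces the previous values.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical CSR matrix assembled from a COO (coordinate) pattern.
   Illustrates the two-phase COO idea; this is not PETSc's code. */
typedef struct {
  int nrows;
  int nnz;       /* number of unique (row, col) pairs */
  int *rowptr;   /* row r occupies slots rowptr[r] .. rowptr[r+1]-1 */
  int *colidx;   /* column of each slot (unsorted within a row) */
  double *vals;  /* value of each slot */
  int ncoo;      /* number of COO entries; duplicates are allowed */
  int *coo2csr;  /* COO entry k is accumulated into slot coo2csr[k] */
} CooMat;

/* Phase 1, done once: analyze the (i, j) pattern and build the COO -> CSR map. */
static void coo_preallocate(CooMat *A, int nrows, int ncoo,
                            const int *ci, const int *cj) {
  int *start = calloc((size_t)nrows + 1, sizeof *start); /* upper-bound row offsets */
  int *fill = calloc((size_t)nrows, sizeof *fill);       /* unique entries per row */
  int *tmpcol = malloc((size_t)ncoo * sizeof *tmpcol);
  A->nrows = nrows;
  A->ncoo = ncoo;
  A->coo2csr = malloc((size_t)ncoo * sizeof *A->coo2csr);

  for (int k = 0; k < ncoo; k++) {
    assert(ci[k] >= 0 && ci[k] < nrows);
    start[ci[k] + 1]++;
  }
  for (int r = 0; r < nrows; r++) start[r + 1] += start[r];

  /* Deduplicate within each row with a linear search (fine for a sketch). */
  for (int k = 0; k < ncoo; k++) {
    int r = ci[k], slot = -1;
    for (int s = start[r]; s < start[r] + fill[r]; s++)
      if (tmpcol[s] == cj[k]) { slot = s; break; }
    if (slot < 0) { slot = start[r] + fill[r]++; tmpcol[slot] = cj[k]; }
    A->coo2csr[k] = slot;
  }

  /* Compact the upper-bound layout into the final CSR arrays. */
  A->rowptr = malloc(((size_t)nrows + 1) * sizeof *A->rowptr);
  A->rowptr[0] = 0;
  for (int r = 0; r < nrows; r++) A->rowptr[r + 1] = A->rowptr[r] + fill[r];
  A->nnz = A->rowptr[nrows];
  A->colidx = malloc((size_t)A->nnz * sizeof *A->colidx);
  A->vals = calloc((size_t)A->nnz, sizeof *A->vals);
  for (int r = 0; r < nrows; r++)
    memcpy(&A->colidx[A->rowptr[r]], &tmpcol[start[r]], (size_t)fill[r] * sizeof(int));
  for (int k = 0; k < ncoo; k++)
    A->coo2csr[k] = A->rowptr[ci[k]] + (A->coo2csr[k] - start[ci[k]]);

  free(start);
  free(fill);
  free(tmpcol);
}

/* Phase 2, done at every Jacobian evaluation: only a value array is passed.
   Duplicate COO entries are summed and the result replaces the old values. */
static void coo_set_values(CooMat *A, const double *v) {
  for (int s = 0; s < A->nnz; s++) A->vals[s] = 0.0;
  for (int k = 0; k < A->ncoo; k++) A->vals[A->coo2csr[k]] += v[k];
}

/* Look up entry (r, c); returns 0.0 if it is not in the pattern. */
static double coo_get(const CooMat *A, int r, int c) {
  for (int s = A->rowptr[r]; s < A->rowptr[r + 1]; s++)
    if (A->colidx[s] == c) return A->vals[s];
  return 0.0;
}

static void coo_destroy(CooMat *A) {
  free(A->rowptr);
  free(A->colidx);
  free(A->vals);
  free(A->coo2csr);
}
```

The pattern analysis (deduplication and building the map) is paid once. Each later call to coo_set_values() is a single pass of scatter-adds over a flat array. That is why the COO path suits a GPU matrix type better than many individual MatSetValues() calls.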

[image: Screenshot 2023-10-05 at 10.55.29 AM.png]

--Junchao Zhang

On Mon, Oct 2, 2023 at 9:52 AM Junchao Zhang <junchao.zhang at gmail.com>
wrote:

> Hi, Philip,
>   I will look into the tarballs and get back to you.
>    Thanks.
> --Junchao Zhang
>
>
> On Mon, Oct 2, 2023 at 9:41 AM Fackler, Philip via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
>> We finally have xolotl ported to use the new COO interface and the
>> aijkokkos implementation for Mat (and kokkos for Vec). Comparing this port
>> to our previous version (using MatSetValuesStencil and the default Mat and
>> Vec implementations), we expected to see an improvement in performance for
>> both the "serial" and "cuda" builds (here I'm referring to the kokkos
>> configuration).
>>
>> Attached are two plots that show timings for three different cases. All
>> of these were run on Ascent (the Summit-like training system) with 6 MPI
>> tasks (on a single node). The CUDA cases were given one GPU per task (and
>> used CUDA-aware MPI). The labels on the blue bars indicate speedup. In all
>> cases we used "-fieldsplit_0_pc_type jacobi" to keep the comparison as
>> consistent as possible.
>>
>> The performance of RHSJacobian (where the bulk of the computation happens
>> in xolotl) behaved basically as expected (better than expected in the
>> serial build). The NE_3 case in CUDA was the only one that performed
>> worse, which is not surprising, since its GPU workload is much smaller.
>> We still have more optimization to do on this.
>>
>> The real surprise was how much worse the overall solve times were. This
>> seems to be due simply to switching to the kokkos-based implementation. I'm
>> wondering if there are any changes we can make in configuration or runtime
>> arguments to help with PETSc's performance here. Any help looking into this
>> would be appreciated.
>>
>> The tarballs linked here
>> <https://drive.google.com/file/d/19X_L3SVkGBM9YUzXnRR_kVWFG0JFwqZ3/view?usp=drive_link>
>> and here
>> <https://drive.google.com/file/d/15yDBN7-YlO1g6RJNPYNImzr611i1Ffhv/view?usp=drive_link>
>> are profiling databases which, once extracted, can be viewed with
>> hpcviewer. I don't know how helpful that will be, but hopefully it can give
>> you some direction.
>>
>> Thanks for your help,
>>
>>
>> *Philip Fackler*
>> Research Software Engineer, Application Engineering Group
>> Advanced Computing Systems Research Section
>> Computer Science and Mathematics Division
>> *Oak Ridge National Laboratory*
>>
>
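
Moving from MatSetValuesStencil() to the COO interface, as in the port described above, means that each stencil entry has to be expressed as a (row, column) index pair in the matrix numbering. The sketch below makes several assumptions. It uses a hypothetical single-process grid of mx x my x mz points with dof components per point, stored interlaced, so that point (i, j, k) and component c map to ((k*my + j)*mx + i)*dof + c. The GridInfo and Stencil types and both helper functions are hypothetical, and Stencil is only loosely modeled on PETSc's MatStencil fields (i, j, k, c). With several MPI ranks, a DMDA uses its own parallel ordering, so this formula does not apply directly there.

```c
#include <assert.h>

/* Hypothetical grid description: mx x my x mz points, dof components each. */
typedef struct { int mx, my, mz, dof; } GridInfo;

/* Hypothetical stencil entry, loosely modeled on PETSc's MatStencil (i, j, k, c). */
typedef struct { int i, j, k, c; } Stencil;

/* Map a stencil to a global row/column index, assuming one process and
   interlaced components (all dof values of a grid point are contiguous). */
static int stencil_to_index(const GridInfo *g, Stencil s) {
  assert(s.i >= 0 && s.i < g->mx && s.j >= 0 && s.j < g->my);
  assert(s.k >= 0 && s.k < g->mz && s.c >= 0 && s.c < g->dof);
  return ((s.k * g->my + s.j) * g->mx + s.i) * g->dof + s.c;
}

/* Convert (row, col) stencil pairs into the coo_i / coo_j index arrays that a
   COO preallocation call expects. */
static void stencils_to_coo(const GridInfo *g, int n, const Stencil *rows,
                            const Stencil *cols, int *coo_i, int *coo_j) {
  for (int e = 0; e < n; e++) {
    coo_i[e] = stencil_to_index(g, rows[e]);
    coo_j[e] = stencil_to_index(g, cols[e]);
  }
}
```

Because the nonzero pattern of a Jacobian usually does not change between evaluations, this conversion only has to run once, before the preallocation step. Afterwards, only the values array changes from one evaluation to the next.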
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2023-10-05 at 10.55.29 AM.png
Type: image/png
Size: 144341 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20231005/9b0f2e03/attachment-0001.png>