[petsc-users] [Xolotl-psi-development] [EXTERNAL] Re: Unexpected performance losses switching to COO interface

Blondel, Sophie sblondel at utk.edu
Wed Nov 29 10:03:49 CST 2023


Hi Jed,

I'm not sure I can answer your question correctly because I don't really understand how the split is done. Is it related to the on-diagonal and off-diagonal parts? If so, the off-diagonal part is usually small (fewer than 20 DOFs) and comes from diffusion, while the diagonal part involves thousands of DOFs for the reaction term.

Let us know what we can do to answer this question more accurately.

Cheers,

Sophie
________________________________
From: Jed Brown <jed at jedbrown.org>
Sent: Tuesday, November 28, 2023 19:07
To: Fackler, Philip <facklerpw at ornl.gov>; Junchao Zhang <junchao.zhang at gmail.com>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>
Subject: Re: [Xolotl-psi-development] [petsc-users] [EXTERNAL] Re: Unexpected performance losses switching to COO interface

"Fackler, Philip via petsc-users" <petsc-users at mcs.anl.gov> writes:

> That makes sense. Here are the arguments that I think are relevant:
>
> -fieldsplit_1_pc_type redundant -fieldsplit_0_pc_type sor -pc_type fieldsplit -pc_fieldsplit_detect_coupling

What sort of physics are in splits 0 and 1?

SOR is not a good GPU algorithm, so we'll want to change that one way or another. Are the splits of similar size or very different?
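The following plain-Python sketch illustrates why SOR maps poorly to GPUs. It is not PETSc's PCJACOBI or PCSOR code, and it runs on an assumed 4x4 diagonally dominant system. A Jacobi sweep computes every row from the previous iterate, so all rows are independent and can be updated concurrently. A forward SOR sweep reads values that were already overwritten earlier in the same sweep, which forces the rows to be processed in order. The "-fieldsplit_0_pc_type jacobi" option used elsewhere in this thread is an example of the first kind.

```python
# Illustrative only: plain-Python sweeps on a small dense system,
# not PETSc's PCJACOBI / PCSOR implementations.

A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
b = [3.0, 2.0, 2.0, 3.0]  # assumed test system; exact solution is x = [1, 1, 1, 1]
n = len(b)


def jacobi_sweep(x_old):
    """Every row reads only the previous iterate, so rows are independent
    and could be updated concurrently (e.g. one GPU thread per row)."""
    return [
        (b[i] - sum(A[i][j] * x_old[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def sor_sweep(x, omega=1.0):
    """Forward SOR: row i reads entries j < i that were already updated
    in this same sweep, so rows must be processed one after another."""
    x = list(x)
    for i in range(n):
        s = b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (1.0 - omega) * x[i] + omega * s / A[i][i]
    return x


x_jac = [0.0] * n
x_sor = [0.0] * n
for _ in range(50):
    x_jac = jacobi_sweep(x_jac)
    x_sor = sor_sweep(x_sor, omega=1.1)

# Largest error of each method after 50 sweeps.
print(max(abs(v - 1.0) for v in x_jac), max(abs(v - 1.0) for v in x_sor))
```

Starting from a zero guess, the two methods already disagree on row 1 after one sweep. Jacobi gives 0.5. SOR with omega = 1 gives 0.6875 because it has already used the freshly updated row 0. That data dependency is what serializes SOR.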

> What would you suggest to make this better?
>
> Also, note that the cases marked "serial" run on the CPU only, that is, using only the SERIAL backend for Kokkos.
>
> Philip Fackler
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> Oak Ridge National Laboratory
> ________________________________
> From: Junchao Zhang <junchao.zhang at gmail.com>
> Sent: Tuesday, November 28, 2023 15:51
> To: Fackler, Philip <facklerpw at ornl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>
> Subject: Re: [EXTERNAL] Re: [petsc-users] Unexpected performance losses switching to COO interface
>
> Hi, Philip,
>    I opened hpcdb-PSI_9-serial and it seems you used PCLU.  Since Kokkos does not have a GPU LU implementation, we do it on CPU via MatLUFactorNumeric_SeqAIJ(). Perhaps you can try other PC types?
>
> [Screenshot 2023-11-28 at 2.43.03 PM.png]
> --Junchao Zhang
>
>
> On Wed, Nov 22, 2023 at 10:43 AM Fackler, Philip <facklerpw at ornl.gov> wrote:
> I definitely dropped the ball on this. I'm sorry for that. I have new profiling data using the latest petsc/main (as of yesterday). I've put them in a single Google Drive folder linked here:
>
> https://drive.google.com/drive/folders/14ScvyfxOzc4OzXs9HZVeQDO-g6FdIVAI?usp=drive_link
>
> Have a happy holiday weekend!
>
> Thanks,
>
> Philip Fackler
> ________________________________
> From: Junchao Zhang <junchao.zhang at gmail.com>
> Sent: Monday, October 16, 2023 15:24
> To: Fackler, Philip <facklerpw at ornl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>
> Subject: Re: [EXTERNAL] Re: [petsc-users] Unexpected performance losses switching to COO interface
>
> Hi, Philip,
>    That branch was merged to petsc/main today. Let me know once you have new profiling results.
>
>    Thanks.
> --Junchao Zhang
>
>
> On Mon, Oct 16, 2023 at 9:33 AM Fackler, Philip <facklerpw at ornl.gov> wrote:
> Junchao,
>
> I've attached updated timing plots (red and blue are swapped from before; yellow is the new one). There is an improvement only for the NE_3 case with CUDA. Serial stays the same, and the PSI cases stay the same. In the PSI cases, MatShift doesn't show up (I assume because we're using different preconditioner arguments), so there must be some other primary culprit. I'll try to get updated profiling data to you soon.
>
> Thanks,
>
> Philip Fackler
> ________________________________
> From: Fackler, Philip via Xolotl-psi-development <xolotl-psi-development at lists.sourceforge.net>
> Sent: Wednesday, October 11, 2023 11:31
> To: Junchao Zhang <junchao.zhang at gmail.com>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>
> Subject: Re: [Xolotl-psi-development] [EXTERNAL] Re: [petsc-users] Unexpected performance losses switching to COO interface
>
> I'm on it.
>
> Philip Fackler
> ________________________________
> From: Junchao Zhang <junchao.zhang at gmail.com>
> Sent: Wednesday, October 11, 2023 10:14
> To: Fackler, Philip <facklerpw at ornl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>; Blondel, Sophie <sblondel at utk.edu>
> Subject: Re: [EXTERNAL] Re: [petsc-users] Unexpected performance losses switching to COO interface
>
> Hi,  Philip,
>   Could you try this branch jczhang/2023-10-05/feature-support-matshift-aijkokkos ?
>
>   Thanks.
> --Junchao Zhang
>
>
> On Thu, Oct 5, 2023 at 4:52 PM Fackler, Philip <facklerpw at ornl.gov> wrote:
> Aha! That makes sense. Thank you.
>
> Philip Fackler
> ________________________________
> From: Junchao Zhang <junchao.zhang at gmail.com>
> Sent: Thursday, October 5, 2023 17:29
> To: Fackler, Philip <facklerpw at ornl.gov>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; xolotl-psi-development at lists.sourceforge.net <xolotl-psi-development at lists.sourceforge.net>; Blondel, Sophie <sblondel at utk.edu>
> Subject: [EXTERNAL] Re: [petsc-users] Unexpected performance losses switching to COO interface
>
> Wait a moment, it seems it was because we do not have a GPU implementation of MatShift...
> Let me see how to add it.
> --Junchao Zhang
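Junchao's diagnosis was that MatShift had no GPU implementation for this matrix type. MatShift adds a multiple of the identity, A <- A + a*I. The plain-Python sketch below applies that operation to a CSR matrix and assumes every diagonal entry is stored. It shows that the work is one independent update per row, which suits a GPU. The function name csr_shift is hypothetical, and this is not PETSc's implementation.

```python
# Illustrative sketch of "A <- A + a*I" on a CSR matrix, the operation
# MatShift() performs; hypothetical helper, NOT PETSc's code.

def csr_shift(row_ptr, col_idx, vals, a):
    """Return a copy of `vals` with `a` added to each stored diagonal entry.
    Each row is handled independently, so on a GPU this maps naturally to
    one thread per row. Assumes every diagonal entry is in the pattern."""
    vals = list(vals)
    n_rows = len(row_ptr) - 1
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                vals[k] += a
                break
        else:  # loop finished without finding column i
            raise ValueError(f"row {i} has no stored diagonal entry")
    return vals


# 3x3 tridiag(-1, 2, -1) in CSR form.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
shifted = csr_shift(row_ptr, col_idx, vals, 0.5)
print(shifted)
```

Shifting this matrix by 0.5 changes only the three diagonal values, from 2.0 to 2.5. The off-diagonal entries and the sparsity pattern are untouched.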
>
>
> On Thu, Oct 5, 2023 at 10:58 AM Junchao Zhang <junchao.zhang at gmail.com> wrote:
> Hi, Philip,
>   I looked at the hpcdb-NE_3-cuda file. It seems you used MatSetValues() instead of the COO interface?  MatSetValues() needs to copy the data from device to host and thus is expensive.
>   Do you have profiling results with COO enabled?
>
> [Screenshot 2023-10-05 at 10.55.29 AM.png]
>
>
> --Junchao Zhang
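The following plain-Python sketch contrasts MatSetValues() with the COO interface (MatSetPreallocationCOO() followed by MatSetValuesCOO() in PETSc) by showing the general idea under stated assumptions. The (i, j) pattern is analyzed once and turned into a CSR layout, plus a map from each COO entry to its CSR slot. After that, each assembly is only a scatter-add of a value array, with duplicate (i, j) entries summed. The helper names coo_setup and coo_set_values are hypothetical, and this is not PETSc's implementation.

```python
# Illustrative sketch of the idea behind a COO assembly interface:
# one-time pattern setup, then repeated value-only assembly.
# Hypothetical helpers, NOT PETSc's MatSetPreallocationCOO/MatSetValuesCOO.

def coo_setup(n_rows, coo_i, coo_j):
    """Run once: build a CSR pattern (rows in order, columns sorted) and a
    map `perm` from each COO entry to the CSR slot it contributes to.
    Duplicate (i, j) pairs share a slot."""
    keys = sorted(set(zip(coo_i, coo_j)))
    slot_of = {key: k for k, key in enumerate(keys)}
    row_ptr = [0] * (n_rows + 1)
    for i, _ in keys:
        row_ptr[i + 1] += 1  # count nonzeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum -> row offsets
    col_idx = [j for _, j in keys]
    perm = [slot_of[(i, j)] for i, j in zip(coo_i, coo_j)]
    return row_ptr, col_idx, perm


def coo_set_values(perm, n_nonzeros, coo_v):
    """Run every assembly: a pure scatter-add with no index search.
    (A parallel version would need atomics or a sorted reduction
    to handle duplicate entries.)"""
    vals = [0.0] * n_nonzeros
    for k, v in zip(perm, coo_v):
        vals[k] += v
    return vals


# 2x2 example; entry (0, 0) appears twice and is summed.
coo_i = [0, 0, 1, 1, 0]
coo_j = [0, 1, 1, 0, 0]
row_ptr, col_idx, perm = coo_setup(2, coo_i, coo_j)
vals = coo_set_values(perm, len(col_idx), [1.0, 2.0, 3.0, 4.0, 10.0])
print(row_ptr, col_idx, vals)
```

In this example, setup produces row_ptr [0, 2, 4] and col_idx [0, 1, 0, 1]. Assembly then yields [11.0, 2.0, 4.0, 3.0], because the two (0, 0) contributions are summed into the same slot. Later assemblies reuse perm and only pass new values.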
>
>
> On Mon, Oct 2, 2023 at 9:52 AM Junchao Zhang <junchao.zhang at gmail.com> wrote:
> Hi, Philip,
>   I will look into the tarballs and get back to you.
>    Thanks.
> --Junchao Zhang
>
>
> On Mon, Oct 2, 2023 at 9:41 AM Fackler, Philip via petsc-users <petsc-users at mcs.anl.gov> wrote:
> We finally have Xolotl ported to use the new COO interface and the aijkokkos implementation for Mat (and Kokkos for Vec). Comparing this port to our previous version (using MatSetValuesStencil and the default Mat and Vec implementations), we expected to see an improvement in performance for both the "serial" and "cuda" builds (here I'm referring to the Kokkos configuration).
>
> Attached are two plots that show timings for three different cases. All of these were run on Ascent (the Summit-like training system) with 6 MPI tasks (on a single node). The CUDA cases were given one GPU per task (and used CUDA-aware MPI). The labels on the blue bars indicate speedup. In all cases we used "-fieldsplit_0_pc_type jacobi" to keep the comparison as consistent as possible.
>
> The performance of RHSJacobian (where the bulk of the computation in Xolotl happens) behaved basically as expected (better than expected in the serial build). The NE_3 case in CUDA was the only one that performed worse, which is not surprising, since its GPU workload is much smaller. We still have more optimization to do on this.
>
> The real surprise was how much worse the overall solve times were. This seems to be due simply to switching to the Kokkos-based implementation. I'm wondering whether there are any changes we can make in configuration or runtime arguments to help PETSc's performance here. Any help looking into this would be appreciated.
>
> The tarballs linked here<https://urldefense.us/v2/url?u=https-3A__drive.google.com_file_d_19X-5FL3SVkGBM9YUzXnRR-5FkVWFG0JFwqZ3_view-3Fusp-3Ddrive-5Flink&d=DwMFaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=DAkLCjn8leYU-uJ-kfNEQMhPZWx9lzc4d5KgIR-RZWQ&m=GTpC2k9hIdMhUg_aJkeAqd-1CP5M8bwJMJjTriVE1k-j36ZnEHerQkZOzszxWoG2&s=GW0ImGWhWr4rR5AoSULCnaP1CN1QWxTSeMDhdOuhTEA&e=> and here<https://urldefense.us/v2/url?u=https-3A__drive.google.com_file_d_15yDBN7-2DYlO1g6RJNPYNImzr611i1Ffhv_view-3Fusp-3Ddrive-5Flink&d=DwMFaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=DAkLCjn8leYU-uJ-kfNEQMhPZWx9lzc4d5KgIR-RZWQ&m=GTpC2k9hIdMhUg_aJkeAqd-1CP5M8bwJMJjTriVE1k-j36ZnEHerQkZOzszxWoG2&s=tO-BnNY2myA-pIsRnBjQNoaOSjn-B3-lWGiQp7XXJwk&e=> are profiling databases which, once extracted, can be viewed with hpcviewer. I don't know how helpful that will be, but hopefully it can give you some direction.
>
> Thanks for your help,
>
> Philip Fackler


_______________________________________________
Xolotl-psi-development mailing list
Xolotl-psi-development at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xolotl-psi-development