[petsc-users] CUDA-Aware MPI & PETSc
Zhang, Junchao
jczhang at mcs.anl.gov
Thu Aug 22 13:03:28 CDT 2019
I definitely will. Thanks.
--Junchao Zhang
On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller <david.gutzwiller at gmail.com> wrote:
Hello Junchao,
Spectacular news!
I have our production code running on Summit (Power9 + Nvidia V100) and on local x86 workstations, and I can definitely provide comparative benchmark data with this feature once it is ready. Just let me know when it is available for testing and I'll be happy to contribute.
Thanks,
-David
On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
This feature is under active development. I hope I can make it usable in a couple of weeks. Thanks.
--Junchao Zhang
On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <petsc-users at mcs.anl.gov> wrote:
Hello,
I'm currently using PETSc for GPU acceleration of a simple Krylov solver (GMRES, without preconditioning) within the framework of our in-house multigrid solver. I am getting a good GPU speedup on the finest grid level but progressively worse performance on each coarser level. This is not surprising, but I still hope to squeeze out some more performance, ideally making it worthwhile to run some or all of the coarse grids on the GPU.
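For reference, a minimal sketch of this kind of setup (the problem size, matrix assembly, and right-hand side below are placeholders, and the GPU types could equally be selected at run time with -vec_type cuda -mat_type aijcusparse instead of being hard-coded):

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, b;
    KSP            ksp;
    PC             pc;
    PetscInt       n = 100;   /* placeholder problem size */
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

    /* Matrix stored on the device as cuSPARSE CSR */
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
    ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr);
    ierr = MatSetUp(A);CHKERRQ(ierr);
    /* ... insert values with MatSetValues() here ... */
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    /* Vectors inherit the GPU type (VECCUDA) from the matrix */
    ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
    ierr = VecSet(b, 1.0);CHKERRQ(ierr);   /* placeholder right-hand side */

    /* GMRES without preconditioning */
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCNONE);CHKERRQ(ierr);
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&b);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }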
I started investigating with nvprof / nsight and essentially came to the same conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD and DtoH) in Krylov solver"). My question is a follow-up to that thread:
The MPI communication is staged from the host, which results in some H<->D transfers for every mat-vec operation. A CUDA-aware MPI implementation might avoid these transfers for communication between ranks that are assigned to the same accelerator. Has this been implemented or tested?
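To illustrate the difference, here is a generic sketch with plain MPI and the CUDA runtime (not PETSc's internal code; the buffer names, message size, and destination rank are hypothetical, and error checking is omitted). The host-staged pattern needs an explicit device-to-host copy before each send, whereas a CUDA-aware MPI accepts the device pointer directly:

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Hypothetical halo exchange of n doubles from device buffer d_buf to rank dest. */

  void send_halo_staged(const double *d_buf, double *h_buf, int n, int dest)
  {
    /* Host-staged path: explicit DtoH copy, then MPI operates on host memory. */
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }

  void send_halo_cuda_aware(const double *d_buf, int n, int dest)
  {
    /* CUDA-aware path: the MPI library is handed the device pointer directly,
       so no explicit HtoD/DtoH staging appears in the application. */
    MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }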
In our solver we typically run with multiple MPI ranks all assigned to a single device, and running with a single rank is not really feasible, as we still have a sizable amount of work for the CPU to chew through. Thus, I think quite a lot of the H<->D transfers could be avoided if I could skip the MPI staging on the host. I am quite new to PETSc, so I wanted to ask around before blindly digging into this.
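For completeness, one common way to pin several ranks on one node to the available GPUs (just a sketch of one possible approach, not necessarily how PETSc handles it) is to derive a node-local rank and pass it to cudaSetDevice; with more ranks than devices this naturally maps multiple ranks onto the same device:

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Assign the calling rank to a device based on its node-local rank.
     With more ranks per node than GPUs, several ranks share a device. */
  static void assign_device(MPI_Comm comm)
  {
    MPI_Comm nodecomm;
    int local_rank, ndevices;

    /* Ranks sharing a node end up in the same sub-communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &local_rank);

    cudaGetDeviceCount(&ndevices);
    if (ndevices > 0) cudaSetDevice(local_rank % ndevices); /* e.g. 6 ranks, 1 GPU -> all on device 0 */

    MPI_Comm_free(&nodecomm);
  }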
Thanks for your help,
David