[petsc-users] CUDA-Aware MPI & PETSc

David Gutzwiller david.gutzwiller at gmail.com
Wed Aug 21 15:20:01 CDT 2019


Hello,

I'm currently using PETSc for the GPU acceleration of a simple Krylov solver
(GMRES, without preconditioning) within the framework of our in-house
multigrid solver.  I am getting a good GPU speedup on the finest grid level
but progressively worse performance on each coarser level.  This is not
surprising, but I still hope to squeeze out some more performance, ideally
making it worthwhile to run some or all of the coarse grids on the GPU.
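For reference, the solve is configured roughly as in the sketch below (my own
minimal example under assumptions, not our actual code; the local size n and
the matrix assembly are placeholders, and error checking is omitted):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PC       pc;
  PetscInt n = 100;                        /* placeholder local size */

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* GPU-resident matrix; vectors created from it get the matching CUDA type.
     The same choice can be made at run time with
     -mat_type aijcusparse -vec_type cuda */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, n, n, PETSC_DECIDE, PETSC_DECIDE);
  MatSetType(A, MATAIJCUSPARSE);
  MatSetUp(A);
  /* ... insert matrix entries here ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  MatCreateVecs(A, &x, &b);

  /* GMRES with no preconditioning */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPGMRES);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCNONE);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}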

I started investigating with nvprof / nsight and essentially came to the
same conclusion that Xiangdong reported in a recent thread (July 16,
"MemCpy (HtoD and DtoH) in Krylov solver").  My question is a follow-up to
that thread:

The MPI communication is staged from the host, which results in some H<->D
transfers for every mat-vec operation.   A CUDA-aware MPI implementation
might avoid these transfers for communication between ranks that are
assigned to the same accelerator.   Has this been implemented or tested?
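To make the pattern concrete, here is the kind of exchange I mean (illustrative
only, with made-up names such as exchange_halo and d_buf; with a CUDA-aware MPI
the device pointer can be handed to MPI directly, so the explicit staging
copies disappear):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Exchange n doubles with a neighbouring rank. */
static void exchange_halo(double *d_buf, int n, int peer, MPI_Comm comm)
{
  /* Host-staged path (what nvprof currently shows around each mat-vec): */
  double *h_buf = (double *)malloc(n * sizeof(double));
  cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Sendrecv_replace(h_buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                       comm, MPI_STATUS_IGNORE);
  cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
  free(h_buf);

  /* CUDA-aware path: no staging copies, provided the MPI library was
     built with CUDA support.
  MPI_Sendrecv_replace(d_buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                       comm, MPI_STATUS_IGNORE);
  */
}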

In our solver we typically run with multiple MPI ranks all assigned to a
single device; running with a single rank is not really feasible, as we
still have a sizable amount of work for the CPU to chew through.  Thus, I
think quite a lot of the H<->D transfers could be avoided if I could skip
the MPI staging on the host.  I am quite new to PETSc, so I wanted to ask
around before blindly digging into this.
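The rank-to-device mapping I have in mind looks roughly like the following
(again only a sketch with an assumed helper name, bind_rank_to_device; all
node-local ranks end up on device 0 when the node has a single GPU):

#include <mpi.h>
#include <cuda_runtime.h>

static void bind_rank_to_device(MPI_Comm comm)
{
  MPI_Comm local;
  int      local_rank, ndev;

  /* Group the ranks that share a node, and hence can share a GPU. */
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
  MPI_Comm_rank(local, &local_rank);

  cudaGetDeviceCount(&ndev);
  cudaSetDevice(local_rank % ndev);   /* every rank -> device 0 if ndev == 1 */

  MPI_Comm_free(&local);
}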

Thanks for your help,

David
