[petsc-users] CUDA-Aware MPI & PETSc

David Gutzwiller david.gutzwiller at gmail.com
Thu Aug 22 11:33:23 CDT 2019


Hello Junchao,

Spectacular news!

I have our production code running on Summit (Power9 + Nvidia V100) and on
local x86 workstations, and I can definitely provide comparative benchmark
data with this feature once it is ready.  Just let me know when it is
available for testing and I'll be happy to contribute.

Thanks,

-David


On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:

> This feature is under active development. I hope I can make it usable in a
> couple of weeks. Thanks.
> --Junchao Zhang
>
>
> On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
>> Hello,
>>
>> I'm currently using PETSc for the GPU acceleration of a simple Krylov
>> solver (GMRES, no preconditioning) within the framework of our in-house
>> multigrid solver.  I am getting a good GPU speedup on the finest grid
>> level but progressively worse performance on each coarser level.  This is
>> not surprising, but I still hope to squeeze out some more performance,
>> ideally enough to make it worthwhile to run some or all of the coarse
>> grids on the GPU.  (A minimal PETSc setup matching this configuration is
>> sketched after the quoted message.)
>>
>> I started investigating with nvprof / nsight and essentially came to the
>> same conclusion that Xiangdong reported in a recent thread (July 16,
>> "MemCpy (HtoD and DtoH) in Krylov solver").  My question is a follow-up to
>> that thread:
>>
>> The MPI communication is staged from the host, which results in H<->D
>> transfers for every mat-vec operation.  A CUDA-aware MPI implementation
>> might avoid these transfers for communication between ranks that are
>> assigned to the same accelerator.  Has this been implemented or tested?
>> (A bare-bones illustration of CUDA-aware MPI follows after the quoted
>> message.)
>>
>> In our solver we typically run with multiple MPI ranks all assigned to a
>> single device, and running with a single rank is not really feasible as we
>> still have a sizable amount of work for the CPU to chew through.  Thus, I
>> think quite a lot of the H<->D transfers could be avoided if I can skip the
>> MPI staging on the host. I am quite new to PETSc so I wanted to ask around
>> before blindly digging into this.
>>
>> Thanks for your help,
>>
>> David
>>
>>
>
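
For anyone following the thread later: the configuration described in the
quoted message (GMRES, no preconditioner, matrix and vectors resident on the
GPU) can be reproduced with a small standalone PETSc program.  The sketch
below is illustrative only; the 1-D Laplacian is a stand-in for the real
multigrid-level operators, and it assumes a PETSc build configured with CUDA
support.  With -mat_type aijcusparse on the command line the matrix (and the
vectors created from it) live on the device, so the mat-vec products inside
KSPSolve run on the GPU.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       i, n = 100, Istart, Iend;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Stand-in operator: a 1-D Laplacian.  -mat_type aijcusparse selects the
     GPU matrix type at runtime. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
    if (i < n - 1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* Vectors created from A pick up a compatible (CUDA) type when A is
     aijcusparse. */
  ierr = MatCreateVecs(A, &x, &b); CHKERRQ(ierr);
  ierr = VecSet(b, 1.0); CHKERRQ(ierr);

  /* GMRES with no preconditioner, matching the configuration above. */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCNONE); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&b); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with something like: mpiexec -n 4 ./gmres_gpu -mat_type aijcusparse
-ksp_monitor.  All names here are generic and not taken from the production
code discussed in the thread.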
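
On the CUDA-aware MPI point itself: "CUDA-aware" simply means the MPI library
accepts device pointers as send/receive buffers, so a halo exchange does not
have to be copied to the host by the application first.  A bare-bones
illustration follows (plain MPI + CUDA, not PETSc code, and it assumes an MPI
library actually built with CUDA support, e.g. Open MPI configured with
--with-cuda):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
  const int n = 1 << 20;       /* 1M doubles */
  double   *d_buf;
  int       rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Allocate the message buffer directly on the device. */
  cudaMalloc((void **)&d_buf, n * sizeof(double));

  if (rank == 0) {
    /* With a CUDA-aware MPI the device pointer is passed straight to
       MPI_Send; the library moves the data (peer-to-peer, GPUDirect, or an
       internal staging copy) without an explicit cudaMemcpy here. */
    MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  cudaFree(d_buf);
  MPI_Finalize();
  return 0;
}

(Run with at least two ranks, e.g. mpiexec -n 2 ./cuda_aware_demo.)  Whether
the data then moves peer-to-peer, via GPUDirect RDMA, or through the
library's own staging buffers is up to the MPI implementation; the point for
PETSc would be that the halo exchange in MatMult could hand device buffers to
MPI in this way instead of staging them on the host first.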

