[petsc-users] CUDA-Aware MPI & PETSc

David Gutzwiller david.gutzwiller at gmail.com
Tue Oct 8 09:57:05 CDT 2019


Hi Junchao,

Thanks for letting me know.

I'm currently in a bit of a crunch for an upcoming product release, but
once I have a few days to refocus on this task I'll test the latest master
and let you know how it performs.

-David


On Mon, Oct 7, 2019 at 3:09 PM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:

> Hello, David,
>    It took longer than I expected to add the CUDA-aware MPI feature to
> PETSc. It is now in PETSc 3.12, released last week. I have a small fix on
> top of that release, so it is better to use the petsc master branch. Use
> the PETSc option -use_gpu_aware_mpi to enable it. On Summit, you also need
> jsrun --smpiargs="-gpu" to enable IBM Spectrum MPI's CUDA support, and if
> you run with multiple MPI ranks per GPU you also need #BSUB -alloc_flags
> gpumps in your job script.
>   My experiments on Summit (using a simple test doing repeated MatMult)
> are mixed. With one MPI rank per GPU, I saw a very good performance
> improvement (up to 25%). But with multiple ranks per GPU, I did not see an
> improvement. That is surprising, since it should be easier for MPI ranks
> to exchange data residing on the same GPU. I'm investigating this issue.
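>
>   For reference, the test is essentially a repeated MatMult loop. A
> bare-bones sketch of that pattern is below (only an illustration, not the
> exact benchmark; it assumes the matrix is loaded from a PETSc binary file
> given with a -f option, and the GPU types are selected on the command
> line, e.g. -mat_type aijcusparse -vec_type cuda):
>
>     #include <petscmat.h>
>
>     int main(int argc, char **argv)
>     {
>       Mat            A;
>       Vec            x, y;
>       PetscViewer    fd;
>       char           file[PETSC_MAX_PATH_LEN];
>       PetscInt       i, niter = 100;
>       PetscBool      flg;
>       PetscErrorCode ierr;
>
>       ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>       ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
>       if (!flg) SETERRQ(PETSC_COMM_WORLD, PETSC_ERR_USER, "Must provide a matrix file with -f");
>
>       /* Matrix type (e.g. aijcusparse) is picked up from the options database */
>       ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>       ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>       ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &fd);CHKERRQ(ierr);
>       ierr = MatLoad(A, fd);CHKERRQ(ierr);
>       ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
>
>       ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
>       ierr = VecSet(x, 1.0);CHKERRQ(ierr);
>
>       /* Repeated MatMult; with CUDA-aware MPI the halo exchange can stay on the device */
>       for (i = 0; i < niter; i++) {
>         ierr = MatMult(A, x, y);CHKERRQ(ierr);
>       }
>
>       ierr = VecDestroy(&x);CHKERRQ(ierr);
>       ierr = VecDestroy(&y);CHKERRQ(ierr);
>       ierr = MatDestroy(&A);CHKERRQ(ierr);
>       ierr = PetscFinalize();
>       return ierr;
>     }
>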
>   If you can also evaluate this feature with your production code,
> that would be helpful.
>   Thanks.
> --Junchao Zhang
>
>
> On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller <
> david.gutzwiller at gmail.com> wrote:
>
>> Hello Junchao,
>>
>> Spectacular news!
>>
>> I have our production code running on Summit (Power9 + Nvidia V100) and
>> on local x86 workstations, and I can definitely provide comparative
>> benchmark data with this feature once it is ready.  Just let me know when
>> it is available for testing and I'll be happy to contribute.
>>
>> Thanks,
>>
>> -David
>>
>>
>>
>> On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao <jczhang at mcs.anl.gov>
>> wrote:
>>
>>> This feature is under active development. I hope I can make it usable in
>>> a couple of weeks. Thanks.
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm currently using PETSc for the GPU acceleration of a simple Krylov
>>>> solver (GMRES without preconditioning) within the framework of our
>>>> in-house multigrid solver. I am getting a good GPU speedup on the finest
>>>> grid level but progressively worse performance on each coarser level.
>>>> This is not surprising, but I still hope to squeeze out some more
>>>> performance, hopefully enough to make it worthwhile to run some or all
>>>> of the coarse grids on the GPU.
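>>>>
>>>> For context, the PETSc usage on a single grid level is roughly the
>>>> following (a simplified sketch rather than our production code; the
>>>> operator A and the vectors are assumed to be assembled elsewhere, with
>>>> the GPU backends chosen at runtime via options such as -mat_type
>>>> aijcusparse and -vec_type cuda):
>>>>
>>>>   #include <petscksp.h>
>>>>
>>>>   /* Plain GMRES with no preconditioner; the operator is already assembled. */
>>>>   PetscErrorCode SolveLevel(Mat A, Vec b, Vec x)
>>>>   {
>>>>     KSP            ksp;
>>>>     PC             pc;
>>>>     PetscErrorCode ierr;
>>>>
>>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
>>>>     ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
>>>>     ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
>>>>     ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>>>>     ierr = PCSetType(pc, PCNONE);CHKERRQ(ierr);   /* no preconditioning */
>>>>     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>>>>     ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>>>>     ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>>>>     return 0;
>>>>   }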
>>>>
>>>> I started investigating with nvprof / nsight and essentially came to
>>>> the same conclusion that Xiangdong reported in a recent thread (July 16,
>>>> "MemCpy (HtoD and DtoH) in Krylov solver").  My question is a follow-up to
>>>> that thread:
>>>>
>>>> The MPI communication is staged from the host, which results in some
>>>> H<->D transfers for every mat-vec operation.   A CUDA-aware MPI
>>>> implementation might avoid these transfers for communication between ranks
>>>> that are assigned to the same accelerator.   Has this been implemented or
>>>> tested?
>>>>
>>>> In our solver we typically run with multiple MPI ranks all assigned to
>>>> a single device, and running with a single rank is not really feasible as
>>>> we still have a sizable amount of work for the CPU to chew through.  Thus,
>>>> I think quite a lot of the H<->D transfers could be avoided if I could
>>>> skip the MPI staging on the host. I am quite new to PETSc, so I wanted to
>>>> ask around before blindly digging into this.
>>>>
>>>> Thanks for your help,
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>