[petsc-users] Performance with GPU and multiple MPI processes per GPU
Gabriele Penazzi
Gabriele.Penazzi at synopsys.com
Thu Jan 22 09:21:02 CST 2026
Hi Junchao,
I am already using MPS, but thanks for the suggestion.
It does indeed make a large difference; I think it would be a very useful documentation entry in general.
Thank you,
Gabriele
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Tuesday, January 20, 2026 5:17 PM
To: Gabriele Penazzi <Gabriele.Penazzi at synopsys.com>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] Performance with GPU and multiple MPI processes per GPU
Hello Gabriele,
Maybe you can try the CUDA MPS service to effectively map multiple processes to one GPU. First, I would create a directory $HOME/tmp/nvidia-mps (by default, CUDA will use /tmp/nvidia-mps), then use these steps:
export CUDA_MPS_PIPE_DIRECTORY=$HOME/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp/nvidia-mps
# Start MPS
nvidia-cuda-mps-control -d
# run the test
mpiexec -n 16 ./test
# shut down MPS
echo quit | nvidia-cuda-mps-control
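If you want to double-check that the ranks are really sharing the GPU through MPS, here is an optional sanity check using the standard MPS control commands:
# List the PIDs of active MPS servers; an empty list only means no client has connected yet
echo get_server_list | nvidia-cuda-mps-control
# While the solver runs, nvidia-smi should show the processes attached through the MPS server
nvidia-smi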
I also like to block-map MPI processes to GPUs manually by manipulating the environment variable CUDA_VISIBLE_DEVICES. So I have this bash script set_gpu_device.sh on my PATH (assuming you use Open MPI):
#!/bin/bash
# Block-map local MPI ranks to GPUs, e.g. with 16 ranks per node and 2 GPUs,
# ranks 0-7 get GPU 0 and ranks 8-15 get GPU 1.
GPUS_PER_NODE=2
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/GPUS_PER_NODE)))
# Replace this wrapper with the actual command, preserving its arguments
exec "$@"
In other words, to run the test, I use
mpiexec -n 16 set_gpu_device.sh ./test
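If you launch with MPICH or under Slurm rather than Open MPI, the same block mapping works; you only need to read the local rank and local size from different environment variables. A sketch, assuming the variable names exported by MPICH's Hydra launcher (Slurm users could use SLURM_LOCALID and SLURM_NTASKS_PER_NODE instead); please double-check the names for your launcher:
#!/bin/bash
# Same block mapping as set_gpu_device.sh, but using MPICH/Hydra local-rank variables
GPUS_PER_NODE=2
export CUDA_VISIBLE_DEVICES=$((MPI_LOCALRANKID/(MPI_LOCALNRANKS/GPUS_PER_NODE)))
exec "$@"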
Let us know if it helps so that we can add the instructions to the PETSc doc.
Thanks.
--Junchao Zhang
On Tue, Jan 20, 2026 at 8:21 AM Gabriele Penazzi via petsc-users <petsc-users at mcs.anl.gov> wrote:
Hi.
I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.
I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to a single process per GPU.
For example, a run with 2 GPUs and 2 MPI processes is about 40% faster than the same calculation with 2 GPUs and 16 MPI processes.
I would assume the natural MPI/GPU affinity is 1-to-1. However, the rest of my application benefits from multiple MPI processes driving each GPU via NVIDIA MPS, so I am trying to understand whether this slowdown is expected, whether I am missing something in the initialization/setup, or whether my best option is to constrain MPI/GPU access to 1-to-1, especially for the PETSc linear solver step. I could not find explicit information about this in the manual.
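For context, the solver step is launched roughly like the following (a minimal sketch; the executable name and rank count are placeholders, and the options are the standard PETSc runtime flags for CUDA):
# 2 GPUs per node, 16 MPI ranks; vectors and matrices placed on the GPU,
# -log_view used to compare timings between runs
mpiexec -n 16 ./my_app -ksp_type cg -vec_type cuda -mat_type aijcusparse -log_view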
Is there any user or maintainer who can tell me more about this use case?
Best Regards,
Gabriele Penazzi