<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>


</head>


<body dir="ltr">


<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


Hi Junchao,<br>


<br>


I am already using MPS, but thanks for the suggestion. <br>


It does indeed make a large difference, and I think it would be a very useful documentation entry in general.</div>


<div id="appendonsend"></div>


<div class="elementToProof"><br>


<span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">Thank you,<br>


Gabriele</span></div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<hr style="display: inline-block; width: 98%;">


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<b>From:</b> Junchao Zhang &lt;junchao.zhang@gmail.com&gt;<br>


<b>Sent:</b> Tuesday, January 20, 2026 5:17 PM<br>


<b>To:</b> Gabriele Penazzi &lt;Gabriele.Penazzi@synopsys.com&gt;<br>


<b>Cc:</b> petsc-users@mcs.anl.gov &lt;petsc-users@mcs.anl.gov&gt;<br>


<b>Subject:</b> Re: [petsc-users] Performance with GPU and multiple MPI processes per GPU


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<div style="direction: ltr;">Hello Gabriele,</div>


<div style="direction: ltr;">  Maybe you can try the CUDA MPS service, which lets multiple processes share one GPU effectively. First, I would create a directory $HOME/tmp/nvidia-mps (by default, CUDA uses /tmp/nvidia-mps), then follow these steps:</div>


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr; font-family: monospace;">export CUDA_MPS_PIPE_DIRECTORY=$HOME/tmp/nvidia-mps</div>


<div style="direction: ltr; font-family: monospace;">export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp/nvidia-mps<br>


<br>


# Start MPS<br>


nvidia-cuda-mps-control -d<br>


<br>


# run the test</div>


<div style="direction: ltr; font-family: monospace;">mpiexec -n 16 ./test </div>


<div style="direction: ltr; font-family: monospace;"><br>


</div>


<div style="direction: ltr; font-family: monospace;"># shut down MPS<br>


echo quit | nvidia-cuda-mps-control</div>
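<div style="direction: ltr;">The steps above can be sketched as one bash function, so that MPS is shut down even when the run fails. This is only a sketch under assumptions: the function name run_with_mps and its argument layout are hypothetical, and it assumes nvidia-cuda-mps-control is on the PATH.</div>
<pre>
```shell
#!/bin/bash
# Sketch only: run_with_mps is a hypothetical helper, not part of CUDA or PETSc.
# Usage: run_with_mps MPS_DIR COMMAND [ARGS...]
run_with_mps() {
    local mps_dir="$1"
    shift

    # Create the per-user MPS directory and point both MPS variables at it.
    mkdir -p "$mps_dir" || return 1
    export CUDA_MPS_PIPE_DIRECTORY="$mps_dir"
    export CUDA_MPS_LOG_DIRECTORY="$mps_dir"

    # Start the MPS control daemon; give up if it cannot be started.
    nvidia-cuda-mps-control -d || return 1

    # Run the wrapped command and remember its exit status.
    "$@"
    local status=$?

    # Always shut MPS down, even if the run failed.
    echo quit | nvidia-cuda-mps-control
    return $status
}
```
</pre>
<div style="direction: ltr;">With this function defined, the run above would become: run_with_mps $HOME/tmp/nvidia-mps mpiexec -n 16 ./test</div>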


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr;">I also like to block-map MPI processes to GPUs manually by setting the environment variable CUDA_VISIBLE_DEVICES. To do that, I keep this bash script, <i>set_gpu_device.sh</i>, on my PATH (it assumes you use OpenMPI):</div>


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr; font-family: monospace;">#!/bin/bash<br>


GPUS_PER_NODE=2<br>


export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/GPUS_PER_NODE)))</div>


<div style="direction: ltr; font-family: monospace;">exec "$@"</div>


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr;">In other words, to run the test, I use</div>


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr; font-family: monospace;">mpiexec -n 16 set_gpu_device.sh ./test</div>
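<div style="direction: ltr;">To make the block mapping concrete, here is a small sketch that isolates the arithmetic from set_gpu_device.sh. The helper name gpu_for_rank is hypothetical. Unlike the script above, it rounds the ranks-per-GPU count up, which is an assumption added here so that a rank count that is not a multiple of the GPU count still yields a valid device index.</div>
<pre>
```shell
#!/bin/bash
# gpu_for_rank is a hypothetical helper, not part of OpenMPI or CUDA.
# Usage: gpu_for_rank LOCAL_RANK LOCAL_SIZE GPUS_PER_NODE
gpu_for_rank() {
    local rank=$1 size=$2 gpus=$3
    # Ranks per GPU, rounded up (assumption: differs from the plain
    # integer division used in set_gpu_device.sh).
    local per_gpu=$(( (size + gpus - 1) / gpus ))
    echo $(( rank / per_gpu ))
}

# Show the mapping for a few local ranks: 16 ranks on a node with 2 GPUs.
for rank in 0 7 8 15; do
    echo "local rank $rank uses GPU $(gpu_for_rank "$rank" 16 2)"
done
```
</pre>
<div style="direction: ltr;">With 16 ranks and 2 GPUs, local ranks 0 to 7 map to GPU 0 and local ranks 8 to 15 map to GPU 1, which matches what the script above computes for this case.</div>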


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr;">Let us know if it helps so that we can add the instructions to the PETSc doc. </div>


<div style="direction: ltr;"><br>


</div>


<div style="direction: ltr;">Thanks.</div>


<div style="direction: ltr;">--Junchao Zhang</div>


<div style="direction: ltr;"><br>


</div>


<div><br>


</div>


<div style="direction: ltr;">On Tue, Jan 20, 2026 at 8:21 AM Gabriele Penazzi via petsc-users &lt;<a class="OWAAutoLink" id="OWA9e3ae996-21dc-6829-d690-150dbf1e464e" href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>&gt; wrote:</div>


<blockquote style="margin: 0px 0px 0px 0.8ex; padding-left: 1ex; border-left: 1px solid rgb(204, 204, 204);">


<div style="direction: ltr; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">


Hi.<br>


<br>


I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.<br>


<br>


I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU.<br>
For example, 2 GPUs with 2 MPI processes are about 40% faster than the same calculation on 2 GPUs with 16 MPI processes.<br>


<br>


I would assume the natural MPI/GPU affinity is 1-1. However, the rest of my application can benefit from multiple MPI processes driving each GPU via NVIDIA MPS. I am therefore trying to understand whether this is expected, whether I am missing something in the initialization/setup, or whether my best choice is to constrain MPI/GPU access to 1-1, especially for the PETSc linear solver step. I could not find explicit information about this in the manual.<br>


<br>


Is there any user or maintainer who can tell me more about this use case?<br>


 </div>


<div id="x_m_-1961533414287312277Signature">


<div style="direction: ltr; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">


Best Regards,<br>


Gabriele Penazzi</div>


<div style="direction: ltr; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">


<br>


</div>




</div>


</blockquote>


</body>


</html>