<div dir="ltr"><div>Hello Babriele,</div><div> Maybe you can try CUDA MPS service, to effectively map multiple processes to one GPU. First, I would create a directory $HOME/tmp/nvidia-mps (by default, cuda will use /tmp/nvidia-mps), then use these steps</div><div><br></div><div><font face="monospace">export CUDA_MPS_PIPE_DIRECTORY=$HOME/tmp/nvidia-mps</font></div><font face="monospace">export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp/nvidia-mps<br><br># Start MPS<br>nvidia-cuda-mps-control -d<br><br># run the test</font><div><font face="monospace">mpiexec -n 16 ./test <br></font><div><font face="monospace"><br></font></div><div><font face="monospace"># shut down MPS<br>echo quit | nvidia-cuda-mps-control</font><div><br></div><div>I would also like to block-map MPI processes to GPUs manually via manipulating the env var CUDA_VISIBLE_DEVICES. So I have this bash script<i> set_gpu_device.sh </i>on my PATH (assume you use OpenMPI)</div><div><br></div></div></div><font face="monospace">#!/bin/bash<br>GPUS_PER_NODE=2<br>export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/GPUS_PER_NODE)))<br></font><div><div><font face="monospace">exec $*</font></div></div><div><div><br></div><div>In other words, to run the test, I use</div><div><br></div><font face="monospace">mpiexec -n 16 set_gpu_device.sh ./test </font><div><br></div><div>Let us know if it helps so that we can add the instructions to the PETSc doc. </div><div><br></div><div>Thanks.</div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Jan 20, 2026 at 8:21 AM Gabriele Penazzi via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-1961533414287312277">
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
Hi.<br>
<br>
I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.<br>
<br>
I noticed that the performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU.<br>
For example, running with 2 GPUs and 2 MPI processes is about 40% faster than running the same calculation with 2 GPUs and 16 MPI processes.<br>
<br>
I would assume the natural MPI/GPU affinity is 1-to-1; however, the rest of my application benefits from multiple MPI processes driving the GPUs via NVIDIA MPS. I am therefore trying to understand whether this is expected, whether I am missing something in the
initialization/setup, or whether my best choice is to constrain access to 1 MPI process per GPU, especially for the PETSc linear solver step. I could not find explicit information about this in the manual.<br>
<br>
Is there any user or maintainer who can tell me more about this use case?<br>
</div>
<div id="m_-1961533414287312277Signature">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
Best Regards,<br>
Gabriele Penazzi</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
<br>
</div>
<p style="margin:0cm;font-family:Aptos,sans-serif;font-size:12pt">
</p>
<p style="margin:0cm;font-family:Aptos,sans-serif;font-size:12pt">
</p>
</div>
</div>
</div></blockquote></div>
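<div><br></div><div>For reference, here is a minimal sketch that combines the two ideas above (MPS plus block mapping) into a single job script, assuming Open MPI, 2 GPUs per node, and 16 ranks. The executable name ./test and the wrapper set_gpu_device.sh are taken from the steps above; the PETSc options -ksp_type cg -vec_type cuda -mat_type aijcusparse -log_view are placeholders for your actual binary and solver options.</div><div><br></div><div><font face="monospace">#!/bin/bash<br># Sketch: run a PETSc CUDA job with MPS and block rank-to-GPU mapping.<br><br># Use a per-user MPS directory instead of the default /tmp/nvidia-mps<br>export CUDA_MPS_PIPE_DIRECTORY=$HOME/tmp/nvidia-mps<br>export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp/nvidia-mps<br>mkdir -p "$CUDA_MPS_PIPE_DIRECTORY"<br><br># Start the MPS control daemon<br>nvidia-cuda-mps-control -d<br><br># Run 16 ranks through the set_gpu_device.sh wrapper; the PETSc options below are<br># assumptions for a CG solve on CUDA vectors/matrices and should be adjusted.<br>mpiexec -n 16 set_gpu_device.sh ./test -ksp_type cg -vec_type cuda -mat_type aijcusparse -log_view<br><br># Shut down MPS when the run is finished<br>echo quit | nvidia-cuda-mps-control</font></div><div><br></div><div>Comparing the -log_view timings of such a run with a plain 2-rank/2-GPU run should show whether MPS recovers the lost performance.</div>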