<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>


</head>


<body dir="ltr">


<div style="line-height: 20px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


Hi Barry, <br>


<br>


Yes, that's exactly the setup: multiple processes share a single physical GPU via MPS, and the GPUs are assigned upfront to guarantee fair balance.<br>


<br>


I’ve looked further into this, and the behavior seems to be related to the problem size in my application. When I increase the number of DOFs, I no longer observe any slowdown with multiple MPI processes per GPU.</div>


<div style="line-height: 20px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


<br>


I should also mention that I’m compiling PETSc <b>without GPU‑aware MPI</b>. I know this is not recommended, so my results may not be fully representative. Unfortunately, due to constraints in the toolchain I can use, this is the only way I can compile PETSc


 for the time being.</div>


<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


<br>


I can also reproduce the issue on a single GPU, but only for relatively small problems. For example, with about 2e6 DOFs, going from 4 to 8 MPI processes introduces a noticeable performance penalty on the GPU (while the same configuration still scales reasonably


 well on the CPU). I’ve attached the <code>-log_view</code> outputs for the 1‑, 4‑, and 8‑process cases for this setup.</div>
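<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
For a rough sense of scale, the number of rows each rank owns in this test can be estimated as follows. This is only a back-of-the-envelope sketch: it assumes an even row distribution across ranks, and the function name <code>rows_per_rank</code> is illustrative rather than anything from PETSc.</div>

<pre>
```python
def rows_per_rank(total_dofs, n_ranks):
    """Estimate the rows owned by each rank, assuming an even split."""
    return total_dofs // n_ranks


TOTAL_DOFS = 2_000_000  # "about 2e6 DOFs" in the single-GPU test

for n_ranks in (1, 4, 8):
    # Every rank shares the same physical GPU, so the total work is fixed
    # while the work carried by each rank's kernels shrinks.
    print(n_ranks, rows_per_rank(TOTAL_DOFS, n_ranks))
```
</pre>

<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
Under that assumption, each of the 8 processes handles roughly 250,000 rows, half of what each of the 4 processes handles.</div>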


<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


Since this degradation only shows up for smaller DOF counts, I suspect that I’m either misusing the library or operating in a regime where overheads dominate.</div>


<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">


Based on this, my tentative conclusion is that, in general, using a communicator that maps one MPI process per GPU is a better approach. Would you consider that a fair statement?</div>
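<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
To make concrete what I mean by such a communicator, here is a minimal sketch in plain Python with no MPI calls. The helper names (<code>gpu_assignment</code>, <code>solver_color</code>) are hypothetical, the block-wise mapping of node-local ranks to GPUs is an assumption about my setup rather than anything PETSc prescribes, and <code>None</code> stands in for <code>MPI_UNDEFINED</code>.</div>

<pre>
```python
def gpu_assignment(local_rank, ranks_per_node, gpus_per_node):
    """Map a node-local rank to (gpu_id, is_leader).

    Ranks are dealt to GPUs in contiguous blocks; the first rank of each
    block is the "leader" that would take part in the linear solve.
    """
    if ranks_per_node % gpus_per_node != 0:
        raise ValueError("ranks_per_node must be a multiple of gpus_per_node")
    ranks_per_gpu = ranks_per_node // gpus_per_node
    gpu_id = local_rank // ranks_per_gpu
    is_leader = local_rank % ranks_per_gpu == 0
    return gpu_id, is_leader


def solver_color(local_rank, ranks_per_node, gpus_per_node):
    """Color for a communicator split: 0 for GPU leaders, None otherwise.

    None is a placeholder for MPI_UNDEFINED, i.e. the rank is left out
    of the solver communicator.
    """
    _, is_leader = gpu_assignment(local_rank, ranks_per_node, gpus_per_node)
    return 0 if is_leader else None


# Example: 16 ranks sharing 2 GPUs, as in the original report.
leaders = [r for r in range(16) if solver_color(r, 16, 2) == 0]
print(leaders)
```
</pre>

<div style="line-height: 20px; margin-top: 1em; margin-bottom: 1em; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
In the actual application the color would be passed to <code>MPI_Comm_split</code>, and only the ranks in the resulting communicator would create the PETSc objects for the solver step, while the remaining ranks keep sharing the GPU via MPS for the rest of the computation.</div>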


<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">Thanks,<br>


Gabriele</span><br>


<br>


<br>


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<hr style="display: inline-block; width: 98%;">


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<b>From:</b> Barry Smith &lt;bsmith@petsc.dev&gt;<br>


<b>Sent:</b> Tuesday, January 20, 2026 4:14 PM<br>


<b>To:</b> Gabriele Penazzi &lt;Gabriele.Penazzi@synopsys.com&gt;<br>


<b>Cc:</b> petsc-users@mcs.anl.gov &lt;petsc-users@mcs.anl.gov&gt;<br>


<b>Subject:</b> Re: [petsc-users] Performance with GPU and multiple MPI processes per GPU


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<div> Let me try to understand your setup. </div>


<div><br>


</div>


<div>You have two physical GPUs and a CPU with at least 16 physical cores? </div>


<div><br>


</div>


<div>You run with 16 MPI processes, each using its own "virtual" GPU (via MPS). Thus, a single physical GPU is shared by 8 MPI processes?</div>


<div><br>


</div>


<div>What happens if you run with 4 MPI processes, compared with 2? </div>


<div><br>


</div>


<div>Can you run with -log_view and send the output when using 2, 4, and 8 MPI processes?  </div>


<div><br>


</div>


<div>Barry</div>


<div><br>


</div>


<div><br>


</div>


<blockquote>


<div>On Jan 19, 2026, at 5:52 AM, Gabriele Penazzi via petsc-users &lt;petsc-users@mcs.anl.gov&gt; wrote:</div>


<div><br>


</div>


<div style="text-align: left; text-indent: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt;">


Hi.<br>


<br>


I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.<br>


<br>


I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU.<br>


For example, a run on 2 GPUs with 2 MPI processes is about 40% faster than the same calculation on 2 GPUs with 16 MPI processes.<br>


<br>


I would assume the natural MPI/GPU affinity is 1-1. However, the rest of my application can benefit from multiple MPI processes driving each GPU via NVIDIA MPS. I am therefore trying to understand whether this is expected, whether I am missing something in the
 initialization/setup, or whether my best choice is to constrain MPI/GPU access to 1-1 specifically for the PETSc linear solver step. I could not find explicit information about this in the manual.<br>


<br>


Is there any user or maintainer who can tell me more about this use case?<br>


 </div>


<div id="x_Signature">


<div style="text-align: left; text-indent: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt;">


Best Regards,<br>


Gabriele Penazzi</div>


</div>


</blockquote>


<div><br>


</div>


</body>


</html>