Let me try to understand your setup.

You have two physical GPUs and a CPU with at least 16 physical cores?

You run with 16 MPI processes, each using its own "virtual" GPU (via MPS). Thus, a single physical GPU is shared by 8 MPI processes?

What happens if you run with 4 MPI processes, compared with 2?

Can you run with -log_view and send the output when using 2, 4, and 8 MPI processes?

Barry

On Jan 19, 2026, at 5:52 AM, Gabriele Penazzi via petsc-users <petsc-users@mcs.anl.gov> wrote:

> Hi.
>
> I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.
>
> I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU. For example, a run on 2 GPUs with 2 MPI processes is about 40% faster than the same calculation on 2 GPUs with 16 MPI processes.
>
> I would assume the natural MPI-to-GPU affinity is 1:1; however, the rest of my application benefits from multiple MPI processes driving each GPU via NVIDIA MPS. I am therefore trying to understand whether this slowdown is expected, whether I am missing something in the initialization/setup, or whether my best option is to constrain MPI/GPU access to 1:1, at least for the PETSc linear solver step. I could not find explicit information about this in the manual.
>
> Is there any user or maintainer who can tell me more about this use case?
>
> Best Regards,
> Gabriele Penazzi
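
[Editor's note: for readers who want to reproduce the comparison discussed above, below is a minimal standalone sketch, not code from this thread. It solves a 1D Laplacian (symmetric positive definite, so CG applies) with PETSc's CUDA-backed matrix type; the binary name and problem size are placeholders, and a real application's assembly and options will differ. Built against a CUDA-enabled PETSc, it can be run as, e.g.,

    mpiexec -n 2 ./cg_test -log_view

and again with -n 4 and -n 8 under MPS, so the per-event GPU times in the -log_view output can be compared across process counts.]

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt n = 1000, Istart, Iend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));

  /* GPU-resident matrix; the type can also be chosen at run time with
     -mat_type aijcusparse (the vectors inherit a compatible CUDA type
     via MatCreateVecs below) */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetType(A, MATAIJCUSPARSE));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));

  /* Assemble the 1D Laplacian stencil [-1 2 -1] over the local rows */
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (PetscInt i = Istart; i < Iend; i++) {
    if (i > 0) PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  /* CG solve; -log_view itself is processed by PetscInitialize/Finalize */
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPSetFromOptions(ksp)); /* honors -ksp_rtol, -ksp_monitor, ... */
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}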