<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">

Hi.<br>

<br>

I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and with multiple MPI processes.<br>

<br>

I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU.<br>

For example, a run on 2 GPUs with 2 MPI processes is about 40% faster than the same calculation on 2 GPUs with 16 MPI processes.<br>

<br>

I would assume the natural MPI/GPU affinity is 1-1. However, the rest of my application can benefit from multiple MPI processes driving each GPU via NVIDIA MPS. I am therefore trying to understand whether this slowdown is expected, whether I am missing something in the initialization/setup, or whether my best option is to enforce 1-1 MPI/GPU access specifically for the PETSc linear solver step. I could not find explicit information about this in the manual.<br>

<br>

Is there any user or maintainer who can tell me more about this use case?<br>

 </div>

<div class="elementToProof" id="Signature">

<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);" class="elementToProof">

Best Regards,<br>

Gabriele Penazzi</div>


</div>

</body>

</html>