<div dir="ltr"><div>Hi, Grant,</div><div> I could reproduce the issue with your code. I think petsc code has some problems and I created an issue at <a href="https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/issues/1826__;!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8roDoTvm$">https://gitlab.com/petsc/petsc/-/issues/1826</a>. Though we should fix it (not sure how for now), I think a much simpler approach is to use <span style="font-family:monospace">CUDA_VISIBLE_DEVICES. </span>For example, if you just want ranks 0, 4 to use GPUs 0, 1 respectively, you can just delete these lines in your example</div><div><span style="color:rgb(0,0,0);font-family:Menlo,Monaco,"Courier New",monospace;font-size:14px;white-space:pre"> </span><span style="font-family:Menlo,Monaco,"Courier New",monospace;font-size:14px;white-space:pre;color:rgb(0,0,255)">if</span><span style="color:rgb(0,0,0);font-family:Menlo,Monaco,"Courier New",monospace;font-size:14px;white-space:pre"> (global_rank == </span><span style="font-family:Menlo,Monaco,"Courier New",monospace;font-size:14px;white-space:pre;color:rgb(9,134,88)">0</span><span style="color:rgb(0,0,0);font-family:Menlo,Monaco,"Courier New",monospace;font-size:14px;white-space:pre">) {</span></div><div><div style="line-height:21px"><div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"> cudaSetDevice(<span style="color:rgb(9,134,88)">0</span>);</div><div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"> } <span style="color:rgb(0,0,255)">else</span> <span style="color:rgb(0,0,255)">if</span> (global_rank == <span style="color:rgb(9,134,88)">4</span>) {</div><div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"> cudaSetDevice(<span style="color:rgb(9,134,88)">1</span>);</div><div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"> }</div><div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"><br></div>Then, instead, just make GPUs 0, 1 visible to ranks 0, 4 respectively upfront, by<div style="color:rgb(0,0,0);white-space:pre;font-size:14px;font-family:Menlo,Monaco,"Courier New",monospace"><br></div><div style="color:rgb(0,0,0);white-space:pre"><font face="monospace" style="">$ cat set_gpu_device <br>#!/bin/bash<br># select_gpu_device wrapper script<br>export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/2)))<br>exec $*</font></div></div></div><div><font face="monospace"><br></font></div><div><font face="monospace">$ mpirun -n 8 ./set_gpu_device ./ex0 <br>[Rank 5] no computation assigned.<br>[Rank 6] no computation assigned.<br>[Rank 7] no computation assigned.<br>[Rank 0] using GPU 0, [line 23].<br>[Rank 0] using GPU 0, [line 32] after setdevice.<br>[Rank 1] no computation assigned.<br>[Rank 2] no computation assigned.<br>[Rank 3] no computation assigned.<br>[Rank 4] using GPU 0, [line 23].<br>[Rank 4] using GPU 0, [line 32] after setdevice.<br>[Rank 0] using GPU 0, [line 42] after create A.<br>[Rank 4] using GPU 0, [line 42] after create A.<br>[Rank 4] using GPU 0, [line 46] after set A type.<br>[Rank 0] using GPU 0, [line 46] after set A type.<br>[Rank 0] using GPU 0, [line 50] after MatSetUp.<br>[Rank 4] using GPU 0, [line 50] after MatSetUp.<br>[Rank 0] using GPU 0, [line 63] after Mat 
Assemble.<br>[Rank 4] using GPU 0, [line 63] after Mat Assemble.<br>Smallest eigenvalue = 100.000000<br>Smallest eigenvalue = 100.000000</font></div><div><span style="font-family:monospace"><br></span></div>Note that for rank 4, GPU 0 is actually the physical GPU 1, since the device numbering is relative to what CUDA_VISIBLE_DEVICES exposes to that rank.<div><br></div><div>Let me know if it works. <br><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Thu, Nov 13, 2025 at 11:17 AM Grant Chao <<a href="mailto:grantchao2018@163.com">grantchao2018@163.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="line-height:1.7;color:rgb(0,0,0);font-size:14px;font-family:Arial"><div id="m_-2444336133474881563spnEditorContent"><div style="margin:0px">Junchao,</div><div style="margin:0px">We have tried cudaSetDevice().</div><div style="margin:0px">The test code is attached. 8 CPU ranks and 2 GPUs are used, and we create a gpu_comm including rank 0 and rank 4.</div><div style="margin:0px">Then we assign GPU 0 to rank 0 and GPU 1 to rank 4, respectively.</div><div style="margin:0px">After MatSetType, rank 4 is mapped to GPU 0 again.</div><div style="margin:0px"><br></div><div style="margin:0px">The run command is </div><div style="margin:0px"> mpirun -n 8 ./a.out -eps_type jd -st_ksp_type gmres -st_pc_type none</div><div style="margin:0px"><br></div><div style="margin:0px">The stdout is shown below:</div><div style="margin:0px">[Rank 0] using GPU 0, [line 22].
</div><div style="margin:0px">[Rank 1] no computation assigned.
</div><div style="margin:0px">[Rank 2] no computation assigned.
</div><div style="margin:0px">[Rank 3] no computation assigned.
</div><div style="margin:0px">[Rank 4] using GPU 0, [line 22].
</div><div style="margin:0px">[Rank 5] no computation assigned.
</div><div style="margin:0px">[Rank 6] no computation assigned.
</div><div style="margin:0px">[Rank 7] no computation assigned.
</div><div style="margin:0px">[Rank 4] using GPU 1, [line 31] after setdevice. -------- Here set device successfully</div><div style="margin:0px">[Rank 0] using GPU 0, [line 31] after setdevice.
</div><div style="margin:0px">[Rank 4] using GPU 1, [line 41] after create A.
</div><div style="margin:0px">[Rank 0] using GPU 0, [line 41] after create A.
</div><div style="margin:0px">[Rank 0] using GPU 0, [line 45] after set A type.
</div><div style="margin:0px">[Rank 4] using GPU 0, [line 45] after set A type. ------ change to 0?</div><div style="margin:0px">[Rank 4] using GPU 0, [line 49] after MatSetUp.
</div><div style="margin:0px">[Rank 0] using GPU 0, [line 49] after MatSetUp.
</div><div style="margin:0px">[Rank 4] using GPU 0, [line 62] after Mat Assemble.
</div><div style="margin:0px">[Rank 0] using GPU 0, [line 62] after Mat Assemble.
</div><div style="margin:0px">Smallest eigenvalue = 100.000000
</div><div style="margin:0px">Smallest eigenvalue = 100.000000
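</div><div style="margin:0px"><br></div><div style="margin:0px">[For reference, the kind of per-rank check that produces the "[Rank N] using GPU M" lines above can be sketched as follows; this is illustrative, not the attached test code, and report_device is a made-up name.]</div><div style="margin:0px"><br></div><div style="white-space:pre"><font face="monospace">#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Print the CUDA device this MPI rank currently holds, plus its PCI bus id
   so the logical device number can be matched to a physical GPU. */
static void report_device(MPI_Comm comm, const char *where)
{
  int  rank, dev;
  char bus[32];
  MPI_Comm_rank(comm, &rank);
  if (cudaGetDevice(&dev) != cudaSuccess) return;
  printf("[Rank %d] using GPU %d, %s.\n", rank, dev, where);
  if (cudaDeviceGetPCIBusId(bus, sizeof(bus), dev) == cudaSuccess)
    printf("[Rank %d] logical GPU %d is PCI %s.\n", rank, dev, bus);
}</font></div><div style="margin:0px">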
</div><div style="margin:0px"><br></div><div style="margin:0px">BEST,</div><div style="margin:0px">Grant</div><div style="margin:0px"><br></div><div style="margin:0px"><br></div><div style="margin:0px"><br></div></div><div style="zoom:1"></div><div id="m_-2444336133474881563divNeteaseMailCard"></div><div style="margin:0px"><br></div><p>At 2025-11-13 05:58:05, "Junchao Zhang" <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:</p><blockquote id="m_-2444336133474881563isReplyContent" style="padding-left:1ex;margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)"><div dir="ltr"><div dir="ltr"><div>A common approach is to use <font face="monospace">CUDA_VISIBLE_DEVICES</font> to manipulate MPI ranks to GPUs mapping, see the section at <a href="https://urldefense.us/v3/__https://docs.nersc.gov/jobs/affinity/*gpu-nodes__;Iw!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8pWxGvch$" target="_blank">https://docs.nersc.gov/jobs/affinity/#gpu-nodes</a></div><br>With OpenMPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of SLURM_LOCALID (see <a href="https://urldefense.us/v3/__https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html__;!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8khuXtvj$" target="_blank">https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html</a>). For example, with 8 MPI ranks and 4 GPUs per node, the following script will map ranks 0, 1 to GPU 0, ranks 2, 3 to GPU 1.<div><br></div><font face="monospace">#!/bin/bash </font></div><div dir="ltr"><font face="monospace"># select_gpu_device wrapper script </font></div><div dir="ltr"><font face="monospace">export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))<br>exec $*</font><div><br></div></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br id="m_-2444336133474881563m_-8840965942260621007lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On Nov 12, 2025, at 2:31 AM, Grant Chao <<a href="mailto:grantchao2018@163.com" target="_blank">grantchao2018@163.com</a>> wrote:</div><br><div><div id="m_-2444336133474881563m_-8840965942260621007spnEditorSign_app"><br></div><div>Thank you for the suggestion.</div><div><br></div><div>We have already tried running multiple CPU ranks with a single GPU. However, we observed that as the number of ranks increases, the EPS solver becomes significantly slower. We are not sure of the exact cause—could it be due to process access contention, hidden data transfers, or perhaps another reason? We would be very interested to hear your insight on this matter.</div><div><br></div><div>To avoid this problem, we used the gpu_comm approach mentioned before. 
During testing, we noticed that the mapping between rank ID and GPU ID seems to be set automatically and is not user-specifiable.</div><div><br></div><div>For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.</div></div></blockquote><div><br></div> </div><div><blockquote type="cite"><div><div>We tested possible solutions, such as calling cudaSetDevice() manually to set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4 still used GPU 0.</div><div><br></div><div>We would appreciate your guidance on how to customize this mapping. Thank you for your support.</div></div></blockquote><div><br></div><div> So you have a single compute "node" connected to multiple GPUs? Then the mapping of MPI ranks to GPUs doesn't matter, and changing it won't improve the performance. </div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div><br></div><div><blockquote type="cite">However, we observed that as the number of ranks increases, the EPS solver becomes significantly slower.</blockquote><br></div><div> Does the number of EPS "iterations" increase? Run with one, two, four, and eight MPI ranks (and the same number of "GPUs"; if you only have, say, four GPUs, that is fine, just virtualize them so two different MPI ranks share one) with the option -log_view, and send the output. We need to know what is slowing down before trying to find any cure.</div><div><br></div><div> Barry</div><div><br></div><div><br></div><div><br></div><br><blockquote type="cite"><div><div><br></div><div>Best wishes,</div><div>Grant</div><br><br>At 2025-11-12 11:48:47, "Junchao Zhang" <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote: <br> <blockquote id="m_-2444336133474881563m_-8840965942260621007isReplyContent" style="padding-left:1ex;margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204)"><div dir="ltr"><div>Hi, Wenbo,</div><div> I think your approach should work. But before going this extra step with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU using NVIDIA's Multi-Process Service (MPS)? If MPS works well, then you can avoid the extra complexity. </div><div><br></div><div><div dir="ltr" class="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <<a href="mailto:zhaowenbo.npic@gmail.com" target="_blank">zhaowenbo.npic@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Dear all,<div dir="auto"><br></div><div dir="auto">We are trying to solve KSP systems using GPUs.</div><div dir="auto">We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the matrix is created and assembled using the COO interface provided by PETSc. In this example, the number of CPUs is the same as the number of GPUs.</div><div dir="auto">In our case, computation of the matrix parameters is performed on CPUs, and it is expensive; it might take half of the total time or even more. </div><div dir="auto"><br></div><div dir="auto"> We want to use more CPUs to compute the parameters in parallel, and we create a smaller communicator (such as gpu_comm) for the CPUs corresponding to the GPUs. 
The parameters are computed by all of the CPUs (in MPI_COMM_WORLD). Then the parameters are sent to the gpu_comm CPUs via MPI. The matrix (of type aijcusparse) is then created and assembled within gpu_comm. Finally, KSPSolve is performed on the GPUs.</div><div dir="auto"><br></div><div dir="auto">I’m not sure if this approach will work in practice. Are there any comparable examples I can look to for guidance?</div><div dir="auto"><br></div><div dir="auto">Best,</div><div dir="auto">Wenbo</div></div>
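<div dir="auto"><br></div><div dir="auto">[For illustration only, a minimal sketch of the gpu_comm approach described above, assuming 8 ranks with ranks 0 and 4 driving the two GPUs; the names gpu_comm and on_gpu and the size n are made up for the sketch, and error checking is abbreviated.]</div><div dir="auto"><br></div><div dir="auto" style="white-space:pre"><font face="monospace">/* Inside a program that has already called PetscInitialize(). */
PetscInt    n = 100;                    /* illustrative global matrix size */
PetscMPIInt world_rank;
MPI_Comm    gpu_comm;
MPI_Comm_rank(PETSC_COMM_WORLD, &world_rank);
int on_gpu = (world_rank % 4 == 0);     /* ranks 0 and 4 drive the GPUs */
MPI_Comm_split(PETSC_COMM_WORLD, on_gpu ? 0 : MPI_UNDEFINED, world_rank, &gpu_comm);

if (gpu_comm != MPI_COMM_NULL) {        /* only the GPU-driving ranks */
  Mat A;
  PetscCall(MatCreate(gpu_comm, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetType(A, MATAIJCUSPARSE));
  /* receive the coefficients computed by the other ranks via MPI,
     assemble A, then create the KSP on gpu_comm and solve */
}</font></div>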
</blockquote></div>
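<div><br></div><div>[For reference, a minimal sketch of the PETSc COO assembly calls referred to above (as used in bench_kspsolve.c); the wrapper name assemble_coo and its arguments are illustrative.]</div><div><br></div><div style="white-space:pre"><font face="monospace">/* Assemble an already-created matrix (e.g. MATAIJCUSPARSE) from COO triplets. */
static PetscErrorCode assemble_coo(Mat A, PetscCount ncoo, PetscInt coo_i[],
                                   PetscInt coo_j[], const PetscScalar coo_v[])
{
  PetscFunctionBeginUser;
  PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j)); /* set the pattern once */
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));      /* upload the values */
  PetscFunctionReturn(PETSC_SUCCESS);
}</font></div>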
</blockquote></div></blockquote></div><br></div></blockquote></div></div>
</blockquote></div></blockquote></div>