<div dir="ltr"><div>Hi, Wenbo,</div><div> I think your approach should work. But before going this extra step with gpu_comm, have you tried to map multiple MPI ranks (CPUs) to one GPU, using nvidia's multiple process service (MPS)? If MPS works well, then you can avoid the extra complexity. </div><div><br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <<a href="mailto:zhaowenbo.npic@gmail.com">zhaowenbo.npic@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Dear all,<div dir="auto"><br></div><div dir="auto">We are trying to solve ksp using GPUs.</div><div dir="auto">We found the example, src/ksp/ksp/tutorials/bench_kspsolve.c, in which the matrix is created and assembling using COO way provided by PETSc. In this example, the number of CPU is as same as the number of GPU.</div><div dir="auto">In our case, computation of the parameters of matrix is performed on CPUs. And the cost of it is expensive, which might take half of total time or even more. </div><div dir="auto"><br></div><div dir="auto"> We want to use more CPUs to compute parameters in parallel. And a smaller communication domain (such as gpu_comm) for the CPUs corresponding to the GPUs is created. The parameters are computed by all of the CPUs (in MPI_COMM_WORLD). Then, the parameters are send to gpu_comm related CPUs via MPI. Matrix (type of aijcusparse) is then created and assembled within gpu_comm. Finally, ksp_solve is performed on GPUs.</div><div dir="auto"><br></div><div dir="auto">I’m not sure if this approach will work in practice. Are there any comparable examples I can look to for guidance?</div><div dir="auto"><br></div><div dir="auto">Best,</div><div dir="auto">Wenbo</div></div>
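And a sketch of the assembly and solve on gpu_comm, mirroring the COO interface that bench_kspsolve.c uses. Here nrows_local, ncoo, coo_i, coo_j, and coo_v are assumed to be the data gathered in the previous sketch, and PetscInitialize() is assumed to have been called on all ranks of MPI_COMM_WORLD.

#include <petscksp.h>

/* Called on every rank after the gather; gpu_comm is MPI_COMM_NULL on worker ranks,
   so only the GPU-driving ranks enter the PETSc collective calls. */
static PetscErrorCode AssembleAndSolve(MPI_Comm gpu_comm, PetscInt nrows_local,
                                       PetscCount ncoo, PetscInt *coo_i,
                                       PetscInt *coo_j, PetscScalar *coo_v)
{
  PetscFunctionBeginUser;
  if (gpu_comm != MPI_COMM_NULL) {
    Mat A;
    KSP ksp;
    Vec x, b;

    PetscCall(MatCreate(gpu_comm, &A));
    PetscCall(MatSetSizes(A, nrows_local, nrows_local, PETSC_DECIDE, PETSC_DECIDE));
    PetscCall(MatSetType(A, MATAIJCUSPARSE));

    /* COO assembly: set the sparsity pattern once, then reset the values
       each time the coefficients change. */
    PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j));
    PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));

    PetscCall(MatCreateVecs(A, &x, &b));
    /* ... fill b here (e.g. VecSetValues + VecAssemblyBegin/End) ... */

    PetscCall(KSPCreate(gpu_comm, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetFromOptions(ksp)); /* choose the solver via -ksp_type etc. */
    PetscCall(KSPSolve(ksp, b, x));

    PetscCall(KSPDestroy(&ksp));
    PetscCall(VecDestroy(&x));
    PetscCall(VecDestroy(&b));
    PetscCall(MatDestroy(&A));
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}

Ranks outside gpu_comm simply skip the guarded block, so they never enter the collective calls on gpu_comm and are free to start computing the next set of coefficients while the solve runs.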