[petsc-users] gpu cpu parallel

Junchao Zhang junchao.zhang at gmail.com
Wed Nov 12 09:58:21 CST 2025


On Wed, Nov 12, 2025 at 1:31 AM Grant Chao <grantchao2018 at 163.com> wrote:

>
> Thank you for the suggestion.
>
> We have already tried running multiple CPU ranks with a single GPU.
> However, we observed that the EPS solver becomes significantly slower as
> the number of ranks increases. We are not sure of the exact cause: could
> it be process access contention, hidden data transfers, or something
> else? We would be very interested to hear your insight on this matter.
>
Have you started MPS? See
https://docs.nvidia.com/deploy/mps/index.html#starting-and-stopping-mps-on-linux
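For reference, a minimal start/stop sequence (a sketch assuming a Linux
node where your user owns the GPUs; see the page above for the details):

    # start the MPS control daemon, once per node
    nvidia-cuda-mps-control -d

    # stop the daemon when the runs are finished
    echo quit | nvidia-cuda-mps-control

With the daemon running, you launch mpirun as usual with more ranks than
GPUs, and MPS multiplexes the kernels of co-resident ranks onto each
device.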


>
> To avoid this problem, we used the gpu_comm approach mentioned before.
> During testing, we noticed that the mapping between rank ID and GPU ID
> seems to be set automatically and is not user-specifiable.
>
> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds
> ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>
Yes, that is the current round-robin algorithm. Do you want ranks 0 and 1
on GPU 0, ranks 2 and 3 on GPU 1, and so on?
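If so, one workaround (a sketch, not a supported PETSc option) is to choose
the device yourself before PETSc initializes, for example by setting
CUDA_VISIBLE_DEVICES per rank. That would also be consistent with the
cudaSetDevice() attempt below: once PETSc has attached a rank to a device
during initialization, a later cudaSetDevice() call likely has no effect.
A minimal sketch, where the variable ngpus and the 2-ranks-per-GPU layout
are assumptions for illustration:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <petsc.h>

    int main(int argc, char **argv)
    {
      MPI_Comm node;
      int      lrank, lsize;
      char     dev[8];

      MPI_Init(&argc, &argv); /* init MPI ourselves so we can pick a device first */
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
      MPI_Comm_rank(node, &lrank);
      MPI_Comm_size(node, &lsize);

      /* blocked mapping: with 8 ranks and 4 GPUs, ranks 0,1 -> GPU 0, ranks 2,3 -> GPU 1, ... */
      int ngpus = 4; /* assumption: 4 GPUs per node; one could query cudaGetDeviceCount() instead */
      snprintf(dev, sizeof(dev), "%d", lrank * ngpus / lsize);
      setenv("CUDA_VISIBLE_DEVICES", dev, 1); /* must happen before any CUDA context is created */

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL)); /* sees MPI already initialized */
      /* ... EPS/KSP work; each rank now sees only its assigned device ... */
      PetscCall(PetscFinalize());
      MPI_Comm_free(&node);
      MPI_Finalize();
      return 0;
    }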


> We tested possible solutions, such as calling cudaSetDevice() manually to
> set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4
> still used GPU 0.
>
> We would appreciate your guidance on how to customize this mapping. Thank
> you for your support.
>
> Best wishes,
> Grant
>
>
> On 2025-11-12 11:48:47, "Junchao Zhang" <junchao.zhang at gmail.com> wrote:
>
> Hi, Wenbo,
>    I think your approach should work. But before taking this extra step
> with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one
> GPU using NVIDIA's Multi-Process Service (MPS)? If MPS works well, then
> you can avoid the extra complexity.
>
> --Junchao Zhang
>
>
> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <zhaowenbo.npic at gmail.com>
> wrote:
>
>> Dear all,
>>
>> We are trying to solve linear systems with KSP on GPUs.
>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the
>> matrix is created and assembled using the COO interface provided by PETSc.
>> In this example, the number of CPU ranks is the same as the number of GPUs.
>> In our case, the computation of the matrix coefficients is performed on
>> CPUs, and that computation is expensive: it might take half of the total
>> time or even more.
>>
>> We want to use more CPUs to compute the coefficients in parallel, with a
>> smaller communicator (say, gpu_comm) created for the CPU ranks attached to
>> the GPUs. The coefficients are computed by all of the ranks in
>> MPI_COMM_WORLD and then sent to the gpu_comm ranks via MPI. The matrix (of
>> type aijcusparse) is then created and assembled within gpu_comm. Finally,
>> KSPSolve is performed on the GPUs.
>>
>> I’m not sure if this approach will work in practice. Are there any
>> comparable examples I can look to for guidance?
>>
>> Best,
>> Wenbo
>>
>
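Picking up Wenbo's description, here is a minimal sketch of that pipeline
under stated assumptions: ngpu_ranks, nloc, ncoo, and the coo_i/coo_j/coo_v
arrays are application-provided placeholders, and the COO assembly calls are
the same ones bench_kspsolve.c uses:

    /* split off the ranks that drive GPUs; the rest only compute coefficients */
    MPI_Comm    gpu_comm;
    PetscMPIInt wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    int color = (wrank < ngpu_ranks) ? 0 : MPI_UNDEFINED; /* e.g., one rank per GPU */
    MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &gpu_comm);

    /* all ranks compute their share of coefficients on the CPU here, then send
       them to their designated gpu_comm rank (e.g., with MPI_Gatherv) */

    if (gpu_comm != MPI_COMM_NULL) {
      Mat A;
      Vec x, b;
      KSP ksp;
      PetscCall(MatCreate(gpu_comm, &A));
      PetscCall(MatSetSizes(A, nloc, nloc, PETSC_DECIDE, PETSC_DECIDE));
      PetscCall(MatSetType(A, MATAIJCUSPARSE));
      PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j)); /* pattern: set once */
      PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));      /* values: repeat per step */
      PetscCall(MatCreateVecs(A, &x, &b));
      PetscCall(KSPCreate(gpu_comm, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetFromOptions(ksp));
      PetscCall(KSPSolve(ksp, b, x)); /* runs on the GPUs held by gpu_comm */
      PetscCall(KSPDestroy(&ksp));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(MatDestroy(&A));
      MPI_Comm_free(&gpu_comm);
    }

One caveat: calls on gpu_comm are collective only over gpu_comm, so the
ranks outside it must skip the solve and wait (e.g., at the next
coefficient exchange).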