[petsc-users] Fwd: Fw: gpu cpu parallel

Junchao Zhang junchao.zhang at gmail.com
Thu Nov 13 21:15:35 CST 2025


Glad to hear it works!

--Junchao Zhang

---------- Forwarded message ---------
From: Grace <amarantos at 126.com>
Date: Thu, Nov 13, 2025 at 9:05 PM
Subject: Fw: [petsc-users] gpu cpu parallel
To: junchao.zhang at gmail.com <junchao.zhang at gmail.com>



Hello, Junchao,

Thank you for your prompt help and the detailed solution.

We have tested the approach you suggested, using the set_gpu_device wrapper
script to control GPU visibility via CUDA_VISIBLE_DEVICES. It works
perfectly and now maps the ranks to the intended GPUs.

We really appreciate your guidance in resolving this issue.

Best regards,
Grace Gao
---- Forwarded Message ----
From: Grant Chao <grantchao2018 at 163.com>
Date: 11/14/2025 08:40
To: amarantos at 126.com
Subject: Fw: Re: Re: [petsc-users] gpu cpu parallel

--
sent by my netease email phone version



-------- Forward mail content --------
From: "Junchao Zhang" <junchao.zhang at gmail.com>
Date: 2025-11-14 07:02:20
To: "Grant Chao" <grantchao2018 at 163.com>
CC: "Barry Smith" <bsmith at petsc.dev>,petsc-users <petsc-users at mcs.anl.gov>
Subject: Re: Re: [petsc-users] gpu cpu parallel
Hi, Grant,
  I could reproduce the issue with your code.  I think the PETSc code has a
problem, and I created an issue at
https://gitlab.com/petsc/petsc/-/issues/1826 .  Though we should fix it (not
sure how for now), I think a much simpler approach is to use
CUDA_VISIBLE_DEVICES. For example, if you just want ranks 0 and 4 to use GPUs
0 and 1 respectively, you can just delete these lines from your example:
if (global_rank == 0) {
  cudaSetDevice(0);
} else if (global_rank == 4) {
  cudaSetDevice(1);
}

Then, instead, just make GPU 0 visible to rank 0 and GPU 1 to rank 4 up
front:

$ cat set_gpu_device
#!/bin/bash
# set_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/2)))
exec "$@"

$ mpirun -n 8 ./set_gpu_device  ./ex0
[Rank 5] no computation assigned.
[Rank 6] no computation assigned.
[Rank 7] no computation assigned.
[Rank 0] using GPU 0, [line 23].
[Rank 0] using GPU 0, [line 32] after setdevice.
[Rank 1] no computation assigned.
[Rank 2] no computation assigned.
[Rank 3] no computation assigned.
[Rank 4] using GPU 0, [line 23].
[Rank 4] using GPU 0, [line 32] after setdevice.
[Rank 0] using GPU 0, [line 42] after create A.
[Rank 4] using GPU 0, [line 42] after create A.
[Rank 4] using GPU 0, [line 46] after set A type.
[Rank 0] using GPU 0, [line 46] after set A type.
[Rank 0] using GPU 0, [line 50] after MatSetUp.
[Rank 4] using GPU 0, [line 50] after MatSetUp.
[Rank 0] using GPU 0, [line 63] after Mat Assemble.
[Rank 4] using GPU 0, [line 63] after Mat Assemble.
Smallest eigenvalue = 100.000000
Smallest eigenvalue = 100.000000

Note that for rank 4, "GPU 0" is actually the physical GPU 1.
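A quick way to sanity-check the wrapper's arithmetic without launching MPI is a dry run that fakes the OpenMPI-provided variables (the gpu_for_rank helper below is hypothetical, introduced just for illustration):

```shell
# Dry run of the set_gpu_device arithmetic: 8 local ranks, 2 GPUs.
# The divisor is OMPI_COMM_WORLD_LOCAL_SIZE/2 = 4, so ranks 0-3 get physical
# GPU 0 and ranks 4-7 get physical GPU 1, each visible as logical device 0.
gpu_for_rank() {
  local rank=$1 local_size=$2 ngpus=$3
  echo $(( rank / (local_size / ngpus) ))
}
for rank in 0 1 2 3 4 5 6 7; do
  echo "local rank $rank -> CUDA_VISIBLE_DEVICES=$(gpu_for_rank "$rank" 8 2)"
done
```

Because each rank then sees exactly one device, cudaGetDevice() inside the
application reports device 0 on every rank, which is why rank 4 prints
GPU 0 in the log above while actually running on physical GPU 1.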

Let me know if it works.
--Junchao Zhang


On Thu, Nov 13, 2025 at 11:17 AM Grant Chao <grantchao2018 at 163.com> wrote:

> Junchao,
> We have tried cudaSetDevice.
> The test code is attached. 8 CPU ranks and 2 GPUs are used. And we create a
> gpu_comm including ranks 0 and 4.
> Then we set GPU 0 to rank 0 and GPU 1 to rank 4, respectively.
> After MatSetType, rank 4 is mapped to GPU 0 again.
>
> The run command is
>     mpirun -n 8 ./a.out -eps_type jd -st_ksp_type gmres -st_pc_type none
>
> The stdout is shown below:
> [Rank 0] using GPU 0, [line 22].
> [Rank 1] no computation assigned.
> [Rank 2] no computation assigned.
> [Rank 3] no computation assigned.
> [Rank 4] using GPU 0, [line 22].
> [Rank 5] no computation assigned.
> [Rank 6] no computation assigned.
> [Rank 7] no computation assigned.
> [Rank 4] using GPU 1, [line 31] after setdevice.   -------- Here set
> device successfully
> [Rank 0] using GPU 0, [line 31] after setdevice.
> [Rank 4] using GPU 1, [line 41] after create A.
> [Rank 0] using GPU 0, [line 41] after create A.
> [Rank 0] using GPU 0, [line 45] after set A type.
> [Rank 4] using GPU 0, [line 45] after set A type.      ------ change to 0?
> [Rank 4] using GPU 0, [line 49] after MatSetUp.
> [Rank 0] using GPU 0, [line 49] after MatSetUp.
> [Rank 4] using GPU 0, [line 62] after Mat Assemble.
> [Rank 0] using GPU 0, [line 62] after Mat Assemble.
> Smallest eigenvalue = 100.000000
> Smallest eigenvalue = 100.000000
>
> BEST,
> Grant
>
>
>
>
> At 2025-11-13 05:58:05, "Junchao Zhang" <junchao.zhang at gmail.com> wrote:
>
> A common approach is to use CUDA_VISIBLE_DEVICES to control the mapping of
> MPI ranks to GPUs; see the GPU-nodes section at
> https://docs.nersc.gov/jobs/affinity/#gpu-nodes
>
> With OpenMPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of
> SLURM_LOCALID (see
> https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html).
> For example, with 8 MPI ranks and 4 GPUs per node, the following script
> will map ranks 0 and 1 to GPU 0, ranks 2 and 3 to GPU 1, and so on.
>
> #!/bin/bash
> # select_gpu_device wrapper script
> export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))
> exec "$@"
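> (As a hypothetical dry run, faking OMPI_COMM_WORLD_LOCAL_RANK by hand shows
> the mapping this arithmetic produces for 8 ranks and 4 GPUs:)

```shell
# Preview the 8-ranks/4-GPUs mapping by substituting each local rank value.
# 8/4 = 2 ranks share each GPU: 0,1 -> GPU 0; 2,3 -> GPU 1; 4,5 -> GPU 2; 6,7 -> GPU 3.
for OMPI_COMM_WORLD_LOCAL_RANK in 0 1 2 3 4 5 6 7; do
  OMPI_COMM_WORLD_LOCAL_SIZE=8
  echo "rank $OMPI_COMM_WORLD_LOCAL_RANK -> GPU $(( OMPI_COMM_WORLD_LOCAL_RANK / (OMPI_COMM_WORLD_LOCAL_SIZE / 4) ))"
done
```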
>
> On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>> On Nov 12, 2025, at 2:31 AM, Grant Chao <grantchao2018 at 163.com> wrote:
>>
>>
>> Thank you for the suggestion.
>>
>> We have already tried running multiple CPU ranks with a single GPU.
>> However, we observed that as the number of ranks increases, the EPS solver
>> becomes significantly slower. We are not sure of the exact cause—could it
>> be due to process access contention, hidden data transfers, or perhaps
>> another reason? We would be very interested to hear your insight on this
>> matter.
>>
>> To avoid this problem, we used the gpu_comm approach mentioned before.
>> During testing, we noticed that the mapping between rank ID and GPU ID
>> seems to be set automatically and is not user-specifiable.
>>
>> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds
>> ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>>
>>
>>
>>
>> We tested possible solutions, such as calling cudaSetDevice() manually to
>> set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4
>> still used GPU 0.
>>
>> We would appreciate your guidance on how to customize this mapping. Thank
>> you for your support.
>>
>>
>>   So you have a single compute "node" connected to multiple GPUs?  Then
>> the mapping of MPI ranks to GPUs doesn't matter and changing it won't
>> improve the performance.
>>
>
>> However, we observed that as the number of ranks increases, the EPS
>> solver becomes significantly slower.
>>
>>
>>   Does the number of EPS "iterations" increase? Run with one, two, four,
>> and eight MPI ranks (and the same number of "GPUs"; if you only have, say,
>> four GPUs that is fine, just virtualize them so two different MPI ranks
>> share one), with the option -log_view, and send the output. We need to know
>> what is slowing down before trying to find any cure.
>>
>>   Barry
>>
>>
>>
>>
>>
>> Best wishes,
>> Grant
>>
>>
>> At 2025-11-12 11:48:47, "Junchao Zhang" <junchao.zhang at gmail.com> wrote:
>>
>> Hi, Wenbo,
>>    I think your approach should work.  But before going this extra step
>> with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU
>> using NVIDIA's Multi-Process Service (MPS)?  If MPS works well, then
>> you can avoid the extra complexity.
>>
>> --Junchao Zhang
>>
>>
>> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <zhaowenbo.npic at gmail.com>
>> wrote:
>>
>>> Dear all,
>>>
>>> We are trying to solve linear systems with KSP using GPUs.
>>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which
>>> the matrix is created and assembled using the COO interface provided by
>>> PETSc. In this example, the number of CPU ranks is the same as the number
>>> of GPUs.
>>> In our case, computing the matrix parameters is done on CPUs, and its
>>> cost is high; it might take half of the total time or even more.
>>>
>>>  We want to use more CPUs to compute the parameters in parallel, so a
>>> smaller communicator (such as gpu_comm) for the CPU ranks corresponding
>>> to the GPUs is created. The parameters are computed by all of the ranks
>>> (in MPI_COMM_WORLD) and then sent to the gpu_comm ranks via MPI. The
>>> matrix (of type aijcusparse) is then created and assembled within
>>> gpu_comm. Finally, KSPSolve is performed on the GPUs.
>>>
>>> I’m not sure if this approach will work in practice. Are there any
>>> comparable examples I can look to for guidance?
>>>
>>> Best,
>>> Wenbo
>>>
>>
>>