[petsc-users] gpu cpu parallel

Junchao Zhang junchao.zhang at gmail.com
Wed Nov 12 15:58:05 CST 2025


A common approach is to use CUDA_VISIBLE_DEVICES to control the mapping of
MPI ranks to GPUs; see the GPU nodes section at
https://docs.nersc.gov/jobs/affinity/#gpu-nodes

With Open MPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of
SLURM_LOCALID (see
https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html). For
example, with 8 MPI ranks and 4 GPUs per node, the following script will
map ranks 0 and 1 to GPU 0, ranks 2 and 3 to GPU 1, and so on.

#!/bin/bash
# select_gpu_device wrapper script: map blocks of consecutive local ranks
# to the same GPU (the divisor 4 is the number of GPUs per node)
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))
exec "$@"
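
With the wrapper saved as select_gpu_device and made executable, a run
might look like the following (the application name is a placeholder):

mpirun -n 8 ./select_gpu_device ./my_petsc_app -log_view

Each rank then sees exactly one GPU, which CUDA renumbers as device 0
inside that process. Under Slurm, substitute SLURM_LOCALID for
OMPI_COMM_WORLD_LOCAL_RANK as noted above.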

On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Nov 12, 2025, at 2:31 AM, Grant Chao <grantchao2018 at 163.com> wrote:
>
>
> Thank you for the suggestion.
>
> We have already tried running multiple CPU ranks with a single GPU.
> However, we observed that as the number of ranks increases, the EPS solver
> becomes significantly slower. We are not sure of the exact cause—could it
> be due to process access contention, hidden data transfers, or perhaps
> another reason? We would be very interested to hear your insight on this
> matter.
>
> To avoid this problem, we used the gpu_comm approach mentioned before.
> During testing, we noticed that the mapping between rank ID and GPU ID
> seems to be set automatically and is not user-specifiable.
>
> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds
> ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>
> We tested possible solutions, such as calling cudaSetDevice() manually to
> set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4
> still used GPU 0.
>
> We would appreciate your guidance on how to customize this mapping. Thank
> you for your support.
>
>
>   So you have a single compute "node" connected to multiple GPUs?  Then
> the mapping of MPI ranks to GPUs doesn't matter and changing it won't
> improve the performance.
>

> However, we observed that as the number of ranks increases, the EPS solver
> becomes significantly slower.
>
>
>   Does the number of EPS "iterations" increase? Run with one, two, four,
> and eight MPI ranks (and the same number of "GPUs"; if you only have, say,
> four GPUs, that is fine: just virtualize them so two different MPI ranks
> share one), with the option -log_view, and send the output. We need to
> know what is slowing down before trying to find any cure.
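>
>   For example, a minimal scaling sweep might look like this (the
> application name is a placeholder):
>
>     mpiexec -n 1 ./my_eps_app -log_view
>     mpiexec -n 2 ./my_eps_app -log_view
>     mpiexec -n 4 ./my_eps_app -log_view
>     mpiexec -n 8 ./my_eps_app -log_view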
>
>   Barry
>
> Best wishes,
> Grant
>
>
> At 2025-11-12 11:48:47, "Junchao Zhang" <junchao.zhang at gmail.com>, said:
>
> Hi, Wenbo,
>    I think your approach should work.  But before taking this extra step
> with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU
> using NVIDIA's Multi-Process Service (MPS)?  If MPS works well, then
> you can avoid the extra complexity.
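>
> A minimal way to enable MPS on a node (default daemon paths assumed):
>
>     nvidia-cuda-mps-control -d    # start the MPS control daemon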
>
> --Junchao Zhang
>
>
> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <zhaowenbo.npic at gmail.com>
> wrote:
>
>> Dear all,
>>
>> We are trying to solve linear systems with KSP using GPUs.
>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which
>> the matrix is created and assembled using the COO interface provided by
>> PETSc. In this example, the number of CPUs is the same as the number of
>> GPUs.
>> In our case, the computation of the matrix parameters is performed on
>> CPUs, and it is expensive: it might take half of the total time or even
>> more.
>>
>> We want to use more CPUs to compute the parameters in parallel, so we
>> create a smaller communicator (gpu_comm) for the CPUs attached to the
>> GPUs. The parameters are computed by all of the CPUs (in MPI_COMM_WORLD)
>> and then sent to the gpu_comm ranks via MPI. The matrix (of type
>> aijcusparse) is then created and assembled within gpu_comm, and finally
>> KSPSolve is performed on the GPUs, as sketched below.
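>>
>> A minimal sketch of this splitting (which ranks join gpu_comm, and the
>> name nGPUs, are illustrative only):
>>
>> #include <petscksp.h>
>> int main(int argc, char **argv)
>> {
>>   MPI_Comm    gpu_comm = MPI_COMM_NULL;
>>   PetscMPIInt rank;
>>   PetscInt    nGPUs = 4; /* GPUs per node, illustrative */
>>
>>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>   PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
>>   /* ranks with color 0 form gpu_comm; the rest get MPI_COMM_NULL */
>>   int color = (rank < nGPUs) ? 0 : MPI_UNDEFINED;
>>   PetscCallMPI(MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &gpu_comm));
>>
>>   /* ... all ranks compute matrix parameters here, then send them to
>>      their designated gpu_comm rank with plain MPI point-to-point ... */
>>
>>   if (gpu_comm != MPI_COMM_NULL) {
>>     Mat A;
>>     PetscCall(MatCreate(gpu_comm, &A));
>>     PetscCall(MatSetType(A, MATAIJCUSPARSE));
>>     /* ... MatSetSizes, MatSetPreallocationCOO, MatSetValuesCOO,
>>        then build the KSP on gpu_comm and call KSPSolve ... */
>>     PetscCall(MatDestroy(&A));
>>     PetscCallMPI(MPI_Comm_free(&gpu_comm));
>>   }
>>   PetscCall(PetscFinalize());
>>   return 0;
>> }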
>>
>> I’m not sure if this approach will work in practice. Are there any
>> comparable examples I can look to for guidance?
>>
>> Best,
>> Wenbo
>>
>
>