[petsc-users] A series of GPU questions

Mark Adams mfadams at lbl.gov
Thu Jun 11 07:24:14 CDT 2020


>
>
>>
>> Would we instead just have 40 (or perhaps slightly fewer) MPI processes
>> all sharing the GPUs? Surely this would be inefficient, and would PETSc
>> distribute the work across all 4 GPUs, or would every process end up using
>> a single GPU?
>>
> See
> https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service.
>
>

I'll jump in here, but I would recommend not worrying about the number of
GPUs and MPI processes (and don't bother with OpenMP).

As the MPS link above shows, MPS is meant to allow slicing/scheduling the
GPU in space and/or time; that is, it is very flexible. One would hope that
this will keep improving and adapting to new hardware so that your code does
not have to.

I would focus on getting as much parallelism into your code as possible.
GPUs need a lot of threads to run well, and with DNS you may have a chance
to feed them properly, but I'd just try to get as much parallelism as you
can.
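
To be concrete, here is a minimal sketch (my own, not code from this
thread, and it assumes a PETSc build configured with CUDA): in PETSc the
CPU/GPU decision and the ranks-per-GPU decision are made at run time, not
in the source, so you can concentrate on exposing parallelism and then
experiment with the launch configuration.

static char help[] = "1D Laplacian solve; GPU use is chosen at run time.\n";

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       i, n = 1000, Istart, Iend;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;

  /* Assemble a 1D Laplacian. The Mat/Vec types (CPU or CUDA) come from the
     options database, so nothing here changes when ranks share a GPU. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* The solver is also configured from the command line (-ksp_type, -pc_type). */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

The same binary can then be launched with, say, 1 or 7 ranks per GPU and
with -vec_type cuda -mat_type aijcusparse to put the solve on the GPU
(with MPS enabled as described at the OLCF link above); the sharing of the
GPU happens entirely outside the source code.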


> In some cases, we did see better performance with multiple mpi ranks/GPU
> than 1 rank/GPU. The optimal configuration depends on the code. Think two
> extremes:  One code with work done all on GPU and the other all on CPU.
> Probably you only need 1 mpi rank/node for the former, but full ranks for
> the latter.
>

Another dimension: assuming all of the work is on the GPU, at least
asymptotically, then it's a matter of how much parallelism you have (OK,
it's not that simple ...). At one extreme you have one giant GPU, in which
case you probably want to use multiple ranks and hope MPS can slice the GPU
up in space to make it look like multiple GPUs of the right size for you.

Anecdotally, I have a kernel that is a solver in velocity space and sits in
a phase-space application (configuration space X and velocity space V) with
a tensor decomposition (so the X and V solves are not coupled). My V-space
solver is expensive (maybe like complex chemistry in DNS that is independent
of the spatial solver), and on smallish problems (less parallelism
available) I see a 5x increase in throughput going from 1 to 7 cores (MPI
ranks) per GPU (IBM/Nvidia, 42 cores and 6 GPUs per node), just running the
same problem, embarrassingly parallel. When I increased the problem work by
16x, I still got a 3x throughput speedup going from 1 to 7 cores.
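
To make the arithmetic explicit (my own back-of-envelope reading of the
numbers above, not additional measurements):

\[
\frac{5\times}{7\ \text{ranks}} \approx 71\%\ \text{efficiency (small problem)},
\qquad
\frac{3\times}{7\ \text{ranks}} \approx 43\%\ \text{efficiency (16x more work)}.
\]

That is, on the small problem a single rank leaves most of the GPU idle, so
MPS can hand that idle capacity to the other six ranks almost for free; once
each rank carries 16x more work, one rank already feeds the GPU reasonably
well and the headroom for oversubscription shrinks.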