[petsc-users] overlap cpu and gpu?
nicola varini
nicola.varini at gmail.com
Tue Aug 4 05:46:21 CDT 2020
Thanks for your reply, Stefano. I know that HYPRE is not ported to the GPU,
but the solver is running on the GPU: it takes ~9s and shows 100% GPU
utilization.
On Tue, Aug 4, 2020 at 12:35 PM Stefano Zampini <
stefano.zampini at gmail.com> wrote:
> Nicola,
>
> You are actually not using the GPU properly, since you use HYPRE
> preconditioning, which is CPU only. One of your solvers is actually slower
> on “GPU”.
> For fully GPU AMG, you can use PCGAMG with Chebyshev smoothers and
> Jacobi preconditioning. Mark can help you out with the specific command
> line options.
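> A typical option set (a sketch, assuming a CUDA-enabled PETSc build and a
> DMDA-created operator; exact option names can differ between PETSc
> versions) might look like:
>
>   -dm_vec_type cuda -dm_mat_type aijcusparse \
>   -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
>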
> When it works properly, everything related to the PC application is
> offloaded to the GPU, and you should expect the well-known (and much
> advertised) 10x or greater speedup from GPUs during KSPSolve.
>
> Doing what you want to do is one of the last optimization steps of an
> already optimized code before entering production. Yours is not even
> optimized for proper GPU usage yet.
> Also, any specific reason why you are using dgmres and fgmres?
>
> PETSc has not been designed with multi-threading in mind. You can achieve
> “overlap” of the two solves by splitting the communicator. But then you
> need communications to let the two solutions talk to each other.
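> A minimal sketch of that splitting (hypothetical variable names, error
> checking omitted):
>
>   #include <petscksp.h>
>
>   MPI_Comm    subcomm;
>   PetscMPIInt rank;
>   MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>   int color = rank % 2;  /* 0: Poisson group, 1: Ampere group */
>   MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
>
>   KSP ksp;
>   KSPCreate(subcomm, &ksp);  /* each group owns its own solver */
>   /* ... set the Poisson or Ampere operator depending on color ... */
>   KSPSolve(ksp, b, x);
>
>   /* afterwards the two groups must exchange their solutions explicitly,
>      e.g. with MPI messages on PETSC_COMM_WORLD */
>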
>
> Thanks
> Stefano
>
>
> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.varini at gmail.com>
> wrote:
>
> Dear all, thanks for your replies. The reason I asked whether it is
> possible to overlap Poisson and Ampere is that they take roughly
> the same amount of time. Please find attached the profiling logs
> for CPU only and for GPU only.
> Of course it is possible to split the MPI communicator and run each solver
> on a different subcommunicator, but this would involve more communication.
> Has anyone ever tried to run 2 solvers with hyperthreading?
> Thanks
>
>
> On Sun, Aug 2, 2020 at 2:09 PM Mark Adams <mfadams at lbl.gov>
> wrote:
>
>> I suspect that the Poisson and Ampere's law solves are not coupled. You
>> might be able to duplicate the communicator and use two threads. You would
>> want to configure PETSc with thread safety and threads enabled, and I think
>> it could/should work, but this mode is never used by anyone.
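>> If you do want to experiment with that, the build and the two solves
>> might look roughly like this (a sketch with hypothetical solver names;
>> check the configure options for your PETSc version, and note that MPI
>> must be initialized with MPI_THREAD_MULTIPLE):
>>
>>   ./configure --with-threadsafety --with-openmp
>>
>>   MPI_Comm comm2;
>>   MPI_Comm_dup(PETSC_COMM_WORLD, &comm2);  /* second solver's communicator */
>>   /* ksp_poisson created on PETSC_COMM_WORLD, ksp_ampere on comm2 */
>>   #pragma omp parallel sections num_threads(2)
>>   {
>>     #pragma omp section
>>     KSPSolve(ksp_poisson, b1, x1);
>>     #pragma omp section
>>     KSPSolve(ksp_ampere, b2, x2);
>>   }
>>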
>>
>> That said, I would not recommend doing this unless you feel like playing
>> in computer science, as opposed to doing application science. In the
>> best-case scenario you get a speedup of 2x. That is a strict upper bound,
>> and you will never come close to it. Your hardware has some balance of
>> CPU to GPU processing rate, and your application has a balance of work
>> volume between your two solves. Both ratios have to be 1:1 to get close
>> to a 2x speedup. To be concrete, from what little I can guess about your
>> application, let's assume that the cost of each of the two solves is
>> about the same (e.g., Laplacians on your domain, the best-case scenario).
>> But GPU machines these days are configured with roughly 1-10% of their
>> capacity in the CPUs, which gives you an upper bound of about a 10%
>> speedup. That is noise. Upshot: unless you configure your hardware to
>> match this problem, and the two solves have the same cost, you will not
>> see anything close to a 2x speedup. Your time is better spent elsewhere.
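>> To put numbers on the capacity argument above (a back-of-the-envelope
>> sketch): if the CPU provides a fraction f of the node's total processing
>> rate, then with a perfectly matched work split the best concurrent time
>> is (1-f) times the GPU-only time, i.e. a speedup of 1/(1-f). For f = 0.1
>> that is ~1.11x; for f = 0.01 it is ~1.01x.
>>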
>>
>> Mark
>>
>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <jed at jedbrown.org> wrote:
>>
>>> You can use MPI and split the communicator so n-1 ranks create a DMDA
>>> for one part of your system and the other rank drives the GPU in the other
>>> part. They can all be part of the same coupled system on the full
>>> communicator, but PETSc doesn't currently support some ranks having their
>>> Vec arrays on GPU and others on host, so you'd be paying host-device
>>> transfer costs on each iteration (and that might swamp any performance
>>> benefit you would have gotten).
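>>> For that n-1/1 layout, the communicator split is just (a sketch):
>>>
>>>   PetscMPIInt size, rank;
>>>   MPI_Comm    subcomm;
>>>   MPI_Comm_size(PETSC_COMM_WORLD, &size);
>>>   MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>>>   int color = (rank < size - 1) ? 0 : 1;  /* last rank drives the GPU */
>>>   MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
>>>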
>>>
>>> In any case, be sure to think about the execution time of each part.
>>> Load balancing with matching time-to-solution for each part can be really
>>> hard.
>>>
>>>
>>> Barry Smith <bsmith at petsc.dev> writes:
>>>
>>> > Nicola,
>>> >
>>> > This is not really viable or practical at this time with PETSc. It is
>>> not impossible, but it requires careful coding with threads; another
>>> possibility is to use one half of the virtual GPUs for each solve, which
>>> is also not trivial. I would recommend first seeing what kind of
>>> performance you can get on the GPU for each type of solve and revisiting
>>> this idea in the future.
>>> >
>>> > Barry
>>> >
>>> >
>>> >
>>> >
>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.varini at gmail.com>
>>> wrote:
>>> >>
>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU
>>> >> with DMDA.
>>> >> I have a machine where each node has one P100 GPU and one Haswell CPU.
>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute
>>> >> the Poisson and Ampere equations at the same time, one on the CPU and
>>> >> the other on the GPU?
>>> >>
>>> >> Thanks
>>>