[petsc-users] overlap cpu and gpu?

Mark Adams mfadams at lbl.gov
Mon Aug 17 08:40:36 CDT 2020


On Mon, Aug 17, 2020 at 9:24 AM nicola varini <nicola.varini at gmail.com>
wrote:

> Hi Mark, I do confirm that hypre with boomeramg is working fine and is
> pretty fast.
>

Good, you can send me the -info (grep GAMG) output and I try to see what is
going on.


> However, none of the GAMG option works.
> Did anyone ever succeeded in usign hypre with petsc on gpu?
>

We have gotten Hypre to run on GPUs but it has been fragile. The
performance has been marginal (due to use of USM apparently), but it is
being worked on by the hypre team.

The cude tools are changing fast and I am guessing this is a different
version than what we have tested, perhaps. Maybe someone else can help with
this, but I know we use cuda 10.2 and you are using cuda tools 10.1.

And you do want to use the most up-to-date PETSc.


> I did manage to compile hypre on gpu but I do get the following error:
> =======
>           CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
> In file included from
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
>                  from
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>                  from
> /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>                  from
> /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1:
> error: unknown type name 'namespace'
>  namespace thrust
>  ^~~~~~~~~
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1:
> error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
>  {
>  ^
> In file included from
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
>                  from
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
>                  from
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>                  from
> /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>                  from
> /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10:
> fatal error: cstddef: No such file or directory
>  #include <cstddef>
>           ^~~~~~~~~
> compilation terminated.
>
> =======
> Nicola
>
> Il giorno ven 14 ago 2020 alle ore 20:13 Mark Adams <mfadams at lbl.gov> ha
> scritto:
>
>> You can try Hypre. If that fails then there is a problem with your system.
>>
>> And you can run with -info and grep on GAMG and send the output and I can
>> see if I see anything funny.
>>
>> If this is just a Lapacian with a stable discretization and not crazy
>> material parameters then stretched grids are about the only thing that can
>> hurt the solver.
>>
>> Do both of your solves fail in a similar way?
>>
>> On the CPU you can try this with large subdomains, preferably (in serial
>> ideally):
>> -ampere_mg_levels_ksp_type richardson
>> -ampere_mg_levels_pc_type sor
>>
>> And check that there are no unused options with -options_left. GAMG can
>> fail with bad eigen estimates, but these parameters look fine.
>>
>> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.varini at gmail.com>
>> wrote:
>>
>>> Dear Barry, yes it gives the same problems.
>>>
>>> Il giorno gio 13 ago 2020 alle ore 23:22 Barry Smith <bsmith at petsc.dev>
>>> ha scritto:
>>>
>>>>
>>>>    Does the same thing work (with GAMG) if you run on the same problem
>>>> on the same machine same number of MPI ranks but make a new PETSC_ARCH that
>>>> does NOT use the GPUs?
>>>>
>>>>    Barry
>>>>
>>>>    Ideally one gets almost identical convergence with CPUs or GPUs
>>>> (same problem, same machine) but a bug or numerically change "might" affect
>>>> this.
>>>>
>>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.varini at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Barry, you are right. The Cray argument checking is incorrect. It
>>>> does work with download-fblaslapack.
>>>> However it does fail to converge. Is there anything obviously wrong
>>>> with my petscrc?
>>>> Anything else am I missing?
>>>>
>>>> Thanks
>>>>
>>>> Il giorno gio 13 ago 2020 alle ore 03:17 Barry Smith <bsmith at petsc.dev>
>>>> ha scritto:
>>>>
>>>>>
>>>>>    The QR is always done on the CPU, we don't have generic calls to
>>>>> blas/lapack go to the GPU currently.
>>>>>
>>>>>    The error message is:
>>>>>
>>>>>    On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value
>>>>> (info = -7)
>>>>>
>>>>>    argument 7 is &LWORK which is defined by
>>>>>
>>>>>    PetscBLASInt   LWORK=N*bs;
>>>>>
>>>>>    and
>>>>>
>>>>>    N=nSAvec is the column block size of new P.
>>>>>
>>>>>    Presumably this is a huge run with many processes so using the
>>>>> debugger is not practical?
>>>>>
>>>>>    We need to see what these variables are
>>>>>
>>>>>     N, bs, nSAvec
>>>>>
>>>>>     perhaps nSAvec is zero which could easily upset LAPACK.
>>>>>
>>>>>     Crudest thing would be to just put a print statement in the code
>>>>> before the LAPACK call of if they are called many times add an error check
>>>>> like that
>>>>>     generates an error if any of these three values are 0 (or
>>>>> negative).
>>>>>
>>>>>    Barry
>>>>>
>>>>>
>>>>>     It is not impossible that the Cray argument checking is incorrect
>>>>> and the value passed in is fine. You can check this by using
>>>>> --download-fblaslapack and see if the same or some other error comes up.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> Can you reproduce this on the CPU?
>>>>> The QR factorization seems to be failing. That could be from bad data
>>>>> or a bad GPU QR.
>>>>>
>>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.varini at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear all, following the suggestions I did resubmit the simulation
>>>>>> with the petscrc below.
>>>>>> However I do get the following error:
>>>>>> ========
>>>>>>  7362 [592]PETSC ERROR: #1 formProl0() line 748 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>   7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>>>   7364 [339]PETSC ERROR: xGEQRF error
>>>>>>   7365 [339]PETSC ERROR: See
>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>>> shooting.
>>>>>>   7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>>>   7367 [339]PETSC ERROR:
>>>>>> /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a  named nid05083
>>>>>> by nvarini Wed Aug 12 10:06:15 2020
>>>>>>   7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn
>>>>>> --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1
>>>>>> --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1
>>>>>> --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0
>>>>>> --with-scalapa       ck=1 --with-cxx=CC --with-debugging=0
>>>>>> --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell
>>>>>> --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1
>>>>>> --with-cuda-c=nvcc --with-cxxlib-autodetect=0
>>>>>> --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include -
>>>>>> -with-cxx=CC
>>>>>> --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>>>   7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>   7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>   7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>   7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>   7375 [339]PETSC ERROR: #1 formProl0() line 748 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>   7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>   7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>   7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>   7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>   7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal
>>>>>> value (info = -7)
>>>>>>   7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>> ========
>>>>>>
>>>>>> I did try other pc_gamg_type but they fails as well.
>>>>>>
>>>>>>
>>>>>> #PETSc Option Table entries:
>>>>>> -ampere_dm_mat_type aijcusparse
>>>>>> -ampere_dm_vec_type cuda
>>>>>> -ampere_ksp_atol 1e-15
>>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>>> -ampere_ksp_rtol 1e-7
>>>>>> -ampere_ksp_type dgmres
>>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>>> -ampere_mg_levels_pc_type jacobi
>>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>>> -ampere_pc_gamg_square_graph 1
>>>>>> -ampere_pc_gamg_threshold 0.05
>>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>>> -ampere_pc_gamg_type agg
>>>>>> -ampere_pc_type gamg
>>>>>> -dm_mat_type aijcusparse
>>>>>> -dm_vec_type cuda
>>>>>> -log_view
>>>>>> -poisson_dm_mat_type aijcusparse
>>>>>> -poisson_dm_vec_type cuda
>>>>>> -poisson_ksp_atol 1e-15
>>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>>> -poisson_ksp_rtol 1e-7
>>>>>> -poisson_ksp_type dgmres
>>>>>> -poisson_log_view
>>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>>> -poisson_mg_levels_pc_type jacobi
>>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>>> -poisson_pc_gamg_square_graph 1
>>>>>> -poisson_pc_gamg_threshold 0.05
>>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>>> -poisson_pc_gamg_type agg
>>>>>> -poisson_pc_type gamg
>>>>>> -use_mat_nearnullspace true
>>>>>> #End of PETSc Option Table entries
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Nicola
>>>>>>
>>>>>> Il giorno mar 4 ago 2020 alle ore 17:57 Mark Adams <mfadams at lbl.gov>
>>>>>> ha scritto:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <
>>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>>
>>>>>>>> Nicola,
>>>>>>>>
>>>>>>>> You are actually not using the GPU properly, since you use HYPRE
>>>>>>>> preconditioning, which is CPU only. One of your solvers is actually slower
>>>>>>>> on “GPU”.
>>>>>>>> For a full AMG GPU, you can use PCGAMG, with cheby smoothers and
>>>>>>>> with Jacobi preconditioning. Mark can help you out with the specific
>>>>>>>> command line options.
>>>>>>>> When it works properly, everything related to PC application is
>>>>>>>> offloaded to the GPU, and you should expect to get the well-known and
>>>>>>>> branded 10x (maybe more) speedup one is expecting from GPUs during KSPSolve
>>>>>>>>
>>>>>>>>
>>>>>>> The speedup depends on the machine, but on SUMMIT, using enough CPUs
>>>>>>> to saturate the memory bus vs all 6 GPUs the speedup is a function of
>>>>>>> problem subdomain size. I saw 10x at about 100K equations/process.
>>>>>>>
>>>>>>>
>>>>>>>> Doing what you want to do is one of the last optimization steps of
>>>>>>>> an already optimized code before entering production. Yours is not even
>>>>>>>> optimized for proper GPU usage  yet.
>>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>>>>
>>>>>>>> PETSc has not been designed with multi-threading in mind. You can
>>>>>>>> achieve “overlap” of the two solves by splitting the communicator. But then
>>>>>>>> you need communications to let the two solutions talk to each other.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.varini at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Dear all, thanks for your replies. The reason why I've asked if it
>>>>>>>> is possible to overlap poisson and ampere is because they roughly
>>>>>>>> take the same amount of time. Please find in attachment the
>>>>>>>> profiling logs for only CPU  and only GPU.
>>>>>>>> Of course it is possible to split the MPI communicator and run each
>>>>>>>> solver on different subcommunicator, however this would involve more
>>>>>>>> communication.
>>>>>>>> Did anyone ever tried to run 2 solvers with hyperthreading?
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>> Il giorno dom 2 ago 2020 alle ore 14:09 Mark Adams <mfadams at lbl.gov>
>>>>>>>> ha scritto:
>>>>>>>>
>>>>>>>>> I suspect that the Poisson and Ampere's law solve are not coupled.
>>>>>>>>> You might be able to duplicate the communicator and use two threads. You
>>>>>>>>> would want to configure PETSc with threadsafty and threads and I think it
>>>>>>>>> could/should work, but this mode is never used by anyone.
>>>>>>>>>
>>>>>>>>> That said, I would not recommend doing this unless you feel like
>>>>>>>>> playing in computer science, as opposed to doing application science. The
>>>>>>>>> best case scenario you get a speedup of 2x. That is a strict upper bound,
>>>>>>>>> but you will never come close to it. Your hardware has some balance of CPU
>>>>>>>>> to GPU processing rate. Your application has a balance of volume of work
>>>>>>>>> for your two solves. They have to be the same to get close to 2x speedup
>>>>>>>>> and that ratio(s) has to be 1:1. To be concrete, from what little I can
>>>>>>>>> guess about your applications let's assume that the cost of each of these
>>>>>>>>> two solves is about the same (eg, Laplacians on your domain and the best
>>>>>>>>> case scenario). But, GPU machines are configured to have roughly 1-10% of
>>>>>>>>> capacity in the GPUs, these days, that gives you an upper bound of about
>>>>>>>>> 10% speedup. That is noise. Upshot, unless you configure your hardware to
>>>>>>>>> match this problem, and the two solves have the same cost, you will not see
>>>>>>>>> close to 2x speedup. Your time is better spent elsewhere.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <jed at jedbrown.org> wrote:
>>>>>>>>>
>>>>>>>>>> You can use MPI and split the communicator so n-1 ranks create a
>>>>>>>>>> DMDA for one part of your system and the other rank drives the GPU in the
>>>>>>>>>> other part.  They can all be part of the same coupled system on the full
>>>>>>>>>> communicator, but PETSc doesn't currently support some ranks having their
>>>>>>>>>> Vec arrays on GPU and others on host, so you'd be paying host-device
>>>>>>>>>> transfer costs on each iteration (and that might swamp any performance
>>>>>>>>>> benefit you would have gotten).
>>>>>>>>>>
>>>>>>>>>> In any case, be sure to think about the execution time of each
>>>>>>>>>> part.  Load balancing with matching time-to-solution for each part can be
>>>>>>>>>> really hard.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>>>>>>>>
>>>>>>>>>> >   Nicola,
>>>>>>>>>> >
>>>>>>>>>> >     This is really viable or practical at this time with PETSc.
>>>>>>>>>> It is not impossible but requires careful coding with threads, another
>>>>>>>>>> possibility is to use one half of the virtual GPUs for each solve, this is
>>>>>>>>>> also not trivial. I would recommend first seeing what kind of performance
>>>>>>>>>> you can get on the GPU for each type of solve and revist this idea in the
>>>>>>>>>> future.
>>>>>>>>>> >
>>>>>>>>>> >    Barry
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <
>>>>>>>>>> nicola.varini at gmail.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU
>>>>>>>>>> and GPU with DMDA.
>>>>>>>>>> >> I've a machine where each node has 1P100+1Haswell.
>>>>>>>>>> >> I've to resolve Poisson and Ampere equation for each time step.
>>>>>>>>>> >> I'm using 2D DMDA for each of them. Would be possible to
>>>>>>>>>> compute poisson
>>>>>>>>>> >> and ampere equation at the same time? One on CPU and the other
>>>>>>>>>> on GPU?
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks
>>>>>>>>>>
>>>>>>>>> <out_gpu><out_nogpu>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>> <out_miniapp_f_poisson>
>>>>
>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200817/402046cd/attachment-0001.html>


More information about the petsc-users mailing list