[petsc-users] overlap cpu and gpu?

Mark Adams mfadams at lbl.gov
Fri Aug 14 13:12:57 CDT 2020


You can try Hypre. If that fails then there is a problem with your system.

And you can run with -info, grep the output for GAMG, and send it to me; I
can see if anything looks funny.

If this is just a Laplacian with a stable discretization and no crazy
material parameters, then stretched grids are about the only thing that can
hurt the solver.

Do both of your solves fail in a similar way?

On the CPU you can try this with large subdomains, ideally in serial:
-ampere_mg_levels_ksp_type richardson
-ampere_mg_levels_pc_type sor

And check that there are no unused options with -options_left. GAMG can
fail with bad eigenvalue estimates, but these parameters look fine.
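
For concreteness, a possible run line for that CPU test (the launcher, rank
count, and executable name are placeholders; the ampere_ prefix is taken from
your petscrc) would be something like:

srun -n 1 ./your_app -info -options_left \
    -ampere_pc_type gamg \
    -ampere_mg_levels_ksp_type richardson \
    -ampere_mg_levels_pc_type sor

and then grep the -info output for GAMG.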

On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.varini at gmail.com>
wrote:

> Dear Barry, yes it gives the same problems.
>
> Il giorno gio 13 ago 2020 alle ore 23:22 Barry Smith <bsmith at petsc.dev>
> ha scritto:
>
>>
>>    Does the same thing work (with GAMG) if you run the same problem on
>> the same machine with the same number of MPI ranks, but with a new
>> PETSC_ARCH that does NOT use the GPUs?
>>
>>    Barry
>>
>>    Ideally one gets almost identical convergence with CPUs or GPUs (same
>> problem, same machine), but a bug or a numerical change "might" affect this.
>>
>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.varini at gmail.com>
>> wrote:
>>
>> Dear Barry, you are right. The Cray argument checking is incorrect. It
>> does work with --download-fblaslapack.
>> However, it fails to converge. Is there anything obviously wrong with
>> my petscrc?
>> Is there anything else I am missing?
>>
>> Thanks
>>
>> Il giorno gio 13 ago 2020 alle ore 03:17 Barry Smith <bsmith at petsc.dev>
>> ha scritto:
>>
>>>
>>>    The QR is always done on the CPU; we don't currently have generic
>>> BLAS/LAPACK calls go to the GPU.
>>>
>>>    The error message is:
>>>
>>>    On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info
>>> = -7)
>>>
>>>    argument 7 is &LWORK which is defined by
>>>
>>>    PetscBLASInt   LWORK=N*bs;
>>>
>>>    and
>>>
>>>    N=nSAvec is the column block size of new P.
>>>
>>>    Presumably this is a huge run with many processes so using the
>>> debugger is not practical?
>>>
>>>    We need to see what these variables are
>>>
>>>     N, bs, nSAvec
>>>
>>>     perhaps nSAvec is zero which could easily upset LAPACK.
>>>
>>>     The crudest thing would be to just put a print statement in the code
>>> before the LAPACK call, or, if it is called many times, add an error check
>>> that generates an error if any of these three values are 0 (or negative).
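>>>
>>>     A minimal sketch of such a check (the variable names follow the
>>> description above; the exact spot would be just before the GEQRF call in
>>> formProl0() in agg.c):
>>>
>>>     if (N <= 0 || bs <= 0 || nSAvec <= 0) {
>>>       SETERRQ3(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Bad GEQRF sizes N=%D bs=%D nSAvec=%D",(PetscInt)N,(PetscInt)bs,(PetscInt)nSAvec);
>>>     }
>>>     /* or, for the crude print-statement variant */
>>>     ierr = PetscPrintf(PETSC_COMM_SELF,"formProl0: N=%D bs=%D nSAvec=%D\n",(PetscInt)N,(PetscInt)bs,(PetscInt)nSAvec);CHKERRQ(ierr);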
>>>
>>>    Barry
>>>
>>>
>>>     It is not impossible that the Cray argument checking is incorrect
>>> and the value passed in is fine. You can check this by using
>>> --download-fblaslapack and seeing if the same or some other error comes up.
>>>
>>>
>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> Can you reproduce this on the CPU?
>>> The QR factorization seems to be failing. That could be from bad data or
>>> a bad GPU QR.
>>>
>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.varini at gmail.com>
>>> wrote:
>>>
>>>> Dear all, following the suggestions, I resubmitted the simulation with
>>>> the petscrc below.
>>>> However, I get the following error:
>>>> ========
>>>>  7362 [592]PETSC ERROR: #1 formProl0() line 748 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>   7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>   7364 [339]PETSC ERROR: xGEQRF error
>>>>   7365 [339]PETSC ERROR: See
>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>> shooting.
>>>>   7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>   7367 [339]PETSC ERROR:
>>>> /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a  named nid05083
>>>> by nvarini Wed Aug 12 10:06:15 2020
>>>>   7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn
>>>> --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1
>>>> --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1
>>>> --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0
>>>> --with-scalapack=1 --with-cxx=CC --with-debugging=0
>>>> --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell
>>>> --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1
>>>> --with-cuda-c=nvcc --with-cxxlib-autodetect=0
>>>> --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>> --with-cxx=CC
>>>> --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>   7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>   7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>   7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>   7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>   7375 [339]PETSC ERROR: #1 formProl0() line 748 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>   7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>   7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>   7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>   7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>   7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value
>>>> (info = -7)
>>>>   7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>> ========
>>>>
>>>> I did try other pc_gamg_type values, but they fail as well.
>>>>
>>>>
>>>> #PETSc Option Table entries:
>>>> -ampere_dm_mat_type aijcusparse
>>>> -ampere_dm_vec_type cuda
>>>> -ampere_ksp_atol 1e-15
>>>> -ampere_ksp_initial_guess_nonzero yes
>>>> -ampere_ksp_reuse_preconditioner yes
>>>> -ampere_ksp_rtol 1e-7
>>>> -ampere_ksp_type dgmres
>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>> -ampere_mg_levels_ksp_type chebyshev
>>>> -ampere_mg_levels_pc_type jacobi
>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>> -ampere_pc_gamg_reuse_interpolation true
>>>> -ampere_pc_gamg_square_graph 1
>>>> -ampere_pc_gamg_threshold 0.05
>>>> -ampere_pc_gamg_threshold_scale .0
>>>> -ampere_pc_gamg_type agg
>>>> -ampere_pc_type gamg
>>>> -dm_mat_type aijcusparse
>>>> -dm_vec_type cuda
>>>> -log_view
>>>> -poisson_dm_mat_type aijcusparse
>>>> -poisson_dm_vec_type cuda
>>>> -poisson_ksp_atol 1e-15
>>>> -poisson_ksp_initial_guess_nonzero yes
>>>> -poisson_ksp_reuse_preconditioner yes
>>>> -poisson_ksp_rtol 1e-7
>>>> -poisson_ksp_type dgmres
>>>> -poisson_log_view
>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>> -poisson_mg_levels_ksp_max_it 1
>>>> -poisson_mg_levels_ksp_type chebyshev
>>>> -poisson_mg_levels_pc_type jacobi
>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>> -poisson_pc_gamg_reuse_interpolation true
>>>> -poisson_pc_gamg_square_graph 1
>>>> -poisson_pc_gamg_threshold 0.05
>>>> -poisson_pc_gamg_threshold_scale .0
>>>> -poisson_pc_gamg_type agg
>>>> -poisson_pc_type gamg
>>>> -use_mat_nearnullspace true
>>>> #End of PETSc Option Table entries
>>>>
>>>> Regards,
>>>>
>>>> Nicola
>>>>
>>>> Il giorno mar 4 ago 2020 alle ore 17:57 Mark Adams <mfadams at lbl.gov>
>>>> ha scritto:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <
>>>>> stefano.zampini at gmail.com> wrote:
>>>>>
>>>>>> Nicola,
>>>>>>
>>>>>> You are actually not using the GPU properly, since you use HYPRE
>>>>>> preconditioning, which is CPU only. One of your solvers is actually slower
>>>>>> on “GPU”.
>>>>>> For fully GPU AMG, you can use PCGAMG with Chebyshev smoothers and Jacobi
>>>>>> preconditioning on the levels. Mark can help you out with the specific
>>>>>> command line options.
>>>>>> When it works properly, everything related to the PC application is
>>>>>> offloaded to the GPU, and you should expect the well-advertised 10x (maybe
>>>>>> more) speedup one expects from GPUs during KSPSolve.
>>>>>>
>>>>>>
>>>>> The speedup depends on the machine, but on SUMMIT, comparing enough CPUs
>>>>> to saturate the memory bus against all 6 GPUs, the speedup is a function of
>>>>> the per-process problem size. I saw 10x at about 100K equations/process.
>>>>>
>>>>>
>>>>>> Doing what you want to do is one of the last optimization steps of an
>>>>>> already optimized code before entering production. Yours is not even
>>>>>> optimized for proper GPU usage yet.
>>>>>> Also, is there any specific reason why you are using dgmres and fgmres?
>>>>>>
>>>>>> PETSc has not been designed with multi-threading in mind. You can
>>>>>> achieve “overlap” of the two solves by splitting the communicator. But then
>>>>>> you need communications to let the two solutions talk to each other.
>>>>>>
>>>>>> Thanks
>>>>>> Stefano
>>>>>>
>>>>>>
>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.varini at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear all, thanks for your replies. The reason I asked whether it is
>>>>>> possible to overlap the Poisson and Ampere solves is that they take
>>>>>> roughly the same amount of time. Please find attached the profiling
>>>>>> logs for CPU only and GPU only.
>>>>>> Of course it is possible to split the MPI communicator and run each
>>>>>> solver on a different subcommunicator, but this would involve more
>>>>>> communication.
>>>>>> Has anyone ever tried to run two solvers with hyperthreading?
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> Il giorno dom 2 ago 2020 alle ore 14:09 Mark Adams <mfadams at lbl.gov>
>>>>>> ha scritto:
>>>>>>
>>>>>>> I suspect that the Poisson and Ampere's law solves are not coupled.
>>>>>>> You might be able to duplicate the communicator and use two threads. You
>>>>>>> would want to configure PETSc with threadsafety (and threads), and I think
>>>>>>> it could/should work, but this mode is never used by anyone.
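>>>>>>>
>>>>>>> If you did want to experiment, the core of it would look something like
>>>>>>> the (untested) sketch below: one KSP per equation, both fully set up
>>>>>>> beforehand, with the two KSPSolve calls issued from different host
>>>>>>> threads. The names ksp_poisson/ksp_ampere and the vectors are
>>>>>>> placeholders, error checking is omitted, and PETSc would have to be
>>>>>>> configured with --with-threadsafety for this to have any chance of working:
>>>>>>>
>>>>>>>   #pragma omp parallel sections num_threads(2)
>>>>>>>   {
>>>>>>>     #pragma omp section
>>>>>>>     { KSPSolve(ksp_poisson, b_poisson, x_poisson); } /* e.g. the GPU solve */
>>>>>>>     #pragma omp section
>>>>>>>     { KSPSolve(ksp_ampere, b_ampere, x_ampere); }    /* e.g. the CPU solve */
>>>>>>>   }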
>>>>>>>
>>>>>>> That said, I would not recommend doing this unless you feel like
>>>>>>> playing in computer science, as opposed to doing application science. In
>>>>>>> the best-case scenario you get a speedup of 2x. That is a strict upper
>>>>>>> bound, and you will never come close to it. Your hardware has some balance
>>>>>>> of CPU to GPU processing rate, and your application has some balance of
>>>>>>> work between your two solves. Those two balances have to match, 1:1, to
>>>>>>> get close to the 2x speedup. To be concrete, from what little I can guess
>>>>>>> about your application, let's assume that the cost of each of the two
>>>>>>> solves is about the same (e.g., Laplacians on your domain, the best-case
>>>>>>> scenario). But GPU machines these days are configured with roughly 1-10%
>>>>>>> of their capacity in the CPUs, which gives you an upper bound of about a
>>>>>>> 10% speedup. That is noise. Upshot: unless you configure your hardware to
>>>>>>> match this problem, and the two solves have the same cost, you will not
>>>>>>> see anything close to a 2x speedup. Your time is better spent elsewhere.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <jed at jedbrown.org> wrote:
>>>>>>>
>>>>>>>> You can use MPI and split the communicator so n-1 ranks create a
>>>>>>>> DMDA for one part of your system and the other rank drives the GPU in the
>>>>>>>> other part.  They can all be part of the same coupled system on the full
>>>>>>>> communicator, but PETSc doesn't currently support some ranks having their
>>>>>>>> Vec arrays on GPU and others on host, so you'd be paying host-device
>>>>>>>> transfer costs on each iteration (and that might swamp any performance
>>>>>>>> benefit you would have gotten).
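>>>>>>>>
>>>>>>>> A minimal (untested) sketch of that split, with the last rank driving the
>>>>>>>> GPU part and the other n-1 ranks handling the rest; building the
>>>>>>>> DMDA/Mat/KSP on each subcommunicator and coupling the two groups is
>>>>>>>> application-specific and omitted here:
>>>>>>>>
>>>>>>>>   #include <petscksp.h>
>>>>>>>>
>>>>>>>>   int main(int argc, char **argv)
>>>>>>>>   {
>>>>>>>>     PetscErrorCode ierr;
>>>>>>>>     PetscMPIInt    rank, size, color;
>>>>>>>>     MPI_Comm       subcomm;
>>>>>>>>
>>>>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>>>>>>>     ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
>>>>>>>>     ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);
>>>>>>>>
>>>>>>>>     /* last rank drives the GPU solve, the remaining ranks do the host solve */
>>>>>>>>     color = (rank == size - 1) ? 1 : 0;
>>>>>>>>     ierr  = MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);CHKERRQ(ierr);
>>>>>>>>
>>>>>>>>     /* each group creates its own DMDA/Mat/KSP on subcomm and calls KSPSolve;
>>>>>>>>        any data exchange between the two groups has to be coded by hand */
>>>>>>>>
>>>>>>>>     ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
>>>>>>>>     ierr = PetscFinalize();
>>>>>>>>     return ierr;
>>>>>>>>   }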
>>>>>>>>
>>>>>>>> In any case, be sure to think about the execution time of each
>>>>>>>> part.  Load balancing with matching time-to-solution for each part can be
>>>>>>>> really hard.
>>>>>>>>
>>>>>>>>
>>>>>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>>>>>>
>>>>>>>> >   Nicola,
>>>>>>>> >
>>>>>>>> >     This is really not viable or practical at this time with PETSc.
>>>>>>>> It is not impossible, but it requires careful coding with threads; another
>>>>>>>> possibility is to use one half of the virtual GPUs for each solve, which is
>>>>>>>> also not trivial. I would recommend first seeing what kind of performance
>>>>>>>> you can get on the GPU for each type of solve and revisiting this idea in
>>>>>>>> the future.
>>>>>>>> >
>>>>>>>> >    Barry
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <
>>>>>>>> nicola.varini at gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU and
>>>>>>>> >> GPU with DMDA.
>>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute
>>>>>>>> >> the Poisson and Ampere equations at the same time, one on the CPU and
>>>>>>>> >> the other on the GPU?
>>>>>>>> >>
>>>>>>>> >> Thanks
>>>>>>>>
>>>>>>> <out_gpu><out_nogpu>
>>>>>>
>>>>>>
>>>>>>
>>> <out_miniapp_f_poisson>
>>
>>
>>