[petsc-users] overlap cpu and gpu?

Wed Aug 12 19:19:59 CDT 2020

Can you reproduce this on the CPU?
The QR factorization seems to be failing. That could be from bad data or a
bad GPU QR.
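
The `info = -7` line in the quoted log can be read with the standard LAPACK convention: a negative `info` of `-i` means the i-th argument was rejected. In the reference LAPACK interface `xGEQRF(M, N, A, LDA, TAU, WORK, LWORK, INFO)` the seventh argument is `LWORK`, the workspace length. A small sketch of that decoding follows. The helper name `decode_geqrf_info` is made up for illustration, and it assumes that Cray's `__cray_mgm_dgeqrf` keeps the reference argument order.

```python
# Argument order of reference LAPACK xGEQRF(M, N, A, LDA, TAU, WORK, LWORK, INFO).
# Assumption: the Cray wrapper named in the log uses the same order.
GEQRF_ARGS = ("M", "N", "A", "LDA", "TAU", "WORK", "LWORK", "INFO")


def decode_geqrf_info(info: int) -> str:
    """Translate an xGEQRF info code into a readable message (hypothetical helper)."""
    if info == 0:
        return "success"
    if info < 0 and -info <= len(GEQRF_ARGS):
        position = -info
        return f"argument {position} ({GEQRF_ARGS[position - 1]}) had an illegal value"
    raise ValueError(f"unexpected xGEQRF info value: {info}")


print(decode_geqrf_info(-7))
```

Reference LAPACK documents the constraint as `LWORK >= max(1, N)` unless `LWORK = -1` is passed as a workspace query, so the sizes reaching the QR call may be worth printing on both the CPU and GPU runs.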

On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.varini at gmail.com>
wrote:

> Dear all, following your suggestions I resubmitted the simulation with the
> petscrc below.
> However, I get the following error:
> ========
>  7362 [592]PETSC ERROR: #1 formProl0() line 748 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>   7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>   7364 [339]PETSC ERROR: xGEQRF error
>   7365 [339]PETSC ERROR: See
> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>   7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>   7367 [339]PETSC ERROR:
> /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a  named nid05083
> by nvarini Wed Aug 12 10:06:15 2020
>   7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn
> --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1
> --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1
> --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0
> --with-scalapack=1 --with-cxx=CC --with-debugging=0
> --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell
> --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1
> --with-cuda-c=nvcc --with-cxxlib-autodetect=0
> --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
> --with-cxx=CC
> --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>   7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>   7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>   7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>   7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>   7375 [339]PETSC ERROR: #1 formProl0() line 748 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>   7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>   7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>   7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>   7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>   7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value
> (info = -7)
>   7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
> ========
>
> I also tried other pc_gamg_type values, but they fail as well.
>
>
> #PETSc Option Table entries:
> -ampere_dm_mat_type aijcusparse
> -ampere_dm_vec_type cuda
> -ampere_ksp_atol 1e-15
> -ampere_ksp_initial_guess_nonzero yes
> -ampere_ksp_reuse_preconditioner yes
> -ampere_ksp_rtol 1e-7
> -ampere_ksp_type dgmres
> -ampere_mg_levels_esteig_ksp_max_it 10
> -ampere_mg_levels_esteig_ksp_type cg
> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
> -ampere_mg_levels_ksp_type chebyshev
> -ampere_mg_levels_pc_type jacobi
> -ampere_pc_gamg_agg_nsmooths 1
> -ampere_pc_gamg_coarse_eq_limit 10
> -ampere_pc_gamg_reuse_interpolation true
> -ampere_pc_gamg_square_graph 1
> -ampere_pc_gamg_threshold 0.05
> -ampere_pc_gamg_threshold_scale .0
> -ampere_pc_gamg_type agg
> -ampere_pc_type gamg
> -dm_mat_type aijcusparse
> -dm_vec_type cuda
> -log_view
> -poisson_dm_mat_type aijcusparse
> -poisson_dm_vec_type cuda
> -poisson_ksp_atol 1e-15
> -poisson_ksp_initial_guess_nonzero yes
> -poisson_ksp_reuse_preconditioner yes
> -poisson_ksp_rtol 1e-7
> -poisson_ksp_type dgmres
> -poisson_log_view
> -poisson_mg_levels_esteig_ksp_max_it 10
> -poisson_mg_levels_esteig_ksp_type cg
> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
> -poisson_mg_levels_ksp_max_it 1
> -poisson_mg_levels_ksp_type chebyshev
> -poisson_mg_levels_pc_type jacobi
> -poisson_pc_gamg_agg_nsmooths 1
> -poisson_pc_gamg_coarse_eq_limit 10
> -poisson_pc_gamg_reuse_interpolation true
> -poisson_pc_gamg_square_graph 1
> -poisson_pc_gamg_threshold 0.05
> -poisson_pc_gamg_threshold_scale .0
> -poisson_pc_gamg_type agg
> -poisson_pc_type gamg
> -use_mat_nearnullspace true
> #End of PETSc Option Table entries
>
> Regards,
>
> Nicola
>
> On Tue, Aug 4, 2020 at 17:57, Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zampini at gmail.com>
>> wrote:
>>
>>> Nicola,
>>>
>>> You are actually not using the GPU properly, since you use HYPRE
>>> preconditioning, which is CPU only. One of your solvers is actually slower
>>> on “GPU”.
>>> For a full AMG GPU, you can use PCGAMG, with cheby smoothers and with
>>> Jacobi preconditioning. Mark can help you out with the specific command
>>> line options.
>>> When it works properly, everything related to PC application is
>>> offloaded to the GPU, and you should see the well-known, much-advertised
>>> 10x (maybe more) speedup that people expect from GPUs during KSPSolve.
>>>
>>>
>> The speedup depends on the machine, but on SUMMIT, using enough CPUs to
>> saturate the memory bus vs all 6 GPUs the speedup is a function of problem
>> subdomain size. I saw 10x at about 100K equations/process.
>>
>>
>>> Doing what you want to do is one of the last optimization steps of an
>>> already optimized code before entering production. Yours is not even
>>> optimized for proper GPU usage yet.
>>> Also, any specific reason why you are using dgmres and fgmres?
>>>
>>> PETSc has not been designed with multi-threading in mind. You can
>>> achieve “overlap” of the two solves by splitting the communicator. But then
>>> you need communications to let the two solutions talk to each other.
>>>
>>> Thanks
>>> Stefano
>>>
>>>
>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.varini at gmail.com>
>>> wrote:
>>>
>>> Dear all, thanks for your replies. The reason I asked whether it is
>>> possible to overlap Poisson and Ampere is that they take roughly
>>> the same amount of time. Please find attached the profiling
>>> logs for CPU only and GPU only.
>>> Of course it is possible to split the MPI communicator and run each
>>> solver on a different subcommunicator; however, this would involve more
>>> communication.
>>> Has anyone ever tried to run 2 solvers with hyperthreading?
>>> Thanks
>>>
>>>
>>> On Sun, Aug 2, 2020 at 14:09, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> I suspect that the Poisson and Ampere's law solves are not coupled. You
>>>> might be able to duplicate the communicator and use two threads. You would
>>>> want to configure PETSc with thread safety and threads and I think it
>>>> could/should work, but this mode is never used by anyone.
>>>>
>>>> That said, I would not recommend doing this unless you feel like
>>>> playing in computer science, as opposed to doing application science. In
>>>> the best case scenario you get a speedup of 2x. That is a strict upper
>>>> bound, but you will never come close to it. Your hardware has some balance
>>>> of CPU to GPU processing rate. Your application has a balance of volume of
>>>> work for your two solves. These two ratios have to be the same to get close
>>>> to 2x speedup, and both have to be 1:1. To be concrete, from what little I
>>>> can guess about your application, let's assume that the cost of each of
>>>> these two solves is about the same (e.g., Laplacians on your domain, the
>>>> best case scenario). But GPU machines are configured to have roughly 1-10%
>>>> of capacity in the CPUs these days, which gives you an upper bound of about
>>>> 10% speedup. That is noise. Upshot: unless you configure your hardware to
>>>> match this problem, and the two solves have the same cost, you will not see
>>>> close to 2x speedup. Your time is better spent elsewhere.
>>>>
>>>> Mark
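
A rough way to put numbers on the argument above is a two-rate model. This is only a sketch under stated assumptions: `cpu_rate` and `gpu_rate` are hypothetical processing rates in arbitrary work units per second, the baseline runs both solves back to back on the GPU, and communication and host-device transfer costs are ignored. The 0.1 ratio used below takes the CPU to be about 10% of the GPU rate, which is how the 10% bound in the quoted message works out.

```python
def ideal_overlap_speedup(cpu_rate: float, gpu_rate: float) -> float:
    """Upper bound if the total work could be split perfectly across both devices."""
    return (cpu_rate + gpu_rate) / gpu_rate


def whole_solve_overlap_speedup(work_cpu: float, work_gpu: float,
                                cpu_rate: float, gpu_rate: float) -> float:
    """Speedup when one whole solve runs on the CPU while the other runs on the GPU."""
    # Baseline: both solves run one after the other on the GPU.
    baseline = (work_cpu + work_gpu) / gpu_rate
    # Overlapped: the time step finishes when the slower of the two parts finishes.
    overlapped = max(work_cpu / cpu_rate, work_gpu / gpu_rate)
    return baseline / overlapped


# Equal work, equal rates: the 2x best case.
print(whole_solve_overlap_speedup(1.0, 1.0, cpu_rate=1.0, gpu_rate=1.0))
# CPU at 10% of the GPU rate: at most about 1.1x even with a perfect split.
print(ideal_overlap_speedup(cpu_rate=0.1, gpu_rate=1.0))
# Same rates, but a whole equal-cost solve parked on the CPU: a slowdown.
print(whole_solve_overlap_speedup(1.0, 1.0, cpu_rate=0.1, gpu_rate=1.0))
```

In this model the 2x figure appears only when both the work ratio and the rate ratio are 1:1. With a slow CPU, moving an entire equal-cost solve to it makes the time step slower than keeping both solves on the GPU.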
>>>>
>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <jed at jedbrown.org> wrote:
>>>>
>>>>> You can use MPI and split the communicator so n-1 ranks create a DMDA
>>>>> for one part of your system and the other rank drives the GPU in the other
>>>>> part.  They can all be part of the same coupled system on the full
>>>>> communicator, but PETSc doesn't currently support some ranks having their
>>>>> Vec arrays on GPU and others on host, so you'd be paying host-device
>>>>> transfer costs on each iteration (and that might swamp any performance
>>>>> benefit you would have gotten).
>>>>>
>>>>> In any case, be sure to think about the execution time of each part.
>>>>> Load balancing with matching time-to-solution for each part can be really
>>>>> hard.
>>>>>
>>>>>
>>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>>>
>>>>> >   Nicola,
>>>>> >
>>>>> >     This is not really viable or practical at this time with PETSc. It
>>>>> > is not impossible but requires careful coding with threads. Another
>>>>> > possibility is to use one half of the virtual GPUs for each solve; this is
>>>>> > also not trivial. I would recommend first seeing what kind of performance
>>>>> > you can get on the GPU for each type of solve and revisiting this idea in
>>>>> > the future.
>>>>> >
>>>>> >    Barry
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.varini at gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> Hello, I would like to know if it is possible to overlap CPU and
>>>>> >> GPU with DMDA.
>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute
>>>>> >> the Poisson and Ampere equations at the same time, one on the CPU and
>>>>> >> the other on the GPU?
>>>>> >>
>>>>> >> Thanks
>>>>>
>>>> <out_gpu><out_nogpu>
>>>
>>>
>>>