[petsc-users] overlap cpu and gpu?

Mon Aug 17 09:33:00 CDT 2020

Hi Mark, this is the out of grep GAMG after I used -info:
=======
[0] PCSetUp_GAMG(): level 0) N=582736, n data rows=1, n data cols=1,
nnz/row (ave)=9, np=12
[0] PCGAMGFilterGraph():         97.9676% nnz after filtering, with
threshold 0., 8.95768 nnz ave. (N=582736)
[0] PCGAMGCoarsen_AGG(): Square Graph on level 1 of 1 to square
[0] PCGAMGProlongator_AGG(): New grid 38934 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=2.101683e+00
min=4.341777e-03 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 0, cache spectra
0.00434178 2.10168
[0] PCSetUp_GAMG(): 1) N=38934, n data cols=1, nnz/row (ave)=18, 12 active
pes
[0] PCGAMGFilterGraph():         97.024% nnz after filtering, with
threshold 0., 17.9774 nnz ave. (N=38934)
[0] PCGAMGProlongator_AGG(): New grid 4459 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=4.521607e+01
min=5.854294e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 1, cache spectra 0.585429
45.2161
[0] PCSetUp_GAMG(): 2) N=4459, n data cols=1, nnz/row (ave)=29, 12 active
pes
[0] PCGAMGFilterGraph():         99.6422% nnz after filtering, with
threshold 0., 27.5481 nnz ave. (N=4459)
[0] PCGAMGProlongator_AGG(): New grid 345 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.394069e+01
min=1.086973e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 2, cache spectra 0.108697
13.9407
[0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 29 with simple
aggregation
[0] PCSetUp_GAMG(): 3) N=345, n data cols=1, nnz/row (ave)=31, 6 active pes
[0] PCGAMGFilterGraph():         99.6292% nnz after filtering, with
threshold 0., 26.9667 nnz ave. (N=345)
[0] PCGAMGProlongator_AGG(): New grid 26 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.463593e+02
min=1.469384e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 3, cache spectra 0.146938
146.359
[0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 5 with simple
aggregation
[0] PCSetUp_GAMG(): 4) N=26, n data cols=1, nnz/row (ave)=16, 1 active pes
[0] PCSetUp_GAMG(): 5 levels, grid complexity = 1.16304
PCGAMGGraph_AGG        4 1.0 8.4114e-02 1.0 1.02e+06 1.0 3.8e+02 1.3e+03
4.0e+01  0  0  0  0  0   0  0  0  0  0   145
PCGAMGCoarse_AGG       4 1.0 3.2107e-01 1.0 9.43e+06 1.0 7.3e+02 1.1e+04
3.5e+01  0  0  0  0  0   0  0  0  0  0   351
PCGAMGProl_AGG         4 1.0 2.8825e-02 1.0 0.00e+00 0.0 3.5e+02 2.8e+03
6.4e+01  0  0  0  0  0   0  0  0  0  0     0
PCGAMGPOpt_AGG         4 1.0 1.1570e-01 1.0 2.61e+07 1.0 1.2e+03 2.6e+03
1.6e+02  0  0  0  0  1   0  0  0  0  1  2692
GAMG: createProl       4 1.0 5.5680e-01 1.0 3.64e+07 1.0 2.7e+03 4.6e+03
3.0e+02  0  0  0  0  1   0  0  0  0  1   784
GAMG: partLevel        4 1.0 1.1628e-01 1.0 5.90e+06 1.0 1.1e+03 3.0e+03
1.6e+02  0  0  0  0  1   0  0  0  0  1   604
======
Nicola

Il giorno lun 17 ago 2020 alle ore 15:40 Mark Adams <mfadams at lbl.gov> ha
scritto:

>
>
> On Mon, Aug 17, 2020 at 9:24 AM nicola varini <nicola.varini at gmail.com>
> wrote:
>
>> Hi Mark, I do confirm that hypre with boomeramg is working fine and is
>> pretty fast.
>>
>
> Good, you can send me the -info (grep GAMG) output and I try to see what
> is going on.
>
>
>> However, none of the GAMG option works.
>> Did anyone ever succeeded in usign hypre with petsc on gpu?
>>
>
> We have gotten Hypre to run on GPUs but it has been fragile. The
> performance has been marginal (due to use of USM apparently), but it is
> being worked on by the hypre team.
>
> The cude tools are changing fast and I am guessing this is a different
> version than what we have tested, perhaps. Maybe someone else can help with
> this, but I know we use cuda 10.2 and you are using cuda tools 10.1.
>
> And you do want to use the most up-to-date PETSc.
>
>
>> I did manage to compile hypre on gpu but I do get the following error:
>> =======
>>           CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
>> In file included from
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
>>                  from
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>                  from
>> /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>                  from
>> /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>                  from
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1:
>> error: unknown type name 'namespace'
>>  namespace thrust
>>  ^~~~~~~~~
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1:
>> error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
>>  {
>>  ^
>> In file included from
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
>>                  from
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
>>                  from
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>                  from
>> /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>                  from
>> /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>                  from
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10:
>> fatal error: cstddef: No such file or directory
>>  #include <cstddef>
>>           ^~~~~~~~~
>> compilation terminated.
>>
>> =======
>> Nicola
>>
>> Il giorno ven 14 ago 2020 alle ore 20:13 Mark Adams <mfadams at lbl.gov> ha
>> scritto:
>>
>>> You can try Hypre. If that fails then there is a problem with your
>>> system.
>>>
>>> And you can run with -info and grep on GAMG and send the output and I
>>> can see if I see anything funny.
>>>
>>> If this is just a Lapacian with a stable discretization and not crazy
>>> material parameters then stretched grids are about the only thing that can
>>> hurt the solver.
>>>
>>> Do both of your solves fail in a similar way?
>>>
>>> On the CPU you can try this with large subdomains, preferably (in serial
>>> ideally):
>>> -ampere_mg_levels_ksp_type richardson
>>> -ampere_mg_levels_pc_type sor
>>>
>>> And check that there are no unused options with -options_left. GAMG can
>>> fail with bad eigen estimates, but these parameters look fine.
>>>
>>> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.varini at gmail.com>
>>> wrote:
>>>
>>>> Dear Barry, yes it gives the same problems.
>>>>
>>>> Il giorno gio 13 ago 2020 alle ore 23:22 Barry Smith <bsmith at petsc.dev>
>>>> ha scritto:
>>>>
>>>>>
>>>>>    Does the same thing work (with GAMG) if you run on the same problem
>>>>> on the same machine same number of MPI ranks but make a new PETSC_ARCH that
>>>>> does NOT use the GPUs?
>>>>>
>>>>>    Barry
>>>>>
>>>>>    Ideally one gets almost identical convergence with CPUs or GPUs
>>>>> (same problem, same machine) but a bug or numerically change "might" affect
>>>>> this.
>>>>>
>>>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.varini at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Barry, you are right. The Cray argument checking is incorrect. It
>>>>> does work with download-fblaslapack.
>>>>> However it does fail to converge. Is there anything obviously wrong
>>>>> with my petscrc?
>>>>> Anything else am I missing?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Il giorno gio 13 ago 2020 alle ore 03:17 Barry Smith <bsmith at petsc.dev>
>>>>> ha scritto:
>>>>>
>>>>>>
>>>>>>    The QR is always done on the CPU, we don't have generic calls to
>>>>>> blas/lapack go to the GPU currently.
>>>>>>
>>>>>>    The error message is:
>>>>>>
>>>>>>    On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value
>>>>>> (info = -7)
>>>>>>
>>>>>>    argument 7 is &LWORK which is defined by
>>>>>>
>>>>>>    PetscBLASInt   LWORK=N*bs;
>>>>>>
>>>>>>    and
>>>>>>
>>>>>>    N=nSAvec is the column block size of new P.
>>>>>>
>>>>>>    Presumably this is a huge run with many processes so using the
>>>>>> debugger is not practical?
>>>>>>
>>>>>>    We need to see what these variables are
>>>>>>
>>>>>>     N, bs, nSAvec
>>>>>>
>>>>>>     perhaps nSAvec is zero which could easily upset LAPACK.
>>>>>>
>>>>>>     Crudest thing would be to just put a print statement in the code
>>>>>> before the LAPACK call of if they are called many times add an error check
>>>>>> like that
>>>>>>     generates an error if any of these three values are 0 (or
>>>>>> negative).
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>
>>>>>>     It is not impossible that the Cray argument checking is incorrect
>>>>>> and the value passed in is fine. You can check this by using
>>>>>> --download-fblaslapack and see if the same or some other error comes up.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>> Can you reproduce this on the CPU?
>>>>>> The QR factorization seems to be failing. That could be from bad data
>>>>>> or a bad GPU QR.
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <
>>>>>> nicola.varini at gmail.com> wrote:
>>>>>>
>>>>>>> Dear all, following the suggestions I did resubmit the simulation
>>>>>>> with the petscrc below.
>>>>>>> However I do get the following error:
>>>>>>> ========
>>>>>>>  7362 [592]PETSC ERROR: #1 formProl0() line 748 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>   7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>>>>   7364 [339]PETSC ERROR: xGEQRF error
>>>>>>>   7365 [339]PETSC ERROR: See
>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>>>> shooting.
>>>>>>>   7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>>>>   7367 [339]PETSC ERROR:
>>>>>>> /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a  named nid05083
>>>>>>> by nvarini Wed Aug 12 10:06:15 2020
>>>>>>>   7368 [339]PETSC ERROR: Configure options --with-cc=cc
>>>>>>> --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1
>>>>>>> --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1
>>>>>>> --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0
>>>>>>> --with-scalapa       ck=1 --with-cxx=CC --with-debugging=0
>>>>>>> --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell
>>>>>>> --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1
>>>>>>> --with-cuda-c=nvcc --with-cxxlib-autodetect=0
>>>>>>> --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include -
>>>>>>> -with-cxx=CC
>>>>>>> --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>>>>   7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>   7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>   7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>>   7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>   7375 [339]PETSC ERROR: #1 formProl0() line 748 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>   7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>   7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>   7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>>   7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>   7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal
>>>>>>> value (info = -7)
>>>>>>>   7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in
>>>>>>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>> ========
>>>>>>>
>>>>>>> I did try other pc_gamg_type but they fails as well.
>>>>>>>
>>>>>>>
>>>>>>> #PETSc Option Table entries:
>>>>>>> -ampere_dm_mat_type aijcusparse
>>>>>>> -ampere_dm_vec_type cuda
>>>>>>> -ampere_ksp_atol 1e-15
>>>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>>>> -ampere_ksp_rtol 1e-7
>>>>>>> -ampere_ksp_type dgmres
>>>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>>>> -ampere_mg_levels_pc_type jacobi
>>>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>>>> -ampere_pc_gamg_square_graph 1
>>>>>>> -ampere_pc_gamg_threshold 0.05
>>>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>>>> -ampere_pc_gamg_type agg
>>>>>>> -ampere_pc_type gamg
>>>>>>> -dm_mat_type aijcusparse
>>>>>>> -dm_vec_type cuda
>>>>>>> -log_view
>>>>>>> -poisson_dm_mat_type aijcusparse
>>>>>>> -poisson_dm_vec_type cuda
>>>>>>> -poisson_ksp_atol 1e-15
>>>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>>>> -poisson_ksp_rtol 1e-7
>>>>>>> -poisson_ksp_type dgmres
>>>>>>> -poisson_log_view
>>>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>>>> -poisson_mg_levels_pc_type jacobi
>>>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>>>> -poisson_pc_gamg_square_graph 1
>>>>>>> -poisson_pc_gamg_threshold 0.05
>>>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>>>> -poisson_pc_gamg_type agg
>>>>>>> -poisson_pc_type gamg
>>>>>>> -use_mat_nearnullspace true
>>>>>>> #End of PETSc Option Table entries
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Nicola
>>>>>>>
>>>>>>> Il giorno mar 4 ago 2020 alle ore 17:57 Mark Adams <mfadams at lbl.gov>
>>>>>>> ha scritto:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <
>>>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Nicola,
>>>>>>>>>
>>>>>>>>> You are actually not using the GPU properly, since you use HYPRE
>>>>>>>>> preconditioning, which is CPU only. One of your solvers is actually slower
>>>>>>>>> on “GPU”.
>>>>>>>>> For a full AMG GPU, you can use PCGAMG, with cheby smoothers and
>>>>>>>>> with Jacobi preconditioning. Mark can help you out with the specific
>>>>>>>>> command line options.
>>>>>>>>> When it works properly, everything related to PC application is
>>>>>>>>> offloaded to the GPU, and you should expect to get the well-known and
>>>>>>>>> branded 10x (maybe more) speedup one is expecting from GPUs during KSPSolve
>>>>>>>>>
>>>>>>>>>
>>>>>>>> The speedup depends on the machine, but on SUMMIT, using enough
>>>>>>>> CPUs to saturate the memory bus vs all 6 GPUs the speedup is a function of
>>>>>>>> problem subdomain size. I saw 10x at about 100K equations/process.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Doing what you want to do is one of the last optimization steps of
>>>>>>>>> an already optimized code before entering production. Yours is not even
>>>>>>>>> optimized for proper GPU usage  yet.
>>>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>>>>>
>>>>>>>>> PETSc has not been designed with multi-threading in mind. You can
>>>>>>>>> achieve “overlap” of the two solves by splitting the communicator. But then
>>>>>>>>> you need communications to let the two solutions talk to each other.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <
>>>>>>>>> nicola.varini at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Dear all, thanks for your replies. The reason why I've asked if it
>>>>>>>>> is possible to overlap poisson and ampere is because they roughly
>>>>>>>>> take the same amount of time. Please find in attachment the
>>>>>>>>> profiling logs for only CPU  and only GPU.
>>>>>>>>> Of course it is possible to split the MPI communicator and run
>>>>>>>>> each solver on different subcommunicator, however this would involve more
>>>>>>>>> communication.
>>>>>>>>> Did anyone ever tried to run 2 solvers with hyperthreading?
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Il giorno dom 2 ago 2020 alle ore 14:09 Mark Adams <
>>>>>>>>> mfadams at lbl.gov> ha scritto:
>>>>>>>>>
>>>>>>>>>> I suspect that the Poisson and Ampere's law solve are not
>>>>>>>>>> coupled. You might be able to duplicate the communicator and use two
>>>>>>>>>> threads. You would want to configure PETSc with threadsafty and threads and
>>>>>>>>>> I think it could/should work, but this mode is never used by anyone.
>>>>>>>>>>
>>>>>>>>>> That said, I would not recommend doing this unless you feel like
>>>>>>>>>> playing in computer science, as opposed to doing application science. The
>>>>>>>>>> best case scenario you get a speedup of 2x. That is a strict upper bound,
>>>>>>>>>> but you will never come close to it. Your hardware has some balance of CPU
>>>>>>>>>> to GPU processing rate. Your application has a balance of volume of work
>>>>>>>>>> for your two solves. They have to be the same to get close to 2x speedup
>>>>>>>>>> and that ratio(s) has to be 1:1. To be concrete, from what little I can
>>>>>>>>>> guess about your applications let's assume that the cost of each of these
>>>>>>>>>> two solves is about the same (eg, Laplacians on your domain and the best
>>>>>>>>>> case scenario). But, GPU machines are configured to have roughly 1-10% of
>>>>>>>>>> capacity in the GPUs, these days, that gives you an upper bound of about
>>>>>>>>>> 10% speedup. That is noise. Upshot, unless you configure your hardware to
>>>>>>>>>> match this problem, and the two solves have the same cost, you will not see
>>>>>>>>>> close to 2x speedup. Your time is better spent elsewhere.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <jed at jedbrown.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> You can use MPI and split the communicator so n-1 ranks create a
>>>>>>>>>>> DMDA for one part of your system and the other rank drives the GPU in the
>>>>>>>>>>> other part.  They can all be part of the same coupled system on the full
>>>>>>>>>>> communicator, but PETSc doesn't currently support some ranks having their
>>>>>>>>>>> Vec arrays on GPU and others on host, so you'd be paying host-device
>>>>>>>>>>> transfer costs on each iteration (and that might swamp any performance
>>>>>>>>>>> benefit you would have gotten).
>>>>>>>>>>>
>>>>>>>>>>> In any case, be sure to think about the execution time of each
>>>>>>>>>>> part.  Load balancing with matching time-to-solution for each part can be
>>>>>>>>>>> really hard.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>>>>>>>>>
>>>>>>>>>>> >   Nicola,
>>>>>>>>>>> >
>>>>>>>>>>> >     This is really viable or practical at this time with
>>>>>>>>>>> PETSc. It is not impossible but requires careful coding with threads,
>>>>>>>>>>> another possibility is to use one half of the virtual GPUs for each solve,
>>>>>>>>>>> this is also not trivial. I would recommend first seeing what kind of
>>>>>>>>>>> performance you can get on the GPU for each type of solve and revist this
>>>>>>>>>>> idea in the future.
>>>>>>>>>>> >
>>>>>>>>>>> >    Barry
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <
>>>>>>>>>>> nicola.varini at gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU
>>>>>>>>>>> and GPU with DMDA.
>>>>>>>>>>> >> I've a machine where each node has 1P100+1Haswell.
>>>>>>>>>>> >> I've to resolve Poisson and Ampere equation for each time
>>>>>>>>>>> step.
>>>>>>>>>>> >> I'm using 2D DMDA for each of them. Would be possible to
>>>>>>>>>>> compute poisson
>>>>>>>>>>> >> and ampere equation at the same time? One on CPU and the
>>>>>>>>>>> other on GPU?
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thanks
>>>>>>>>>>>
>>>>>>>>>> <out_gpu><out_nogpu>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>> <out_miniapp_f_poisson>
>>>>>
>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200817/d1c018c1/attachment-0001.html>