[petsc-users] Proper GPU usage in PETSc

Thu Sep 24 13:15:19 CDT 2020

On Thu, Sep 24, 2020 at 1:26 PM Zhang, Chonglin <zhangc20 at rpi.edu> wrote:

>
>
> On Sep 24, 2020, at 1:11 PM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
>
> On Sep 24, 2020, at 11:48 AM, Zhang, Chonglin <zhangc20 at rpi.edu> wrote:
>
> Thanks Mark and Barry!
>
> A quick try of using “-pc_type jacobi” did reduce the number of count for
> “CpuToGpu” and “GpuToCpu”, although using “-pc_type gamg” (the counts did
> not decrease in this case) solves the problem faster (may not be of any
> meaning since the problem size is too small; the function
> “DMPlexCreateFromCellListParallelPetsc()" is slow for large problem size
> preventing running larger problems, separate issue).
>
> Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a
> significant amount of time for a realistic sized problem, say for example a
> linear problem with ~1-2 million DOFs?
>
>
>    It depends on how often the copies are done.
>
>    With GAMG once the preconditioner is built the entire linear solve can
> run on the GPU and Mark has some good speed ups of the liner solve using
> GAMG on the GPU instead of the CPU on Summit.
>
>    The speedup of the entire simulation will depend on the relative cost
> of the finite element matrix assembly vs the linear solver time and
> Amdahl's law kicks in so, for example, if the finite element assembly takes
> 50 percent of the time even if the linear solve takes 0 time one cannot
> only get a speedup of two which is not much.
>
>
> Thanks for the detailed explanation Barry!
>
> Mark: could you share the results of GAMG on GPU vs CPU on Summit, or
> pointing to me where I could see them. (Actual code how you are doing this
> would be even better as a learning opportunity for me). Thanks!
>

Here are a few plots. These use snes/ex13. I am actually testing that now
after many months and its not quite right, but here is how its run on
SUMMIT:

12:46 master= ~/petsc/src/snes/tutorials$ make
PETSC_DIR=/ccs/home/adams/petsc PETSC_ARCH=arch-summit-opt-gnu-cuda-omp -f
mymake run NP=4 EXTRA='-dm_refine 2 -dm_view'
jsrun -n 1 -a 4 -c 4 -g 1 -r 1 --smpiargs "-gpu" ./ex13 -dm_plex_box_dim 3
-dm_plex_box_simplex 0 -potential_petscspace_degree 1 -dm_refine 1
-ksp_type cg -ksp_rtol 1.e-11 -ksp_norm_type unpreconditioned -pc_type gamg
-pc_gamg_type agg -pc_gamg_agg_nsmooths 1 -pc_gamg_coarse_eq_limit 1000
-pc_gamg_square_graph 1 -pc_gamg_threshold 0.05 -pc_gamg_threshold_scale .0
-mg_levels_ksp_type chebyshev -mg_levels_ksp_max_it 1
-mg_levels_esteig_ksp_type cg -mg_levels_esteig_ksp_max_it 10
-mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05 -mg_levels_pc_type jacobi
-ksp_monitor -dm_refine 2 -dm_view -vec_type cuda -mat_type aijcusparse
DM Object: Mesh 4 MPI processes
  type: plex
Mesh in 3 dimensions:
  0-cells: 125 0 0 0
  1-cells: 300 0 0 0
  2-cells: 240 0 0 0
  3-cells: 64 0 0 0
Labels:
  celltype: 4 strata with value/size (0 (125), 1 (300), 4 (240), 7 (64))
  depth: 4 strata with value/size (0 (125), 1 (300), 2 (240), 3 (64))
  marker: 1 strata with value/size (1 (378))
  Face Sets: 6 strata with value/size (6 (49), 5 (49), 3 (49), 4 (49), 1
(49), 2 (49))
    0 KSP Residual norm 4.381403663299e+00
    1 KSP Residual norm 5.547944212592e-01
    2 KSP Residual norm 8.282166658954e-02
    3 KSP Residual norm 6.363083491776e-03
    4 KSP Residual norm 1.177626983807e-03
    5 KSP Residual norm 2.448983144610e-04
    6 KSP Residual norm 4.343194953221e-05
    7 KSP Residual norm 2.341208781337e-18

>
>
> Also, is there any plan to have the SNES and DMPlex code run on GPU?
>
>
>   Basically the finite element computation for the nonlinear function and
> its Jacobian need to run on the GPU, this is a big project that we've
> barely begun thinking about. If this is something you are interested in it
> would be fantastic if you could take a look at that.
>
>
> I see. I will think about this, discuss internally and get back to you if
> I can!
>
> Thanks!
> Chonglin
>
>
>   Barry
>
>
>
>
> Thanks!
> Chonglin
>
> On Sep 24, 2020, at 12:17 PM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
>    MatSOR() runs on the CPU, this causes copy to CPU for each application
> of MatSOR() and then a copy to GPU for the next step.
>
>    You can try, for example -pc_type jacobi  better yet use PCGAMG if it
> amenable for your problem.
>
>    Also the problem is way to small for a GPU.
>
>   There will be copies between the GPU/CPU for each SNES iteration since
> the DMPLEX code does not run on GPUs.
>
>    Barry
>
>
>
> On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhangc20 at rpi.edu> wrote:
>
> Dear PETSc Users,
>
> I have some questions regarding the proper GPU usage. I would like to know
> the proper way to:
> (1) solve linear equation in SNES, using GPU in PETSc; what
> syntax/arguments should I be using;
> (2) how to avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data
> transfer showed in PETSc log file, when using CUDA aware MPI.
>
>
> Details of what I am doing now and my observations are below:
>
> System and compilers used:
> (1) RPI’s AiMOS computer (node wise, it is the same as Summit);
> (2) using GCC 7.4.0 and Spectrum-MPI 10.3.
>
> I am doing the followings to solve the linear Poisson equation with SNES
> interface, under DMPlex:
> (1) using DMPlex to set up the unstructured mesh;
> (2) using DM to create vector and matrix;
> (3) using SNES interface to solve the linear Poisson equation, with
> “-snes_type ksponly”;
> (4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU vector
> and matrix, as suggested in this webpage:
> https://www.mcs.anl.gov/petsc/features/gpus.html
> (5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to
> enable GPU-Direct ( similar as "srun --smpiargs=“-gpu”" for Summit):
> https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct;
> https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
> (6) using “-options_left” to check and make sure all the arguments are
> accepted and used by PETSc.
> (7) After problem setup, I am running the “SNESSolve()” multiple times to
> solve the linear problem and observe the log file with “-log_view"
>
> I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the
> “CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for
> some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo,
> VecCUDACopyFrom, MatCUSPARSCopyTo. See below for a truncated log
> corresponding to running SNESSolve() 500 times:
>
>
> Event                Count      Time (sec)     Flop
>       --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   -
> GpuToCpu - GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count
>   Size  %F
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided        510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00
> 1.0e+03  0  0  0  0  0   0  0 21  0  0     0       0      0 0.00e+00    0
> 0.00e+00  0
> BuildTwoSidedF       501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+03  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
> 0.00e+00  0
> SNESSolve            500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00
> 8.7e+05100100  0  0100 100100  0  0100   144     202   31947 7.82e+02 63363
> 1.44e+03 82
> SNESSetUp              1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
> 0.00e+00  0
> SNESFunctionEval     500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00
> 5.0e+02 12  3  0  0  0  12  3  0  0  0    36      13      0 0.00e+00 1000
> 2.48e+01  0
> SNESJacobianEval     500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00
> 2.0e+03 15  5  0  0  0  15  5  0  0  0    50       0   1000 7.77e+01  500
> 1.24e+01  0
> DMPlexResidualFE     500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 10  3  0  0  0  10  3  0  0  0    39       0      0 0.00e+00  500
> 1.24e+01  0
> DMPlexJacobianFE     500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00
> 2.0e+03 14  5  0  0  0  14  5  0  0  0    52       0   1000 7.77e+01    0
> 0.00e+00  0
> MatSOR             30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  1 10  0  0  0   1 10  0  0  0  1542       0      0 0.00e+00 61863
> 1.41e+03  0
> MatAssemblyBegin     511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+03  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00    0
> 0.00e+00  0
> MatAssemblyEnd       511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.1e+01  0  0  0  0  0   0  0  0  0  0     0       0   1002 7.80e+01    0
> 0.00e+00  0
> MatCUSPARSCopyTo    1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0   1002 7.80e+01    0
> 0.00e+00  0
> VecMDot            29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00
> 6.0e+04 12 22  0  0  7  12 22  0  0  7   277    3236   29930 6.81e+02    0
> 0.00e+00 100
> VecNorm            31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00
> 6.3e+04  5  2  0  0  7   5  2  0  0  7    34      55   1017 2.31e+01    0
> 0.00e+00 100
> VecNormalize       30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00
> 6.2e+04  7  2  0  0  7   7  2  0  0  7    44      51   1017 2.31e+01    0
> 0.00e+00 100
> VecCUDACopyTo      30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0       0   30947 7.04e+02    0
> 0.00e+00  0
> VecCUDACopyFrom    63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00 63363
> 1.44e+03  0
> KSPSetUp             500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 5.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
> 0.00e+00  0
> KSPSolve             500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00 0.0e+00
> 8.6e+05 72 92  0  0 99  73 92  0  0 99   182     202   30947 7.04e+02 61863
> 1.41e+03 89
> KSPGMRESOrthog     29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00 0.0e+00
> 6.4e+05 58 67  0  0 74  58 67  0  0 74   166     209   29930 6.81e+02    0
> 0.00e+00 100
> PCApply            30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  1 10  0  0  0   1 10  0  0  0  1527       0      0 0.00e+00 61863
> 1.41e+03  0
>
>
> Thanks!
> Chonglin
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200924/7067d4f8/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: weak_scaling_cpu_a12g3r2.png
Type: image/png
Size: 67697 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200924/7067d4f8/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: weak_scaling_gpu_a12g3r2.png
Type: image/png
Size: 66173 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200924/7067d4f8/attachment-0003.png>