<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class=""><br class="">
</div>
<div class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Sep 24, 2020, at 1:11 PM, Barry Smith &lt;<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Sep 24, 2020, at 11:48 AM, Zhang, Chonglin &lt;<a href="mailto:zhangc20@rpi.edu" class="">zhangc20@rpi.edu</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Thanks Mark and Barry!
<div class=""><br class="">
</div>
<div class="">A quick try with “-pc_type jacobi” did reduce the “CpuToGpu” and “GpuToCpu” counts. Using “-pc_type gamg” did not reduce the counts, but it solves the problem faster. That may not mean much, since the problem size is too small; the function “DMPlexCreateFromCellListParallelPetsc()” is slow for large problem sizes, which prevents me from running larger problems, but that is a separate issue.</div>
<div class=""><br class="">
</div>
<div class="">Would these “CpuToGpu” and “GpuToCpu” data transfers contribute a significant amount of time for a realistically sized problem, for example a linear problem with ~1-2 million DOFs?</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
It depends on how often the copies are done. </div>
<div class=""><br class="">
</div>
<div class=""> With GAMG, once the preconditioner is built the entire linear solve can run on the GPU, and Mark has some good speedups of the linear solve using GAMG on the GPU instead of the CPU on Summit. </div>
<div class=""><br class="">
</div>
<div class=""> The speedup of the entire simulation will depend on the relative cost of the finite element matrix assembly vs. the linear solver time, and Amdahl's law kicks in. For example, if the finite element assembly takes 50 percent of the time, then even if the linear solve takes zero time one can only get a speedup of two, which is not much.</div>
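<div class=""><br class="">
</div>
<div class="">As a rough illustration of the Amdahl's law point above, here is a minimal Python sketch. The function name and the sample acceleration factors are made up for illustration; only the 50 percent assembly share comes from the example in the text.</div>
<pre class="">
```python
def amdahl_speedup(accelerated_fraction, acceleration_factor):
    """Overall speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime that is sped up (0 to 1).
    acceleration_factor: how many times faster that share becomes.
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / acceleration_factor)


# Assembly takes 50% of the time and stays on the CPU; the linear solve
# (the other 50%) is accelerated by increasingly large factors.
for factor in (2.0, 10.0, float("inf")):
    print(factor, round(amdahl_speedup(0.5, factor), 2))
```
</pre>
<div class="">Even with an infinitely fast linear solve, the overall speedup in this scenario is capped at two, because the assembly half of the runtime is untouched.</div>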
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Thanks for the detailed explanation Barry! </div>
<div><br class="">
</div>
<div>Mark: could you share the results of GAMG on GPU vs. CPU on Summit, or point me to where I could see them? (The actual code showing how you are doing this would be even better as a learning opportunity for me.) Thanks!</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class=""><br class="">
</div>
<div class="">Also, is there any plan to have the SNES and DMPlex code run on GPU?</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
Basically, the finite element computation for the nonlinear function and its Jacobian needs to run on the GPU. This is a big project that we've barely begun thinking about. If this is something you are interested in, it would be fantastic if you could take a
look at that.</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>I see. I will think about this, discuss internally and get back to you if I can!</div>
<div><br class="">
</div>
<div>Thanks!</div>
<div>Chonglin</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class=""><br class="">
</div>
<div class=""> Barry</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class=""><br class="">
</div>
<div class="">Thanks!</div>
<div class="">Chonglin</div>
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Sep 24, 2020, at 12:17 PM, Barry Smith &lt;<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class=""><br class="">
</div>
MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step.
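<div class=""><br class="">
</div>
<div class="">As a rough check of this against the truncated log in the original message quoted further down, the sketch below divides the copy counts by the number of MatSOR() calls. The numbers are taken from the MatSOR and VecCUDACopyTo rows of that log; the variable names are made up, and the "Size" column is assumed to be in megabytes.</div>
<pre class="">
```python
# Values copied from the MatSOR and VecCUDACopyTo rows of the -log_view
# output for 500 SNESSolve() calls.
matsor_calls = 30947
matsor_gpu_to_cpu_copies = 61863
matsor_gpu_to_cpu_size = 1.41e3   # "Size" column; assumed to be in MB
vec_cpu_to_gpu_copies = 30947     # VecCUDACopyTo row

# Copies per MatSOR() application in each direction.
copies_down_per_call = matsor_gpu_to_cpu_copies / matsor_calls
copies_up_per_call = vec_cpu_to_gpu_copies / matsor_calls

# Average size of one GPU-to-CPU copy, in KB (under the MB assumption).
avg_kb_per_copy = 1000.0 * matsor_gpu_to_cpu_size / matsor_gpu_to_cpu_copies

print(round(copies_down_per_call, 2))
print(round(copies_up_per_call, 2))
print(round(avg_kb_per_copy, 1))
```
</pre>
<div class="">Under these assumptions, each MatSOR() application corresponds to about two GPU-to-CPU copies and one CPU-to-GPU copy, each of them small, so the count grows with the number of preconditioner applications.</div>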
<div class=""><br class="">
</div>
<div class=""> You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is amenable to your problem.</div>
<div class=""><br class="">
</div>
<div class=""> Also, the problem is way too small for a GPU.</div>
<div class=""><br class="">
</div>
<div class=""> There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.</div>
<div class=""><br class="">
</div>
<div class=""> Barry</div>
<div class=""><br class="">
<div class="">
<div class=""><br class="">
</div>
<br class="">
<blockquote type="cite" class="">
<div class="">On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin &lt;<a href="mailto:zhangc20@rpi.edu" class="">zhangc20@rpi.edu</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<div class="">Dear PETSc Users,</div>
<div class=""><br class="">
</div>
<div class="">I have some questions regarding proper GPU usage. I would like to know the proper way to:</div>
<div class="">(1) solve a linear equation in SNES using the GPU in PETSc, and what syntax/arguments I should be using;</div>
<div class="">(2) avoid or reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown in the PETSc log file when using CUDA-aware MPI.</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Details of what I am doing now and my observations are below:</div>
<div class=""><br class="">
</div>
<div class="">System and compilers used:</div>
<div class="">(1) RPI’s AiMOS computer (node wise, it is the same as Summit);</div>
<div class="">(2) using GCC 7.4.0 and Spectrum-MPI 10.3.</div>
<div class=""><br class="">
</div>
<div class="">I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:</div>
<div class="">(1) using DMPlex to set up the unstructured mesh;</div>
<div class="">(2) using DM to create vector and matrix;</div>
<div class="">(3) using SNES interface to solve the linear Poisson equation, with “-snes_type ksponly”;</div>
<div class="">(4) using “-dm_vec_type cuda” and “-dm_mat_type aijcusparse” to use GPU vectors and matrices, as suggested on this webpage: <a href="https://www.mcs.anl.gov/petsc/features/gpus.html" class="">https://www.mcs.anl.gov/petsc/features/gpus.html</a></div>
<div class="">
<div class="">(5) using “-use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to `srun --smpiargs="-gpu"` on Summit): <a href="https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct" class="">https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct</a>; <a href="https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf" class="">https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf</a></div>
<div class="">(6) using “-options_left” to check and make sure all the arguments are accepted and used by PETSc.</div>
<div class="">(7) after problem setup, running “SNESSolve()” multiple times to solve the linear problem and inspecting the log produced by “-log_view”.</div>
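<div class=""><br class="">
</div>
<div class="">For reference, the options listed in steps (3) to (7) can be collected into a single command line. The Python sketch below only assembles and prints that command; the executable name “./poisson” and the rank count are hypothetical placeholders, and the launcher arguments may need adjusting for your MPI installation.</div>
<pre class="">
```python
# Hypothetical executable name and rank count; substitute your own.
executable = "./poisson"
num_ranks = 4

petsc_options = [
    "-snes_type", "ksponly",          # linear problem: only do the linear solve
    "-dm_vec_type", "cuda",           # GPU vectors created from the DM
    "-dm_mat_type", "aijcusparse",    # GPU sparse matrices created from the DM
    "-use_gpu_aware_mpi",             # tell PETSc the MPI library is GPU-aware
    "-options_left",                  # report any options PETSc did not use
    "-log_view",                      # print the performance summary at exit
]

# "mpirun -gpu" is the GPU-Direct launcher form mentioned in step (5).
command = ["mpirun", "-gpu", "-n", str(num_ranks), executable] + petsc_options
command_line = " ".join(command)
print(command_line)
```
</pre>
<div class="">Adding “-pc_type jacobi” or “-pc_type gamg” to the option list selects one of the preconditioners discussed in the replies.</div>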
<div class=""><br class="">
</div>
<div class="">I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the “CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.
See below for a truncated log corresponding to running SNESSolve() 500 times:</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class=""> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">---------------------------------------------------------------------------------------------------------------------------------------------------------------</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; min-height: 13px;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class=""></span><br class="">
</div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">--- Event Stage 0: Main Stage</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; min-height: 13px;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class=""></span><br class="">
</div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">BuildTwoSided 510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03 0 0 0 0 0 0 0 21 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">BuildTwoSidedF 501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">SNESSolve 500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05100100 0 0100 100100 0 0100 144 202 31947 7.82e+02 63363 1.44e+03 82</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">SNESSetUp 1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">SNESFunctionEval 500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02 12 3 0 0 0 12 3 0 0 0 36 13 0 0.00e+00 1000 2.48e+01 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">SNESJacobianEval 500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03 15 5 0 0 0 15 5 0 0 0 50 0 1000 7.77e+01 500 1.24e+01 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">DMPlexResidualFE 500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 3 0 0 0 10 3 0 0 0 39 0 0 0.00e+00 500 1.24e+01 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">DMPlexJacobianFE 500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03 14 5 0 0 0 14 5 0 0 0 52 0 1000 7.77e+01 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">MatSOR 30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1542 0 0 0.00e+00 61863 1.41e+03 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">MatAssemblyBegin 511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">MatAssemblyEnd 511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01 0 0 0 0 0 0 0 0 0 0 0 0 1002 7.80e+01 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
MatCUSPARSCopyTo 1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1002 7.80e+01 0 0.00e+00 0</div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">VecMDot 29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04 12 22 0 0 7 12 22 0 0 7 277 3236 29930 6.81e+02 0 0.00e+00 100</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">VecNorm 31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04 5 2 0 0 7 5 2 0 0 7 34 55 1017 2.31e+01 0 0.00e+00 100</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">VecNormalize 30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04 7 2 0 0 7 7 2 0 0 7 44 51 1017 2.31e+01 0 0.00e+00 100</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">VecCUDACopyTo 30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 30947 7.04e+02 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">VecCUDACopyFrom 63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 63363 1.44e+03 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">KSPSetUp 500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">KSPSolve 500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00 0.0e+00 8.6e+05 72 92 0 0 99 73 92 0 0 99 182 202 30947 7.04e+02 61863 1.41e+03 89</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">KSPGMRESOrthog 29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00 0.0e+00 6.4e+05 58 67 0 0 74 58 67 0 0 74 166 209 29930 6.81e+02 0 0.00e+00 100</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
PCApply 30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1527 0 0 0.00e+00 61863 1.41e+03 0</div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class="">
<br class="">
</div>
<div style="margin: 0px; font-stretch: normal; line-height: normal;" class=""><br class="">
</div>
<div style="margin: 0px; font-stretch: normal; line-height: normal;" class="">Thanks!</div>
<div style="margin: 0px; font-stretch: normal; line-height: normal;" class="">Chonglin</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>