<div dir="ltr"><div dir="ltr">On Thu, Sep 24, 2020 at 3:08 PM Zhang, Chonglin <<a href="mailto:zhangc20@rpi.edu">zhangc20@rpi.edu</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;">
Hi Matt,
<div><br>
</div>
<div>I will quickly summarize what I found with “CreateMesh” when running ex12 here:
<a href="https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c" target="_blank">
https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c</a>. If this is not the proper thread to discuss this, I can open a new one.</div>
<div><br>
</div>
<div>Command used (options relevant to mesh creation) to run ex12 on a quad-core desktop computer, CPU only, with 4 MPI ranks:</div>
<div>mpirun -np 4 ./ex12 -cells 100,100,0 -options_left -log_view</div>
<div>I built PETSc commit: 2bbfc05, dated Sep 23, 2020, with debug=no.</div>
<div><br>
</div>
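<table>
<tr><th>Mesh size</th><th>CreateMesh (seconds)</th><th>DMPlexDistribute (seconds)</th></tr>
<tr><td>100*100</td><td>0.14</td><td>0.081</td></tr>
<tr><td>500*500</td><td>2.28</td><td>1.33</td></tr>
<tr><td>1000*1000</td><td>10.1</td><td>5.10</td></tr>
<tr><td>2000*1000</td><td>24.6</td><td>10.96</td></tr>
<tr><td>2000*2000</td><td>73.7</td><td>22.72</td></tr>
</table>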
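<div>As a rough check on the timings above, one can fit a power law t ~ N^p between the smallest and the largest mesh, where N is the product of the two mesh dimensions. The following is only a sketch in Python, not part of PETSc or ex12; the helper name scaling_exponent is hypothetical, and N is assumed to be proportional to the actual cell count.</div>

```python
import math

# (cells in x, cells in y, CreateMesh seconds, DMPlexDistribute seconds),
# copied from the timings reported above.
timings = [
    (100, 100, 0.14, 0.081),
    (500, 500, 2.28, 1.33),
    (1000, 1000, 10.1, 5.10),
    (2000, 1000, 24.6, 10.96),
    (2000, 2000, 73.7, 22.72),
]


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Exponent p in t ~ n**p fitted through two (size, time) points."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


first, last = timings[0], timings[-1]
n_first = first[0] * first[1]  # assumed proportional to the cell count
n_last = last[0] * last[1]

create_p = scaling_exponent(n_first, first[2], n_last, last[2])
distribute_p = scaling_exponent(n_first, first[3], n_last, last[3])

print(f"CreateMesh exponent:       {create_p:.2f}")
print(f"DMPlexDistribute exponent: {distribute_p:.2f}")
```

<div>Both exponents come out close to 1, so over this range the measured times grow roughly linearly with the mesh size rather than, say, quadratically.</div>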
<div><br>
</div>
<div>Is the performance reasonable for the “CreateMesh” functionality?</div>
<div><br>
</div>
<div>Anything I am not doing correctly with DMPlex running ex12?</div></div></blockquote><div><br></div><div>ex12 is a little old. I have been meaning to update it. ex13 does the same thing in a more modern way.</div><div><br></div><div>The above looks reasonable, I think. The CreateMesh time includes generating the mesh using Triangle, since simplex is the default. In ex12, you could use</div><div><br></div><div> -simplex 0</div><div><br></div><div>or in ex13</div><div><br></div><div> -dm_plex_box_simplex 0</div><div><br></div><div>to get tensor-product cells (quads in 2D, hexes in 3D), which do not use a generator. Second, you are interpolating everything on process 0, which is probably the bulk of the time. I do that because I do not care about parallel performance in the examples and it is simpler. You can also refine the mesh after distribution, which is faster and cuts down on the distribution time. So if you want the whole thing, you could use</div><div><br></div><div> DM odm, dm;</div><div><br></div><div> /* Create a cell-vertex box mesh */</div><div> ierr = DMPlexCreateBoxMesh(comm, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_FALSE, &odm);CHKERRQ(ierr);<br></div><div> ierr = PetscObjectSetOptionsPrefix((PetscObject) odm, "orig_");CHKERRQ(ierr);</div><div> /* Distribute the mesh here (via -orig_dm_distribute) */</div><div> ierr = DMSetFromOptions(odm);CHKERRQ(ierr);</div><div> /* Interpolate the mesh */</div><div> ierr = DMPlexInterpolate(odm, &dm);CHKERRQ(ierr);</div><div> ierr = DMDestroy(&odm);CHKERRQ(ierr);</div><div> /* Refine the mesh */</div><div> ierr = DMSetFromOptions(dm);CHKERRQ(ierr);</div><div><br></div><div>and run with</div><div><br></div><div> -dm_plex_box_simplex 0 -dm_plex_box_faces 100,100 -orig_dm_distribute -dm_refine 3</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;">
<div>Thanks!</div>
<div>Chonglin</div>
<div>
<div><br>
</div>
<div>
<blockquote type="cite">
<div>On Sep 24, 2020, at 2:06 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:</div>
<br>
<div>
<div dir="ltr" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">
<div dir="ltr">On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <<a href="mailto:zhangc20@rpi.edu" target="_blank">zhangc20@rpi.edu</a>> wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Thanks Mark and Barry!
<div><br>
</div>
<div>A quick try of “-pc_type jacobi” did reduce the counts for “CpuToGpu” and “GpuToCpu”. Using “-pc_type gamg” did not decrease the counts, although it solves the problem faster (this may not mean much since the problem
size is too small; the function “DMPlexCreateFromCellListParallelPetsc()” is slow for large problem sizes, which prevents me from running larger problems, but that is a separate issue).</div>
</div>
</blockquote>
<div><br>
</div>
<div>It sounds like something is wrong then, or I do not understand what you mean by slow.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>sor may be the default, so you need to set -mg_levels_ksp_type chebyshev and -mg_levels_pc_type jacobi. chebyshev is the ksp default.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>I meant for the mesh creation.</div>
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a significant amount of time for a realistic sized problem, say for example a linear problem with ~1-2 million DOFs?</div>
<div><br>
</div>
<div>Also, is there any plan to have the SNES and DMPlex code run on GPU?</div>
<div><br>
</div>
<div>Thanks!</div>
<div>Chonglin</div>
<div>
<div><br>
<blockquote type="cite">
<div>On Sep 24, 2020, at 12:17 PM, Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:</div>
<br>
<div>
<div>
<div><br>
</div>
MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step.
<div><br>
</div>
<div> You can try, for example, -pc_type jacobi; better yet, use PCGAMG if it is amenable to your problem.</div>
<div><br>
</div>
<div> Also the problem is way too small for a GPU.</div>
<div><br>
</div>
<div> There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.</div>
<div><br>
</div>
<div> Barry</div>
<div><br>
<div>
<div><br>
</div>
<br>
<blockquote type="cite">
<div>On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <<a href="mailto:zhangc20@rpi.edu" target="_blank">zhangc20@rpi.edu</a>> wrote:</div>
<br>
<div>
<div>
<div>Dear PETSc Users,</div>
<div><br>
</div>
<div>I have some questions regarding the proper GPU usage. I would like to know the proper way to:</div>
<div>(1) solve a linear equation in SNES using the GPU in PETSc; what syntax/arguments should I be using;</div>
<div>(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown in the PETSc log file when using CUDA-aware MPI.</div>
<div><br>
</div>
<div><br>
</div>
<div>Details of what I am doing now and my observations are below:</div>
<div><br>
</div>
<div>System and compilers used:</div>
<div>(1) RPI’s AiMOS computer (node wise, it is the same as Summit);</div>
<div>(2) using GCC 7.4.0 and Spectrum-MPI 10.3.</div>
<div><br>
</div>
<div>I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:</div>
<div>(1) using DMPlex to set up the unstructured mesh;</div>
<div>(2) using DM to create vector and matrix;</div>
<div>(3) using SNES interface to solve the linear Poisson equation, with “-snes_type ksponly”;</div>
<div>(4) using “-dm_vec_type cuda” and “-dm_mat_type aijcusparse” to use GPU vectors and matrices, as suggested on this webpage: <a href="https://www.mcs.anl.gov/petsc/features/gpus.html" target="_blank">https://www.mcs.anl.gov/petsc/features/gpus.html</a></div>
<div>
<div>(5) using “-use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to “srun --smpiargs="-gpu"” for Summit): <a href="https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct" target="_blank">https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct</a>; <a href="https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf" target="_blank">https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf</a></div>
<div>(6) using “-options_left” to check and make sure all the arguments are accepted and used by PETSc.</div>
<div>(7) After problem setup, I am running “SNESSolve()” multiple times to solve the linear problem and inspecting the log file produced with “-log_view”.</div>
<div><br>
</div>
<div>I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the “CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.
See below for a truncated log corresponding to running SNESSolve() 500 times:</div>
<div><br>
</div>
<div><br>
</div>
<div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures"> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">---------------------------------------------------------------------------------------------------------------------------------------------------------------</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;min-height:13px">
<span style="font-variant-ligatures:no-common-ligatures"></span><br>
</div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">--- Event Stage 0: Main Stage</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;min-height:13px">
<span style="font-variant-ligatures:no-common-ligatures"></span><br>
</div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">BuildTwoSided 510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03 0 0 0 0 0 0 0 21 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">BuildTwoSidedF 501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">SNESSolve 500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05100100 0 0100 100100 0 0100 144 202 31947 7.82e+02 63363 1.44e+03 82</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">SNESSetUp 1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">SNESFunctionEval 500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02 12 3 0 0 0 12 3 0 0 0 36 13 0 0.00e+00 1000 2.48e+01 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">SNESJacobianEval 500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03 15 5 0 0 0 15 5 0 0 0 50 0 1000 7.77e+01 500 1.24e+01 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">DMPlexResidualFE 500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 3 0 0 0 10 3 0 0 0 39 0 0 0.00e+00 500 1.24e+01 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">DMPlexJacobianFE 500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03 14 5 0 0 0 14 5 0 0 0 52 0 1000 7.77e+01 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">MatSOR 30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1542 0 0 0.00e+00 61863 1.41e+03 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">MatAssemblyBegin 511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">MatAssemblyEnd 511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01 0 0 0 0 0 0 0 0 0 0 0 0 1002 7.80e+01 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
MatCUSPARSCopyTo 1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1002 7.80e+01 0 0.00e+00 0</div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">VecMDot 29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04 12 22 0 0 7 12 22 0 0 7 277 3236 29930 6.81e+02 0 0.00e+00 100</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">VecNorm 31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04 5 2 0 0 7 5 2 0 0 7 34 55 1017 2.31e+01 0 0.00e+00 100</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">VecNormalize 30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04 7 2 0 0 7 7 2 0 0 7 44 51 1017 2.31e+01 0 0.00e+00 100</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">VecCUDACopyTo 30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 30947 7.04e+02 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">VecCUDACopyFrom 63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 63363 1.44e+03 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">KSPSetUp 500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">KSPSolve 500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00 0.0e+00 8.6e+05 72 92 0 0 99 73 92 0 0 99 182 202 30947 7.04e+02 61863 1.41e+03 89</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<span style="font-variant-ligatures:no-common-ligatures">KSPGMRESOrthog 29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00 0.0e+00 6.4e+05 58 67 0 0 74 58 67 0 0 74 166 209 29930 6.81e+02 0 0.00e+00 100</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
PCApply 30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1527 0 0 0.00e+00 61863 1.41e+03 0</div>
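<div>The transfer counts in the log above can be related to the number of MatSOR() applications with a few divisions. This is only a sketch in Python using numbers copied from the rows above; the variable names are made up for illustration, and the interpretation (copies triggered by MatSOR() running on the CPU) is an assumption that the ratios support but do not prove.</div>

```python
# Counts copied from the -log_view rows above (500 calls to SNESSolve()).
snes_solves = 500
matsor_calls = 30947
matsor_gpu_to_cpu = 61863   # GpuToCpu count on the MatSOR row
vec_copy_to_gpu = 30947     # CpuToGpu count on the VecCUDACopyTo row
snes_gpu_to_cpu = 63363     # GpuToCpu count on the SNESSolve row

# Device-to-host copies per MatSOR() application.
copies_to_cpu_per_sor = matsor_gpu_to_cpu / matsor_calls
# Host-to-device vector copies per MatSOR() application.
copies_to_gpu_per_sor = vec_copy_to_gpu / matsor_calls
# Fraction of all GpuToCpu copies inside SNESSolve attributed to MatSOR.
sor_share = matsor_gpu_to_cpu / snes_gpu_to_cpu
# Average number of GpuToCpu copies per SNESSolve() call.
gpu_to_cpu_per_solve = snes_gpu_to_cpu / snes_solves

print(f"GpuToCpu copies per MatSOR: {copies_to_cpu_per_sor:.2f}")
print(f"CpuToGpu copies per MatSOR: {copies_to_gpu_per_sor:.2f}")
print(f"MatSOR share of GpuToCpu:   {sor_share:.1%}")
print(f"GpuToCpu copies per solve:  {gpu_to_cpu_per_solve:.1f}")
```

<div>The ratios work out to about two GpuToCpu copies and one CpuToGpu copy per MatSOR() application, with MatSOR accounting for nearly all GpuToCpu copies inside SNESSolve. Since the counts are per application, they scale with the number of solves, which matches the roughly tenfold increase between 50 and 500 calls.</div>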
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo">
<br>
</div>
<div style="margin:0px;font-stretch:normal;line-height:normal"><br>
</div>
<div style="margin:0px;font-stretch:normal;line-height:normal">Thanks!</div>
<div style="margin:0px;font-stretch:normal;line-height:normal">Chonglin</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
--<span> </span><br>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener</div>
<div><br>
</div>
<div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>