[petsc-users] Proper GPU usage in PETSc
Matthew Knepley
knepley at gmail.com
Thu Sep 24 15:01:37 CDT 2020
On Thu, Sep 24, 2020 at 3:48 PM Zhang, Chonglin <zhangc20 at rpi.edu> wrote:
> Hi Matt,
>
> Thanks for the comments and the nice example code. Right now our objective
> is to use XGC unstructured flux-surface-following mesh (fixed in size), I
> will keep your comment on mesh refinement in mind.
>
Okay. If you load the mesh in parallel, then interpolation is done in
parallel too, so that is fine. The main time in parallel load is then the
redistribution of vertices. Vaclav wrote
a nice paper about it: https://arxiv.org/abs/2004.08729
Thanks,
Matt
> Thanks!
> Chonglin
>
> On Sep 24, 2020, at 3:26 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 3:08 PM Zhang, Chonglin <zhangc20 at rpi.edu> wrote:
>
>> Hi Matt,
>>
>> I will quickly summarize what I found with “CreateMesh" for running ex12
>> here:
>> https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c.
>> If this is not a proper threads to discuss this, I can open a new one.
>>
>> Commands used (relevant to mesh creation) to run ex12 (quad core desktop
>> computer with CPU only, 4 MPI ranks):
>> mpirun -np 4 -cells 100, 100, 0 -options_left -log_view
>> I built PETSc commit: 2bbfc05, dated Sep 23, 2020, with debug=no.
>>
>> Mesh size CreateMesh (seconds) DMPlexDistribute (seconds)
>> 100 *100 0.14 0.081
>> 500 *500 2.28 1.33
>> 1000*1000 10.1 5.10
>> 2000*1000 24.6 10.96
>> 2000*2000 73.7 22.72
>>
>> Is the performance reasonable for the “CreateMesh” functionality?
>>
>> Anything I am not doing correctly with DMPlex running ex12?
>>
>
> ex12 is a little old. I have been meaning to update it. ex13 does the same
> thing in a more modern way.
>
> Above looks reasonable I think. The CreateMesh time includes generating
> the mesh using Triangle, since simplex is the
> default. In example 12, you could use
>
> -simplex 0
>
> or in ex13
>
> -dm_plex_box_simplex 0
>
> to get hexes, which do not use a generator. Second, you are interpolating
> all on process 0, which is probably
> the bulk of the time. I do that because I do not care about parallel
> performance in the examples and it is simpler.
> You can also refine the mesh after distribution, which is faster, and cuts
> down on the distribution time. So if you
> want the whole thing, you could use
>
> DM odm, dm;
>
> /* Create a cell-vertex box mesh */
> ierr = DMPlexCreateBoxMesh(comm, 2, PETSC_TRUE, NULL, NULL, NULL, NULL,
> PETSC_FALSE, &odm);CHKERRQ(ierr);
> ierr = PetscObjectSetOptionsPrefix((PetscObject) dm,
> "orig_");CHKERRQ(ierr);
> /* Distributes the mesh here */
> ierr = DMSetFromOptions(odm);CHKERRQ(ierr);
> /* Interpolate the mesh */
> ierr = DMPlexInterpolate(odm, &dm);CHKERRQ(ierr);
> ierr = DMDestroy(&odm);CHKERRQ(ierr);
> /* Refine the mesh */
> ierr = DMSetFromOptions(dm);CHKERRQ(ierr);
>
> and run with
>
> -dm_plex_box_simplex 0 -dm_plex_box_faces 100,100 -orig_dm_distribute
> -dm_refine 3
>
> Thanks,
>
> Matt
>
> Thanks!
>> Chonglin
>>
>> On Sep 24, 2020, at 2:06 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhangc20 at rpi.edu>
>>>> wrote:
>>>>
>>>>> Thanks Mark and Barry!
>>>>>
>>>>> A quick try of using “-pc_type jacobi” did reduce the number of count
>>>>> for “CpuToGpu” and “GpuToCpu”, although using “-pc_type gamg” (the counts
>>>>> did not decrease in this case) solves the problem faster (may not be of any
>>>>> meaning since the problem size is too small; the function
>>>>> “DMPlexCreateFromCellListParallelPetsc()" is slow for large problem size
>>>>> preventing running larger problems, separate issue).
>>>>>
>>>>
>>>> It sounds like something is wrong then, or I do not understand what you
>>>> mean by slow.
>>>>
>>>
>>> sor may be the default so you need to set the -mg_level_ksp[pc]_type
>>> chebyshev[jacobi]. chebyshev is the ksp default.
>>>
>>
>> I meant for the mesh creation.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a
>>>>> significant amount of time for a realistic sized problem, say for example a
>>>>> linear problem with ~1-2 million DOFs?
>>>>>
>>>>> Also, is there any plan to have the SNES and DMPlex code run on GPU?
>>>>>
>>>>> Thanks!
>>>>> Chonglin
>>>>>
>>>>> On Sep 24, 2020, at 12:17 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>
>>>>> MatSOR() runs on the CPU, this causes copy to CPU for each
>>>>> application of MatSOR() and then a copy to GPU for the next step.
>>>>>
>>>>> You can try, for example -pc_type jacobi better yet use PCGAMG if
>>>>> it amenable for your problem.
>>>>>
>>>>> Also the problem is way to small for a GPU.
>>>>>
>>>>> There will be copies between the GPU/CPU for each SNES iteration
>>>>> since the DMPLEX code does not run on GPUs.
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>>
>>>>> On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhangc20 at rpi.edu>
>>>>> wrote:
>>>>>
>>>>> Dear PETSc Users,
>>>>>
>>>>> I have some questions regarding the proper GPU usage. I would like to
>>>>> know the proper way to:
>>>>> (1) solve linear equation in SNES, using GPU in PETSc; what
>>>>> syntax/arguments should I be using;
>>>>> (2) how to avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data
>>>>> transfer showed in PETSc log file, when using CUDA aware MPI.
>>>>>
>>>>>
>>>>> Details of what I am doing now and my observations are below:
>>>>>
>>>>> System and compilers used:
>>>>> (1) RPI’s AiMOS computer (node wise, it is the same as Summit);
>>>>> (2) using GCC 7.4.0 and Spectrum-MPI 10.3.
>>>>>
>>>>> I am doing the followings to solve the linear Poisson equation with
>>>>> SNES interface, under DMPlex:
>>>>> (1) using DMPlex to set up the unstructured mesh;
>>>>> (2) using DM to create vector and matrix;
>>>>> (3) using SNES interface to solve the linear Poisson equation, with
>>>>> “-snes_type ksponly”;
>>>>> (4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU
>>>>> vector and matrix, as suggested in this webpage:
>>>>> https://www.mcs.anl.gov/petsc/features/gpus.html
>>>>> (5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to
>>>>> enable GPU-Direct ( similar as "srun --smpiargs=“-gpu”" for Summit):
>>>>> https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct;
>>>>> https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
>>>>> (6) using “-options_left” to check and make sure all the arguments are
>>>>> accepted and used by PETSc.
>>>>> (7) After problem setup, I am running the “SNESSolve()” multiple times
>>>>> to solve the linear problem and observe the log file with “-log_view"
>>>>>
>>>>> I noticed that if I run “SNESSolve()” 500 times, instead of 50 times,
>>>>> the “CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for
>>>>> some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo,
>>>>> VecCUDACopyFrom, MatCUSPARSCopyTo. See below for a truncated log
>>>>> corresponding to running SNESSolve() 500 times:
>>>>>
>>>>>
>>>>> Event Count Time (sec) Flop
>>>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - -
>>>>> GpuToCpu - GPU
>>>>> Max Ratio Max Ratio Max Ratio Mess
>>>>> AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size
>>>>> Count Size %F
>>>>>
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>> BuildTwoSided 510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01
>>>>> 4.0e+00 1.0e+03 0 0 0 0 0 0 0 21 0 0 0 0 0
>>>>> 0.00e+00 0 0.00e+00 0
>>>>> BuildTwoSidedF 501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 1.0e+03 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 0.00e+00 0 0.00e+00 0
>>>>> SNESSolve 500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00
>>>>> 0.0e+00 8.7e+05100100 0 0100 100100 0 0100 144 202 31947
>>>>> 7.82e+02 63363 1.44e+03 82
>>>>> SNESSetUp 1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 0.00e+00 0 0.00e+00 0
>>>>> SNESFunctionEval 500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00
>>>>> 0.0e+00 5.0e+02 12 3 0 0 0 12 3 0 0 0 36 13 0
>>>>> 0.00e+00 1000 2.48e+01 0
>>>>> SNESJacobianEval 500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00
>>>>> 0.0e+00 2.0e+03 15 5 0 0 0 15 5 0 0 0 50 0 1000
>>>>> 7.77e+01 500 1.24e+01 0
>>>>> DMPlexResidualFE 500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 10 3 0 0 0 10 3 0 0 0 39 0 0
>>>>> 0.00e+00 500 1.24e+01 0
>>>>> DMPlexJacobianFE 500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00
>>>>> 0.0e+00 2.0e+03 14 5 0 0 0 14 5 0 0 0 52 0 1000
>>>>> 7.77e+01 0 0.00e+00 0
>>>>> MatSOR 30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1542 0 0
>>>>> 0.00e+00 61863 1.41e+03 0
>>>>> MatAssemblyBegin 511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 2.0e+03 1 0 0 0 0 1 0 0 0 0 0 0 0
>>>>> 0.00e+00 0 0.00e+00 0
>>>>> MatAssemblyEnd 511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 2.1e+01 0 0 0 0 0 0 0 0 0 0 0 0 1002
>>>>> 7.80e+01 0 0.00e+00 0
>>>>> MatCUSPARSCopyTo 1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1002
>>>>> 7.80e+01 0 0.00e+00 0
>>>>> VecMDot 29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00
>>>>> 0.0e+00 6.0e+04 12 22 0 0 7 12 22 0 0 7 277 3236 29930
>>>>> 6.81e+02 0 0.00e+00 100
>>>>> VecNorm 31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00
>>>>> 0.0e+00 6.3e+04 5 2 0 0 7 5 2 0 0 7 34 55 1017
>>>>> 2.31e+01 0 0.00e+00 100
>>>>> VecNormalize 30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00
>>>>> 0.0e+00 6.2e+04 7 2 0 0 7 7 2 0 0 7 44 51 1017
>>>>> 2.31e+01 0 0.00e+00 100
>>>>> VecCUDACopyTo 30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 0 30947
>>>>> 7.04e+02 0 0.00e+00 0
>>>>> VecCUDACopyFrom 63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 0.00e+00 63363 1.44e+03 0
>>>>> KSPSetUp 500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00
>>>>> 0.0e+00 5.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 0.00e+00 0 0.00e+00 0
>>>>> KSPSolve 500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00
>>>>> 0.0e+00 8.6e+05 72 92 0 0 99 73 92 0 0 99 182 202 30947
>>>>> 7.04e+02 61863 1.41e+03 89
>>>>> KSPGMRESOrthog 29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00
>>>>> 0.0e+00 6.4e+05 58 67 0 0 74 58 67 0 0 74 166 209 29930
>>>>> 6.81e+02 0 0.00e+00 100
>>>>> PCApply 30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 1 10 0 0 0 1 10 0 0 0 1527 0 0
>>>>> 0.00e+00 61863 1.41e+03 0
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Chonglin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20200924/e7d35916/attachment-0001.html>
More information about the petsc-users
mailing list