[petsc-users] Proper GPU usage in PETSc
Matthew Knepley
knepley at gmail.com
Thu Sep 24 13:06:29 CDT 2020
On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfadams at lbl.gov> wrote:
> On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhangc20 at rpi.edu>
>> wrote:
>>
>>> Thanks Mark and Barry!
>>>
>>> A quick try of using “-pc_type jacobi” did reduce the number of count
>>> for “CpuToGpu” and “GpuToCpu”, although using “-pc_type gamg” (the counts
>>> did not decrease in this case) solves the problem faster (may not be of any
>>> meaning since the problem size is too small; the function
>>> “DMPlexCreateFromCellListParallelPetsc()" is slow for large problem size
>>> preventing running larger problems, separate issue).
>>>
>>
>> It sounds like something is wrong then, or I do not understand what you
>> mean by slow.
>>
>
> SOR may be the default smoother, so you need to set -mg_levels_pc_type jacobi
> (and -mg_levels_ksp_type chebyshev, though chebyshev is already the KSP default).
>
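For example, Mark's suggestion would look something like this on the command line
(just a sketch; the executable name and process count are placeholders, and -gpu is
the Spectrum-MPI launcher flag mentioned further below):

  mpirun -gpu -np 2 ./app -dm_vec_type cuda -dm_mat_type aijcusparse \
      -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi \
      -log_view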
I meant for the mesh creation.
Thanks,
Matt
>> Thanks,
>>
>> Matt
>>
>>
>>> Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a
>>> significant amount of time for a realistically sized problem, say a
>>> linear problem with ~1-2 million DOFs?
>>>
>>> Also, is there any plan to have the SNES and DMPlex code run on GPU?
>>>
>>> Thanks!
>>> Chonglin
>>>
>>> On Sep 24, 2020, at 12:17 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>
>>> MatSOR() runs on the CPU; this causes a copy to the CPU for each
>>> application of MatSOR() and then a copy back to the GPU for the next step.
>>>
>>> You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it
>>> is amenable to your problem.
>>>
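>>> For example, a minimal sketch of making that switch in code rather than on
>>> the command line (the helper name is just for illustration; it assumes a
>>> SNES that is already set up, and PCJACOBI/PCGAMG are the standard PETSc
>>> type names):
>>>
>>>   #include <petscsnes.h>
>>>
>>>   static PetscErrorCode UseJacobiPC(SNES snes)
>>>   {
>>>     KSP            ksp;
>>>     PC             pc;
>>>     PetscErrorCode ierr;
>>>
>>>     PetscFunctionBeginUser;
>>>     ierr = SNESGetKSP(snes, &ksp);CHKERRQ(ierr);  /* the KSP used inside the SNES */
>>>     ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>>>     ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr); /* or PCGAMG, if it suits the problem */
>>>     PetscFunctionReturn(0);
>>>   }
>>>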
>>> Also, the problem is way too small for a GPU.
>>>
>>> There will be copies between the GPU/CPU for each SNES iteration since
>>> the DMPLEX code does not run on GPUs.
>>>
>>> Barry
>>>
>>>
>>>
>>> On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhangc20 at rpi.edu> wrote:
>>>
>>> Dear PETSc Users,
>>>
>>> I have some questions regarding proper GPU usage. I would like to know the
>>> proper way to:
>>> (1) solve a linear equation with SNES using the GPU in PETSc; what
>>> syntax/arguments should I be using?
>>> (2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers
>>> shown in the PETSc log file when using CUDA-aware MPI.
>>>
>>>
>>> Details of what I am doing now and my observations are below:
>>>
>>> System and compilers used:
>>> (1) RPI’s AiMOS computer (node wise, it is the same as Summit);
>>> (2) using GCC 7.4.0 and Spectrum-MPI 10.3.
>>>
>>> I am doing the following to solve the linear Poisson equation with the SNES
>>> interface, under DMPlex (a minimal sketch of this setup follows the list):
>>> (1) using DMPlex to set up the unstructured mesh;
>>> (2) using the DM to create the vector and matrix;
>>> (3) using the SNES interface to solve the linear Poisson equation, with
>>> “-snes_type ksponly”;
>>> (4) using “-dm_vec_type cuda” and “-dm_mat_type aijcusparse” to get GPU
>>> vectors and matrices, as suggested on this webpage:
>>> https://www.mcs.anl.gov/petsc/features/gpus.html
>>> (5) using “-use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to
>>> enable GPU-Direct (similar to "srun --smpiargs=“-gpu”" on Summit):
>>> https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct;
>>> https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
>>> (6) using “-options_left” to check that all the arguments are accepted and
>>> used by PETSc;
>>> (7) after problem setup, running “SNESSolve()” multiple times to solve the
>>> linear problem and inspecting the log file with “-log_view”.
>>>
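>>> For concreteness, a rough sketch of steps (2)-(7) above (an approximation of
>>> the setup described, not the exact source: it assumes the DMPlex “dm” from
>>> step (1) already exists and that the residual/Jacobian callbacks are
>>> attached to the DM elsewhere; the GPU types come from the options in step (4)):
>>>
>>>   Vec            u;
>>>   Mat            J;
>>>   SNES           snes;
>>>   PetscErrorCode ierr;
>>>
>>>   ierr = DMSetFromOptions(dm);CHKERRQ(ierr);         /* picks up -dm_vec_type cuda, -dm_mat_type aijcusparse */
>>>   ierr = DMCreateGlobalVector(dm, &u);CHKERRQ(ierr); /* CUDA vector when -dm_vec_type cuda is set */
>>>   ierr = DMCreateMatrix(dm, &J);CHKERRQ(ierr);       /* AIJCUSPARSE matrix when -dm_mat_type aijcusparse is set */
>>>   ierr = SNESCreate(PETSC_COMM_WORLD, &snes);CHKERRQ(ierr);
>>>   ierr = SNESSetDM(snes, dm);CHKERRQ(ierr);
>>>   ierr = SNESSetJacobian(snes, J, J, NULL, NULL);CHKERRQ(ierr); /* use J; callbacks come from the DM */
>>>   ierr = SNESSetFromOptions(snes);CHKERRQ(ierr);     /* honors -snes_type ksponly, -pc_type ..., -log_view */
>>>   ierr = VecSet(u, 0.0);CHKERRQ(ierr);               /* zero initial guess */
>>>   ierr = SNESSolve(snes, NULL, u);CHKERRQ(ierr);
>>>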
>>> I noticed that if I run “SNESSolve()” 500 times, instead of 50 times,
>>> the “CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for
>>> some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo,
>>> VecCUDACopyFrom, MatCUSPARSCopyTo. See below for a truncated log
>>> corresponding to running SNESSolve() 500 times:
>>>
>>>
>>> Event                 Count      Time (sec)     Flop                               --- Global ---   --- Stage ----   Total    GPU   - CpuToGpu -    - GpuToCpu -  GPU
>>>                     Max Ratio   Max     Ratio   Max      Ratio  Mess    AvgLen  Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s Mflop/s  Count  Size     Count  Size    %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided        510 1.0 4.9205e-03   1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03   0   0  0  0   0    0   0 21  0   0     0     0      0 0.00e+00     0 0.00e+00   0
>>> BuildTwoSidedF       501 1.0 1.0199e-02   1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03   0   0  0  0   0    0   0  0  0   0     0     0      0 0.00e+00     0 0.00e+00   0
>>> SNESSolve            500 1.0 3.2570e+02   1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05 100 100  0  0 100  100 100  0  0 100   144   202  31947 7.82e+02 63363 1.44e+03  82
>>> SNESSetUp              1 1.0 6.0082e-04   1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00   0   0  0  0   0    0   0  0  0   0     0     0      0 0.00e+00     0 0.00e+00   0
>>> SNESFunctionEval     500 1.0 3.9826e+01   1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02  12   3  0  0   0   12   3  0  0   0    36    13      0 0.00e+00  1000 2.48e+01   0
>>> SNESJacobianEval     500 1.0 4.8200e+01   1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03  15   5  0  0   0   15   5  0  0   0    50     0   1000 7.77e+01   500 1.24e+01   0
>>> DMPlexResidualFE     500 1.0 3.6923e+01   1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00  10   3  0  0   0   10   3  0  0   0    39     0      0 0.00e+00   500 1.24e+01   0
>>> DMPlexJacobianFE     500 1.0 4.6013e+01   1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03  14   5  0  0   0   14   5  0  0   0    52     0   1000 7.77e+01     0 0.00e+00   0
>>> MatSOR             30947 1.0 3.1254e+00   1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00   1  10  0  0   0    1  10  0  0   0  1542     0      0 0.00e+00 61863 1.41e+03   0
>>> MatAssemblyBegin     511 1.0 5.3428e+00 256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03   1   0  0  0   0    1   0  0  0   0     0     0      0 0.00e+00     0 0.00e+00   0
>>> MatAssemblyEnd       511 1.0 4.3440e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01   0   0  0  0   0    0   0  0  0   0     0     0   1002 7.80e+01     0 0.00e+00   0
>>> MatCUSPARSCopyTo    1002 1.0 3.6557e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0   0    0   0  0  0   0     0     0   1002 7.80e+01     0 0.00e+00   0
>>> VecMDot            29930 1.0 3.7843e+01   1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04  12  22  0  0   7   12  22  0  0   7   277  3236  29930 6.81e+02     0 0.00e+00 100
>>> VecNorm            31447 1.0 2.1164e+01   1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04   5   2  0  0   7    5   2  0  0   7    34    55   1017 2.31e+01     0 0.00e+00 100
>>> VecNormalize       30947 1.0 2.3957e+01   1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04   7   2  0  0   7    7   2  0  0   7    44    51   1017 2.31e+01     0 0.00e+00 100
>>> VecCUDACopyTo      30947 1.0 7.8866e+00   3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2   0  0  0   0    2   0  0  0   0     0     0  30947 7.04e+02     0 0.00e+00   0
>>> VecCUDACopyFrom    63363 1.0 1.0873e+00   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0   0  0  0   0    0   0  0  0   0     0     0      0 0.00e+00 63363 1.44e+03   0
>>> KSPSetUp             500 1.0 2.2737e-03   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00   0   0  0  0   0    0   0  0  0   0     0     0      0 0.00e+00     0 0.00e+00   0
>>> KSPSolve             500 1.0 2.3687e+02   1.0 1.08e+10 1.0 0.0e+00 0.0e+00 8.6e+05  72  92  0  0  99   73  92  0  0  99   182   202  30947 7.04e+02 61863 1.41e+03  89
>>> KSPGMRESOrthog     29930 1.0 1.8920e+02   1.0 7.87e+09 1.0 0.0e+00 0.0e+00 6.4e+05  58  67  0  0  74   58  67  0  0  74   166   209  29930 6.81e+02     0 0.00e+00 100
>>> PCApply            30947 1.0 3.1555e+00   1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00   1  10  0  0   0    1  10  0  0   0  1527     0      0 0.00e+00 61863 1.41e+03   0
>>>
>>>
>>> Thanks!
>>> Chonglin
>>>
>>>
>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/