[petsc-dev] PETSc GPU example

Jed Brown jed at jedbrown.org
Tue Dec 7 17:10:36 CST 2021


Fande, let me know if you'd like to run the libCEED hyperelasticity solver on GPUs. It's matrix-free p-multigrid (assembling only the coarse problem). It uses DMPlex for mesh management.

Mark Adams <mfadams at lbl.gov> writes:

> Also Fande,
>
> If you are _not_ using NVIDIA with the MPS system, then you should run with
> the default -cells 1,1,1, and use just one MPI process and one GPU.
> This will be fine for evaluating the GPU.
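>
> For a first test, such a run might look like the sketch below. The
> aijcusparse/cuda types are PETSc's standard CUDA backends, but whether
> the plain -mat_type/-vec_type options or a DM-prefixed variant applies
> depends on how the example creates its objects, so treat the exact
> option names as an assumption to verify:
>
>   ./ex56 -cells 1,1,1 -max_conv_its 3 -mat_type aijcusparse -vec_type cuda -log_view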
>
> If you want to use more than one MPI process per GPU (because you want
> to use the CPUs in the rest of your app), then the MPS system is
> important: I see a 3x speedup with it. I would use NVIDIA+MPS unless
> you can talk to someone at your vendor who is knowledgeable about
> running multiple MPI processes per GPU.
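>
> If you do go the MPS route, enabling it is simple. A minimal sketch,
> assuming a standard CUDA installation with one control daemon per node:
>
>   nvidia-cuda-mps-control -d            # start the MPS control daemon
>   mpirun -n 4 ./ex56 ...                # run your job as usual
>   echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards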
>
> Now, if you can use NVIDIA+MPS, it would be interesting to compare GPU
> solver performance with a single vs. multiple MPI processes per GPU. It
> should be faster to use one MPI process per GPU (running the same
> problem, of course), but it would be interesting to quantify this.
> If you want to do this then I can explain how to do it.
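>
> Concretely, a comparison could look like the following (hypothetical
> process counts; keeping -cells 2,2,1 in both runs keeps the global
> problem identical):
>
>   mpirun -n 1 ./ex56 -cells 2,2,1 -max_conv_its 4 -log_view
>   mpirun -n 4 ./ex56 -cells 2,2,1 -max_conv_its 4 -log_view   # 4 ranks sharing one GPU via MPS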
>
> Thanks,
> Mark
>
> On Mon, Dec 6, 2021 at 10:03 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> * snes/ex56 runs a convergence study and, confusingly, sets the
>> refinement options itself, thus erasing your -ex56_dm_refine.
>>
>> * To refine, use -max_conv_its N <3>; this sets the number of
>> refinement steps, that is, the length of the convergence study.
>>
>> * You can adjust where it starts from with -cells i,j,k <1,1,1>.
>> You do want to set this if you have multiple MPI processes, so that
>> the number of cells in this initial mesh equals the number of
>> processes. That way it starts with one cell per process and refines
>> from there, as in the example below.
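>>
>> With 8 processes (a hypothetical count), that means starting from a
>> 2x2x2 mesh, one cell per process, and refining four times:
>>
>>   mpirun -n 8 ./ex56 -cells 2,2,2 -max_conv_its 4 -log_view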
>>
>> * GPU speedup is all about subdomain size. AMG does a lot of kernel
>> launches, and you need enough local work to overcome that launch
>> latency before you get a net gain.
>> Very rough numbers: I see a speedup of about 5-10x with a few million
>> equations per GPU.
>> As Matt said, the assembly is on the CPU, and ex56 gets really slow on
>> larger problems; be prepared to run the largest case for close to an
>> hour. This setup time is not counted inside the KSP[SNES]Solve events
>> in the -log_view output, so keep an eye on it there.
>> When you do this convergence study, a new stage is created for each
>> refinement, so one run gives you data over a range of problem sizes.
>> Each refinement step increases the problem size by 8x, so when the
>> solve times also start increasing by ~8x, you know you are past the
>> latency-dominated regime. You want to get into that regime to see a
>> gain; the sketch below shows one way to pull the numbers out.
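>>
>> One way to read this off (a sketch; KSPSolve and SNESSolve are the
>> standard -log_view event names, reported once per stage) is to save
>> the log and pull out the solve events:
>>
>>   ./ex56 -cells 1,1,1 -max_conv_its 5 -log_view > ex56.log 2>&1
>>   grep -E "KSPSolve|SNESSolve" ex56.log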
>>
>> * The end of the source file has example parameter sets that you
>> should use (the gamg one).
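>>
>> For orientation, a representative AMG configuration looks something
>> like the following (standard PETSc option names, but the values here
>> are illustrative; the authoritative set is in the source):
>>
>>   -ksp_type cg -ksp_rtol 1e-8 -pc_type gamg -pc_gamg_agg_nsmooths 1 -pc_gamg_threshold 0.01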
>>
>> * src/snes/tests/ex13.c is designed as a benchmark test: it partitions
>> the problem better in parallel and uses modern Plex idioms. If you are
>> doing large-scale parallel runs, you should use it instead.
>> (It is a little hard to understand and not well documented.)
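>>
>> A typical invocation might look like this (a sketch; the option names
>> follow the usual DMPlex conventions, but check the source for the
>> exact options this test expects):
>>
>>   mpirun -n 8 ./ex13 -dm_plex_box_faces 2,2,2 -dm_refine 3 -petscpartitioner_type simple -log_view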
>>
>> Hope that helps,
>> Mark
>>
>> On Mon, Dec 6, 2021 at 9:05 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>
>>>
>>>
>>> On Mon, Dec 6, 2021 at 5:59 PM Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>>> On Mon, Dec 6, 2021 at 7:54 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>
>>>>> Thanks, Matt,
>>>>>
>>>>> Sorry, I still have more questions on this example. How do I refine
>>>>> the mesh to make the problem larger?
>>>>>
>>>>> I tried the following options, and none of them worked. I might be
>>>>> doing something wrong.
>>>>>
>>>>> -ex56_dm_refine 9
>>>>>
>>>>> and
>>>>>
>>>>> -dm_refine 4
>>>>>
>>>>
>>>> The mesh handling in this example does not conform to the others, but it
>>>> appears that
>>>>
>>>>   -ex56_dm_refine <k>
>>>>
>>>> should take effect at
>>>>
>>>>
>>>> https://gitlab.com/petsc/petsc/-/blob/main/src/snes/tutorials/ex56.c#L381
>>>>
>>>>
>>> I was puzzled about this because DMSetFromOptions does not seem to
>>> trigger -ex56_dm_refine.
>>>
>>> I did a search and could not find where -ex56_dm_refine is consumed
>>> in PETSc.
>>>
>>> I got the same result by running the following two combinations:
>>>
>>> 1) ./ex56  -log_view  -snes_view  -max_conv_its 3 -ex56_dm_refine 10
>>>
>>> 2) ./ex56  -log_view  -snes_view  -max_conv_its 3 -ex56_dm_refine 0
>>>
>>> Thanks,
>>>
>>> Fande
>>>
>>>
>>>> unless you are setting max_conv_its to 0 somehow.
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> Fande
>>>>>
>>>>> On Mon, Dec 6, 2021 at 5:04 PM Matthew Knepley <knepley at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Mon, Dec 6, 2021 at 7:02 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, Matt
>>>>>>>
>>>>>>> On Mon, Dec 6, 2021 at 4:47 PM Matthew Knepley <knepley at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Mon, Dec 6, 2021 at 6:40 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Dear PETSc team,
>>>>>>>>>
>>>>>>>>> I am interested in a careful evaluation of PETSc GPU performance on
>>>>>>>>> our INL cluster.
>>>>>>>>>
>>>>>>>>> Is there an example in PETSc that can show a GPU speedup when
>>>>>>>>> solving a nonlinear equation?
>>>>>>>>>
>>>>>>>>> I talked to Junchao; he suggested that I try snes/tutorials/ex56. I
>>>>>>>>> tried that, but I could not find any speedup using the GPU. I could
>>>>>>>>> attach some -log_view results later if that would help.
>>>>>>>>>
>>>>>>>>
>>>>>>>> We should note that you will only see speedup in the solver, so that
>>>>>>>> problem has to be pretty large. I believe Mark has good results with it.
>>>>>>>> The assembly is still all on the CPU. I am working on this over
>>>>>>>> break, and hope to have a CEED version of it by the new year.
>>>>>>>>
>>>>>>>
>>>>>>> Are both the function and matrix assemblies on the CPU, or just
>>>>>>> the matrix assembly?
>>>>>>>
>>>>>>
>>>>>> There is no GPU assembly right now.
>>>>>>
>>>>>>   Matt
>>>>>>
>>>>>>
>>>>>>> OK, I will try to check the solver part
>>>>>>>
>>>>>>> Thanks, again
>>>>>>>
>>>>>>> Fande
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>   Thanks,
>>>>>>>>
>>>>>>>>      Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>> I would appreciate any instructions/comments on running a simple
>>>>>>>>> PETSc GPU example to get a speedup.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Fande
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>>
>>>

