[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Mark Adams mfadams at lbl.gov
Wed Feb 22 11:12:15 CST 2023


On Tue, Feb 7, 2023 at 6:40 AM Matthew Knepley <knepley at gmail.com> wrote:

> On Tue, Feb 7, 2023 at 6:23 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> I do one complete solve to get everything set up, to be safe.
>>
>> src/ts/tutorials/ex13.c does this and runs multiple solves if you like,
>> but one solve is probably fine.
>>
>
> I think that is SNES ex13
>

Yes, it is src/snes/tests/ex13.c


>
>   Matt
>
>
>> This was designed as a benchmark and is nice because it can do an FE solve
>> of Poisson of any order (it uses DM/PetscFE, so the setup is slow).
>> src/ksp/ksp/tutorials/ex56.c is old school and hardwired for elasticity, but
>> it is simpler and the setup is faster if you are doing large problems per MPI
>> process.
>>
>> Mark
>>
>> On Mon, Feb 6, 2023 at 2:06 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>  It should not crash; take a look at the test cases at the bottom of the
>>> file. You are likely correct: if the code, unfortunately, does use
>>> DMCreateMatrix(), it will not work out of the box for geometric multigrid.
>>> So it might be the wrong example for you.
>>>
>>>   I don't know what you mean about clever. If you simply set the
>>> solution to zero at the beginning of the loop, then it will just do the same
>>> solve multiple times. The setup should not do much of anything after the
>>> first solve. Though usually solves are big enough that one need not run
>>> them multiple times to get a good understanding of their performance.
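>>>
>>>   For example, such a loop might look roughly like this (untested sketch;
>>> b, x, and nsolves are just placeholder names):
>>>
>>> PetscCall(KSPSolve(ksp, b, x));     // untimed first solve triggers all the setup
>>> for (PetscInt i = 0; i < nsolves; i++) {
>>>   PetscCall(VecZeroEntries(x));     // reset the solution so every solve does the same work
>>>   PetscCall(KSPSolve(ksp, b, x));   // time only these repeated solves
>>> }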
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
>>> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi Barry,
>>>
>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
>>> point, thank you! Sadly I get a segfault when executing that example with
>>> PCMG and more than one level, i.e. with the minimal args:
>>>
>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>> ===========================================
>>> Test: KSP performance - Poisson
>>>     Input matrix: 27-pt finite difference stencil
>>>     -n 100
>>>     DoFs = 1000000
>>>     Number of nonzeros = 26463592
>>>
>>> Step1  - creating Vecs and Mat...
>>> Step2a - running PCSetUp()...
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>>> https://petsc.org/release/faq/
>>> [0]PETSC ERROR: or try
>>> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
>>> systems to find memory corruption errors
>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>>> and run
>>> [0]PETSC ERROR: to get more information on the crash.
>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
>>> causing the crash.
>>>
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> As the matrix is not created using DMDACreate3d, I expected it to fail
>>> due to the missing geometric information, but more gracefully than with a
>>> segfault.
>>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
>>> preconditioning, especially since I am interested in the 7-pt stencil for
>>> now.
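>>>
>>> Roughly what I have in mind, following the ex45.c pattern (untested sketch;
>>> ComputeMatrix and ComputeRHS would be my own callbacks):
>>>
>>> DM da;
>>> PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
>>>                        DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 7, 7, 7,
>>>                        PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, 1, 1,
>>>                        NULL, NULL, NULL, &da));
>>> PetscCall(DMSetFromOptions(da));
>>> PetscCall(DMSetUp(da));
>>> PetscCall(KSPSetDM(ksp, da));                                 // gives PCMG the grid hierarchy
>>> PetscCall(KSPSetComputeOperators(ksp, ComputeMatrix, NULL));  // assemble the 7-pt stencil
>>> PetscCall(KSPSetComputeRHS(ksp, ComputeRHS, NULL));
>>> // then -pc_type mg -pc_mg_levels N can build interpolation from the DMDA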
>>>
>>> Concerning my benchmarking loop from before: is it generally discouraged
>>> to do this for KSPSolve because PETSc cleverly/lazily skips some of the work
>>> when doing the same solve multiple times, or are the solves simply not
>>> iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to
>>> keep the runtime short?
>>>
>>> Thanks,
>>> Paul
>>>
>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsmith at petsc.dev>
>>> wrote:
>>>
>>>
>>>
>>>
>>>
>>>   Paul,
>>>
>>>    I think src/ksp/ksp/tutorials/bench_kspsolve.c is the code intended to
>>> be used for simple benchmarking.
>>>
>>>    You can use VecCUDAGetArray() to access the GPU memory of the vector
>>> and then call a CUDA kernel to compute the right-hand-side vector directly
>>> on the GPU.
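>>>
>>>   Something along these lines (untested sketch; it assumes a CUDA build and
>>> a VECCUDA right-hand-side vector b, and fill_rhs is a kernel you write
>>> yourself):
>>>
>>> __global__ void fill_rhs(PetscScalar *b, PetscInt n, PetscReal h)
>>> {
>>>   PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
>>>   if (i < n) b[i] = h * h;  // whatever your right-hand side is
>>> }
>>>
>>> PetscScalar *d_b;
>>> PetscInt     n;
>>> PetscCall(VecGetLocalSize(b, &n));
>>> PetscCall(VecCUDAGetArray(b, &d_b));            // device pointer; marks the data valid on the GPU
>>> fill_rhs<<<(n + 255) / 256, 256>>>(d_b, n, h);
>>> PetscCall(VecCUDARestoreArray(b, &d_b));        // no host->device copy needed later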
>>>
>>>   Barry
>>>
>>>
>>>
>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <
>>> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>>>
>>> Hi,
>>>
>>> I want to compare different implementations of multigrid solvers for
>>> NVIDIA GPUs using the Poisson problem (starting from the KSP tutorial example
>>> ex45.c).
>>> Therefore, I am trying to get runtime results comparable to hpgmg-cuda
>>> <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/>
>>> (finite-volume), i.e. using multiple warmup and measurement solves and
>>> not measuring setup time.
>>> For now I am using -log_view with added stages:
>>>
>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>>   for (int i = 0; i < BENCH_SOLVES; i++) {
>>>     PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess,
>>> NULL)); // reset x
>>>     PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>     PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that
>>> everything is done
>>>
>>>     PetscLogStagePush(solve_bench_stage);
>>>     PetscCall(KSPSolve(ksp, NULL, NULL));
>>>     PetscLogStagePop();
>>>   }
>>>
>>> This snippet is preceded by a similar loop for warmup.
>>>
>>> When profiling this using Nsight Systems, I see that the very first
>>> solve is much slower, which mostly corresponds to H2D (host to device) copies
>>> and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs
>>> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
>>> but probably insignificant in this case). The following solves have some
>>> overhead at the start from an H2D copy of a vector (the RHS, I guess, as the
>>> copy is preceded by a matrix-vector product) in the first MatResidual call
>>> (call chain:
>>> KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
>>> interpretation of the profiling results (i.e. the cuBLAS calls) is that this
>>> vector is overwritten with the residual in Daxpy and therefore has to be
>>> copied again for the next iteration.
>>>
>>> Is there an elegant way of avoiding that H2D copy? I have seen some
>>> examples on constructing matrices directly on the GPU, but nothing about
>>> vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers? At
>>> the moment I am using Jacobi as the smoother, but I would like to have a CUDA
>>> implementation of SOR instead. Is there a good way of achieving that, e.g.
>>> using PCHYPRE's BoomerAMG with a single level and its "SOR/Jacobi" relaxation
>>> as the smoother in PCMG? Or is the overhead from constantly switching between
>>> PETSc and hypre too big?
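>>>
>>> E.g. I am thinking of options along these lines (untested; I have not
>>> verified the exact hypre option values):
>>>
>>> -pc_type mg -pc_mg_levels 4 \
>>>   -mg_levels_ksp_type richardson \
>>>   -mg_levels_pc_type hypre -mg_levels_pc_hypre_type boomeramg \
>>>   -mg_levels_pc_hypre_boomeramg_max_levels 1 \
>>>   -mg_levels_pc_hypre_boomeramg_relax_type_all SOR/Jacobi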
>>>
>>> Thanks,
>>> Paul
>>>
>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>