[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Mark Adams mfadams at lbl.gov
Wed Feb 22 11:24:07 CST 2023


On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <
paul.grosse-bley at ziti.uni-heidelberg.de> wrote:

> Hi Barry,
>
> after using VecCUDAGetArray to initialize the RHS, that kernel still gets
> called as part of KSPSolve instead of KSPSetUp, but its runtime is far less
> significant than the cudaMemcpy it replaced, so I guess I will leave it like
> this. Other than that, I kept the code as in my first message in this
> thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>
> The profiling results for PCMG and PCAMG look as I would expect them to
> look, i.e. one can nicely see the GPU load/kernel runtimes going down and
> up again for each V-cycle.
>
> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to
> make any difference. I would have expected to be able to increase the
> number of V-cycles per KSP iteration, but I keep seeing just a single
> V-cycle when changing the option (using PCMG).
>

How are you seeing this?
You might try -log_trace to see if you get two V cycles.
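
For reference, a minimal sketch of requesting multiple V-cycles programmatically
rather than via -pc_mg_multiplicative_cycles; it assumes ksp is the solver being
configured, uses 3 levels purely for illustration, and the cycle count only
matters for the (default) multiplicative MG type:

PC pc;
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCMG));
PetscCall(PCMGSetLevels(pc, 3, NULL));            /* 3 levels, assumed for illustration */
PetscCall(PCMGSetType(pc, PC_MG_MULTIPLICATIVE)); /* the default multiplicative cycling */
PetscCall(PCMGMultiplicativeSetCycles(pc, 2));    /* same effect as -pc_mg_multiplicative_cycles 2 */

Checking the cycle count reported by -ksp_view is another quick sanity check
that the option actually reached the PC.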


>
> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess
> between benchmark iterations to reset the solution vector does not seem to
> work, as the residual keeps shrinking. Is this a bug? Any advice for working
> around this?
>
>
Looking at the doc
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/
you use this together with KSPSetComputeRHS().

In src/snes/tests/ex13.c I just zero out the solution vector.
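
A minimal sketch of that approach for the benchmarking loop, assuming b and x
were created beforehand (e.g. with MatCreateVecs) and reusing BENCH_SOLVES and
solve_bench_stage from Paul's original snippet:

for (PetscInt i = 0; i < BENCH_SOLVES; i++) {
  PetscCall(VecZeroEntries(x));            /* reset the solution / initial guess */
  PetscCall(PetscLogStagePush(solve_bench_stage));
  PetscCall(KSPSolve(ksp, b, x));          /* every benchmark solve now does the same work */
  PetscCall(PetscLogStagePop());
}

This sidesteps KSPSetComputeInitialGuess entirely, which should also avoid the
BoomerAMG reset issue described above.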


> The profile for BoomerAMG also doesn't really show the V-cycle behavior of
> the other implementations. Most of the runtime seems to go into calls to
> cusparseDcsrsv which might happen at the different levels, but the runtime
> of these kernels doesn't show the V-cycle pattern. According to the output
> with -pc_hypre_boomeramg_print_statistics it is doing the right thing
> though, so I guess it is alright (and if not, this is probably the wrong
> place to discuss it).


> When using PCAMGX, I see two PCApply calls in KSPSolve (three for the very
> first KSPSolve), each showing a nice V-cycle pattern, while I expect just
> one. Each KSPSolve should do a single preconditioned Richardson iteration.
> Why is the preconditioner applied multiple times here?
>
>
Again, I am not sure what "see" means here, but PCAMGX is pretty new and has
not been used much.
Note that some KSP methods apply the PC before the iteration.
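
One hedged way to narrow this down: configure the solve so that, in principle,
only a single PC application per KSPSolve is expected, then compare with the
profile. The sketch below assumes a Richardson KSP and turns off the norm
computation, since residual norms are a common source of extra PC applications
with left preconditioning:

PetscCall(KSPSetType(ksp, KSPRICHARDSON));
PetscCall(KSPSetNormType(ksp, KSP_NORM_NONE));                       /* no residual norms */
PetscCall(KSPSetConvergenceTest(ksp, KSPConvergedSkip, NULL, NULL)); /* never test convergence */
PetscCall(KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 1)); /* exactly one iteration */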

Mark


> Thank you,
> Paul Große-Bley
>
>
> On Monday, February 06, 2023 20:05 CET, Barry Smith <bsmith at petsc.dev>
> wrote:
>
>
>
>
>
>   It should not crash; take a look at the test cases at the bottom of the
> file. You are likely correct: since the code, unfortunately, does not use
> DMCreateMatrix(), it will not work out of the box for geometric multigrid.
> So it might be the wrong example for you.
>
>   I don't know what you mean about clever. If you simply set the solution
> to zero at the beginning of the loop then it will just do the same solve
> multiple times. The setup should not do much of anything after the first
> solve. Though usually solves are big enough that one need not run them
> multiple times to get a good understanding of their performance.
>
>
>
>
>
>
>
> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
> point, thank you! Sadly I get a segfault when executing that example with
> PCMG and more than one level, i.e. with the minimal args:
>
> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
> ===========================================
> Test: KSP performance - Poisson
>     Input matrix: 27-pt finite difference stencil
>     -n 100
>     DoFs = 1000000
>     Number of nonzeros = 26463592
>
> Step1  - creating Vecs and Mat...
> Step2a - running PCSetUp()...
> [0]PETSC ERROR:
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [0]PETSC ERROR: or try
> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
> systems to find memory corruption errors
> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
> run
> [0]PETSC ERROR: to get more information on the crash.
> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
> causing the crash.
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> As the matrix is not created using DMDACreate3d, I expected it to fail due
> to the missing geometric information, but I expected a more graceful failure
> than a segfault.
> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
> preconditioning, especially since I am interested in the 7-pt stencil for
> now.
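
A minimal sketch of the ex45-style wiring that gives PCMG the geometric
information it needs; ComputeMatrix and ComputeRHS stand in for ex45-style user
callbacks and the grid size is only an assumption:

DM da;
PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                       DMDA_STENCIL_STAR, 65, 65, 65,             /* grid size assumed for illustration */
                       PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                       1, 1, NULL, NULL, NULL, &da));
PetscCall(DMSetFromOptions(da));
PetscCall(DMSetUp(da));
PetscCall(KSPSetDM(ksp, da));
PetscCall(KSPSetComputeOperators(ksp, ComputeMatrix, NULL));      /* assemble the 7-pt stencil here */
PetscCall(KSPSetComputeRHS(ksp, ComputeRHS, NULL));
/* -pc_type mg -pc_mg_levels <n> can now coarsen geometrically via the DMDA */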
>
> Concerning my benchmarking loop from before: Is it generally discouraged
> to do this for KSPSolve because PETSc cleverly/lazily skips some of the
> work when doing the same solve multiple times, or are the solves simply not
> iterated in bench_kspsolve.c (while the MatMults are, with -matmult) just
> to keep the runtime short?
>
> Thanks,
> Paul
>
> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsmith at petsc.dev>
> wrote:
>
>
>
>
>
>   Paul,
>
>    I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to
> be used for simple benchmarking.
>
>    You can use VecCUDAGetArray() to access the GPU memory of the vector
> and then call a CUDA kernel to compute the right-hand side vector directly
> on the GPU.
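
A minimal sketch of that suggestion, assuming a CUDA-type vector b and a
user-written kernel fill_rhs_kernel (hypothetical, not part of PETSc):

PetscScalar *d_b;
PetscInt     n;
PetscCall(VecGetLocalSize(b, &n));
PetscCall(VecCUDAGetArray(b, &d_b));                  /* borrow the device pointer */
fill_rhs_kernel<<<(n + 255) / 256, 256>>>(d_b, n);    /* hypothetical user kernel */
PetscCallCUDA(cudaGetLastError());
PetscCall(VecCUDARestoreArray(b, &d_b));              /* marks the device copy as up to date */

Because the device copy is marked current, the later MatResidual/VecAYPX calls
should not need a host-to-device transfer of this vector.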
>
>   Barry
>
>
>
> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
> Hi,
>
> I want to compare different implementations of multigrid solvers for
> Nvidia GPUs using the Poisson problem (starting from KSP tutorial example
> ex45.c).
> Therefore, I am trying to get runtime results comparable to hpgmg-cuda
> <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/>
> (finite-volume), i.e. using multiple warmup and measurement solves and
> avoiding measuring setup time.
> For now I am using -log_view with added stages:
>
> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
> for (int i = 0; i < BENCH_SOLVES; i++) {
>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>   PetscCall(KSPSetUp(ksp));                        // try to avoid setup overhead during solve
>   PetscCall(PetscDeviceContextSynchronize(dctx));  // make sure that everything is done
>
>   PetscLogStagePush(solve_bench_stage);
>   PetscCall(KSPSolve(ksp, NULL, NULL));
>   PetscLogStagePop();
> }
>
> This snippet is preceded by a similar loop for warmup.
>
> When profiling this using Nsight Systems, I see that the very first solve
> is much slower, which mostly corresponds to H2D (host-to-device) copies and
> e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs
> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
> but probably insignificant in this case). The following solves have some
> overhead at the start from an H2D copy of a vector (the RHS, I guess, as the
> copy is preceded by a matrix-vector product) in the first MatResidual call
> (callchain:
> KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
> interpretation of the profiling results (i.e. the cuBLAS calls) is that this
> vector is overwritten with the residual in Daxpy and therefore has to be
> copied again for the next iteration.
>
> Is there an elegant way of avoiding that H2D copy? I have seen some
> examples on constructing matrices directly on the GPU, but nothing about
> vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers? At
> the moment I am using jacobi as the smoother, but I would like to have a CUDA
> implementation of SOR instead. Is there a good way of achieving that, e.g.
> using PCHYPRE's BoomerAMG with a single level and its "SOR/Jacobi" smoother
> as the smoother in PCMG? Or is the overhead from constantly switching between
> PETSc and hypre too big?
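
A hedged sketch of how that idea could be wired up, assuming pc is the PCMG
preconditioner and using level 1 purely for illustration; whether the
PETSc/hypre switching overhead makes this worthwhile is exactly the open
question:

KSP smooth_ksp;
PC  smooth_pc;
PetscCall(PCMGGetSmoother(pc, 1, &smooth_ksp));       /* smoother KSP on level 1 (illustrative) */
PetscCall(KSPGetPC(smooth_ksp, &smooth_pc));
PetscCall(PCSetType(smooth_pc, PCHYPRE));
PetscCall(PCHYPRESetType(smooth_pc, "boomeramg"));
/* keep BoomerAMG to a single level and pick its relaxation via options */
PetscCall(PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_max_levels", "1"));
PetscCall(PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_relax_type_all", "SOR/Jacobi"));
PetscCall(PCSetFromOptions(smooth_pc));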
>
> Thanks,
> Paul
>
>