[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Matthew Knepley knepley at gmail.com
Wed Feb 22 16:47:06 CST 2023


On Wed, Feb 22, 2023 at 5:45 PM Paul Grosse-Bley <
paul.grosse-bley at ziti.uni-heidelberg.de> wrote:

> I thought I had observed that the number of cycles
> with -pc_mg_multiplicative_cycles depends on rtol. But I might have
> seen this with maxits=0, which would explain my misunderstanding of
> richardson.
>
> I guess PCAMGX does not use this PCApplyRichardson_MG (yet?), because I
> still see the multiple PCApplys there with maxits=1 and richardson, while
> they are gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?).
>

Yes, that is true. AMGX is a black box, so we cannot look inside the
V-cycle.

  Thanks,

     Matt


> Best,
> Paul Grosse-Bley
>
>
> On Wednesday, February 22, 2023 23:20 CET, Barry Smith <bsmith at petsc.dev>
> wrote:
>
>
>
>
>
>    Preonly means exactly one application of the PC so it will never
> converge by itself unless the PC is a full solver.
>
>    Note there is a PCApplyRichardson_MG() that gets used automatically
> with KSPRICHARDSON. This does not have an "extra" application of the
> preconditioner, so 2 iterations of Richardson with MG will use 2
> applications of the V-cycle. So it is exactly "multigrid as a solver,
> without a Krylov method", with no extra work. So I don't think you need to
> make any "compromises".
>
>   Barry
>
>
>
>
> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc.
> 5. preonly causes -log_view to measure incorrect timings for custom
> stages, i.e. the time for the stage (219us) is significantly shorter than
> the time for the KSPSolve inside the stage (~40ms).
>
> Number 4 will be problematic as I want to benchmark the number of V-cycles and
> runtime for a given rtol. At the same time I want to avoid richardson now
> because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without
> interference from an outer Krylov solver and still iterate until
> convergence?
> Or will I just have to accept the additional V-cycle due to the left
> application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until
> the residual gets low enough (using preonly), but that seems very
> inefficient.
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
>
> I was using the Richardson KSP type which I guess has the same behavior as
> GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use
> preonly from now on, where maxits=1 does what I want it to do.
>
> Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I
> think all my problems are resolved for now. Thank you very much, Barry and
> Mark!
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 21:03 CET, Barry Smith <bsmith at petsc.dev>
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
> Hi Barry,
>
> I think most of my "weird" observations came from the fact that I looked
> at iterations of KSPSolve where the residual had already converged. PCMG
> and PCGAMG do one V-cycle before even taking a look at the residual and
> then, independent of pc_mg_multiplicative_cycles, stop if it is converged.
>
> Looking at iterations that are not converged with PCMG,
> pc_mg_multiplicative_cycles works fine.
>
> At these iterations I also see the multiple calls to PCApply in a single
> KSPSolve iteration which were throwing me off with PCAMGX before.
>
> The reason for these multiple applications of the preconditioner (tested
> for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This
> could be better documented, I think.
>
>
>    I do not understand what you are talking about with regard to maxits of
> 1 instead of 0. For KSP, maxits of 1 means one iteration; 0 is kind of
> meaningless.
>
>    The reason there is a PCApply at the start of the solve is that by
> default the KSPType is KSPGMRES, which by default uses left
> preconditioning, so the right hand side needs to be scaled by the
> preconditioner before the KSP process starts. In this configuration one
> KSP iteration results in 2 PCApply calls. You can use -ksp_pc_side right
> to use right preconditioning, and then the number of PCApply calls will
> match the number of KSP iterations.
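>
>    In other words, the default behaves like
>
>      -ksp_type gmres -ksp_pc_side left
>
>    while switching to
>
>      -ksp_type gmres -ksp_pc_side right
>
>    avoids that initial scaling of the right hand side. Comparing the
> PCApply count in -log_view with the iteration count reported by
> -ksp_converged_reason is an easy way to check this.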
>
>
> Best,
> Paul Große-Bley
>
>
>
> On Wednesday, February 22, 2023 20:15 CET, Barry Smith <bsmith at petsc.dev>
> wrote:
>
>
>
>
>
>
> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
> Hi Mark,
>
> I use Nvidia Nsight Systems with --trace
> cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX
> markers that come with -log_view. I.e. I get a nice view of all cuBLAS and
> cuSPARSE calls (in addition to the actual kernels which are not always easy
> to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more
> detailed NVTX markers.
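>
> For reference, the full profiling invocation looks roughly like this (the
> executable name and output file are placeholders):
>
>   nsys profile --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose \
>     -o mg_profile ./solver -pc_type mg -pc_mg_levels 3 -log_view -pc_mg_log
>
> The NVTX ranges from -log_view then show up as named regions on the Nsight
> Systems timeline.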
>
> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear
> because kernel runtimes on coarser levels are much shorter. At the coarsest
> level, there normally isn't even enough work for the GPU (Nvidia A100) to
> be fully occupied which is also visible in Nsight Systems.
>
>
>   Hmm, I ran an example with -pc_mg_multiplicative_cycles 2 and it most
> definitely changes the run. I do not understand why it would not work
> for you. If you run with and without the option, are the exact same counts
> listed for all the events in -log_view?
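>
>   For example (a sketch; "./solver" stands in for your benchmark binary):
>
>     ./solver -pc_type mg -log_view | grep -E "MatMult|MGSmooth"
>     ./solver -pc_type mg -pc_mg_multiplicative_cycles 2 -log_view | grep -E "MatMult|MGSmooth"
>
>   If the extra V-cycle really runs, the MatMult and per-level MGSmooth
> counts in the second run should be roughly double those of the first.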
>
>
> I run only a single MPI rank with a single GPU, so profiling is
> straightforward.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <mfadams at lbl.gov>
> wrote:
>
>
>
>
> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <
> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>
>> Hi Barry,
>>
>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets
>> called as part of KSPSolve instead of KSPSetup, but its runtime is way less
>> significant than the cudaMemcpy before, so I guess I will leave it like
>> this. Other than that I kept the code like in my first message in this
>> thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>
>> The profiling results for PCMG and PCAMG look as I would expect them to
>> look, i.e. one can nicely see the GPU load/kernel runtimes going down and
>> up again for each V-cycle.
>>
>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to
>> make any difference. I would have expected to be able to increase the
>> number of V-cycles per KSP iteration, but I keep seeing just a single
>> V-cycle when changing the option (using PCMG).
>
>
> How are you seeing this?
> You might try -log_trace to see if you get two V cycles.
>
>
>>
>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess
>> between bench iterations to reset the solution vector does not seem to work
>> as the residual keeps shrinking. Is this a bug? Any advice for working
>> around this?
>>
>
>
> Looking at the doc
> https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/
> you use this with KSPSetComputeRHS.
>
> In src/snes/tests/ex13.c I just zero out the solution vector.
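>
> A minimal sketch of that for your benchmark loop (assuming b and x are the
> explicit right-hand-side and solution vectors rather than the DM-managed
> ones):
>
>   for (int i = 0; i < BENCH_SOLVES; i++) {
>     PetscCall(VecZeroEntries(x)); // reset the initial guess instead of KSPSetComputeInitialGuess
>     PetscLogStagePush(solve_bench_stage);
>     PetscCall(KSPSolve(ksp, b, x)); // identical solve every time, starting from x = 0
>     PetscLogStagePop();
>   }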
>
>
>> The profile for BoomerAMG also doesn't really show the V-cycle behavior
>> of the other implementations. Most of the runtime seems to go into calls to
>> cusparseDcsrsv which might happen at the different levels, but the runtime
>> of these kernels doesn't show the V-cycle pattern. According to the output
>> with -pc_hypre_boomeramg_print_statistics it is doing the right thing
>> though, so I guess it is alright (and if not, this is probably the wrong
>> place to discuss it).
>
>
>> When using PCAMGX, I see two PCApply (each showing a nice V-cycle
>> behavior) calls in KSPSolve (three for the very first KSPSolve) while
>> expecting just one. Each KSPSolve should do a single preconditioned
>> Richardson iteration. Why is the preconditioner applied multiple times here?
>>
>
>
> Again, not sure what "see" is, but PCAMGX is pretty new and has not been
> used much.
> Note some KSP methods apply the PC before the iteration.
>
> Mark
>
>
>> Thank you,
>> Paul Große-Bley
>>
>>
>> On Monday, February 06, 2023 20:05 CET, Barry Smith <bsmith at petsc.dev>
>> wrote:
>>
>>
>>
>>
>>
>>  It should not crash; take a look at the test cases at the bottom of the
>> file. You are likely correct: since the code, unfortunately, does not use
>> DMCreateMatrix(), it will not work out of the box for geometric multigrid.
>> So it might be the wrong example for you.
>>
>>   I don't know what you mean about clever. If you simply set the solution
>> to zero at the beginning of the loop then it will just do the same solve
>> multiple times. The setup should not do much of anything after the first
>> solve. Though usually solves are big enough that one need not run solves
>> multiple times to get a good understanding of their performance.
>>
>>
>>
>>
>>
>>
>>
>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <
>> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>>
>> Hi Barry,
>>
>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
>> point, thank you! Sadly I get a segfault when executing that example with
>> PCMG and more than one level, i.e. with the minimal args:
>>
>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>> ===========================================
>> Test: KSP performance - Poisson
>>     Input matrix: 27-pt finite difference stencil
>>     -n 100
>>     DoFs = 1000000
>>     Number of nonzeros = 26463592
>>
>> Step1  - creating Vecs and Mat...
>> Step2a - running PCSetUp()...
>> [0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
>> https://petsc.org/release/faq/
>> [0]PETSC ERROR: or try
>> https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA
>> systems to find memory corruption errors
>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>> and run
>> [0]PETSC ERROR: to get more information on the crash.
>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is
>> causing the crash.
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 59.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> As the matrix is not created using DMDACreate3d, I expected it to fail due
>> to the missing geometric information, but more gracefully than with a
>> segfault.
>> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
>> preconditioning, especially since I am interested in the 7pt stencil for
>> now.
>>
>> Concerning my benchmarking loop from before: Is it generally discouraged
>> to do this for KSPSolve because PETSc cleverly/lazily skips some of the
>> work when doing the same solve multiple times, or are the solves simply
>> not iterated in bench_kspsolve.c (while the MatMults are with -matmult)
>> just to keep the runtime short?
>>
>> Thanks,
>> Paul
>>
>> On Monday, February 06, 2023 17:04 CET, Barry Smith <bsmith at petsc.dev>
>> wrote:
>>
>>
>>
>>
>>
>>   Paul,
>>
>>    I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to
>> be used for simple benchmarking.
>>
>>    You can use VecCUDAGetArray() to access the GPU memory of the vector
>> and then call a CUDA kernel to compute the right hand side vector directly
>> on the GPU.
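>>
>>    A rough sketch of that (the fill_rhs kernel and its body are made up
>> for illustration; error checking of the kernel launch is omitted):
>>
>>   __global__ void fill_rhs(PetscScalar *b, PetscInt n) {
>>     PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
>>     if (i < n) b[i] = 1.0; // whatever your actual right hand side is
>>   }
>>
>>   ...
>>   PetscScalar *barray;
>>   PetscInt     n;
>>   PetscCall(VecGetLocalSize(b, &n));
>>   PetscCall(VecCUDAGetArray(b, &barray));      // device pointer to the vector's GPU buffer
>>   fill_rhs<<<(n + 255) / 256, 256>>>(barray, n);
>>   PetscCall(VecCUDARestoreArray(b, &barray));  // tells PETSc the GPU copy is now the valid one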
>>
>>   Barry
>>
>>
>>
>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <
>> paul.grosse-bley at ziti.uni-heidelberg.de> wrote:
>>
>> Hi,
>>
>> I want to compare different implementations of multigrid solvers for
>> Nvidia GPUs using the Poisson problem (starting from KSP tutorial example
>> ex45.c).
>> Therefore I am trying to get runtime results comparable to hpgmg-cuda
>> <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/>
>> (finite-volume), i.e. using multiple warmup and measurement solves and
>> avoiding measuring setup time.
>> For now I am using -log_view with added stages:
>>
>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>   for (int i = 0; i < BENCH_SOLVES; i++) {
>>     PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>     PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>     PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>
>>     PetscLogStagePush(solve_bench_stage);
>>     PetscCall(KSPSolve(ksp, NULL, NULL));
>>     PetscLogStagePop();
>>   }
>>
>> This snippet is preceded by a similar loop for warmup.
>>
>> When profiling this using Nsight Systems, I see that the very first solve
>> is much slower, which mostly corresponds to H2D (host to device) copies and
>> e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs
>> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
>> but probably insignificant in this case). The following solves have some
>> overhead at the start from a H2D copy of a vector (the RHS I guess, as the
>> copy is preceded by a matrix-vector product) in the first MatResidual call
>> (callchain:
>> KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
>> interpretation of the profiling results (i.e. cuBLAS calls) is that that
>> vector is overwritten with the residual in Daxpy and therefore has to be
>> copied again for the next iteration.
>>
>> Is there an elegant way of avoiding that H2D copy? I have seen some
>> examples on constructing matrices directly on the GPU, but nothing about
>> vectors. Any further tips for benchmarking (vs profiling) PETSc solvers? At
>> the moment I am using jacobi as smoother, but I would like to have a CUDA
>> implementation of SOR instead. Is there a good way of achieving that, e.g.
>> using PCHYPRE's BoomerAMG with a single level and "SOR/Jacobi" smoother as
>> the smoother in PCMG? Or is the overhead from constantly switching between
>> PETSc and hypre too big?
>>
>> Thanks,
>> Paul
>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/