[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

Paul Grosse-Bley paul.grosse-bley at ziti.uni-heidelberg.de
Mon Feb 6 09:57:51 CST 2023


Hi,

I want to compare different implementations of multigrid solvers for NVIDIA GPUs using the Poisson problem (starting from KSP tutorial example ex45.c).
To that end I am trying to get runtime results comparable to hpgmg-cuda (finite-volume), i.e. using multiple warmup and measurement solves and excluding setup time from the measurement.
For now I am using -log_view with added stages:

PetscLogStage solve_bench_stage;
PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
for (int i = 0; i < BENCH_SOLVES; i++) {
  PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x before each solve
  PetscCall(KSPSetUp(ksp));                                             // try to keep setup overhead out of the solve
  PetscCall(PetscDeviceContextSynchronize(dctx));                       // make sure all previously launched GPU work is done

  PetscCall(PetscLogStagePush(solve_bench_stage));
  PetscCall(KSPSolve(ksp, NULL, NULL));
  PetscCall(PetscLogStagePop());
}

This snippet is preceded by a similar loop for warmup.
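For completeness, the warmup loop is essentially the same; in this sketch it pushes a separate stage (WARMUP_SOLVES and warmup_stage are placeholder names, the real loop may differ in those details):

PetscLogStage warmup_stage;
PetscCall(PetscLogStageRegister("Solve Warmup", &warmup_stage));
for (int i = 0; i < WARMUP_SOLVES; i++) {
  // same body as the benchmark loop: reset x, KSPSetUp, synchronize, ...
  PetscCall(PetscLogStagePush(warmup_stage));
  PetscCall(KSPSolve(ksp, NULL, NULL));
  PetscCall(PetscLogStagePop());
}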

When profiling this with Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host-to-device) copies and e.g. cuBLAS setup (maybe also the paging overheads mentioned in the docs, but those are probably insignificant here). The following solves each have some overhead at the start from an H2D copy of a vector (the RHS, I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that this vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
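In case the setup matters: I move the problem to the GPU via the usual runtime options for ex45, roughly along these lines (a sketch, not my exact command line; grid sizes, KSP type and cycle options are omitted):

./ex45 -dm_mat_type aijcusparse -dm_vec_type cuda \
       -pc_type mg -mg_levels_pc_type jacobi \
       -log_view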

Is there an elegant way of avoiding that H2D copy? I have seen some examples of constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (as opposed to profiling) PETSc solvers? At the moment I am using jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. by using PCHYPRE's BoomerAMG with a single level and the "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
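To make that last idea concrete, this is roughly the option combination I have in mind (just a sketch; I am not sure the hypre options compose like this when used as a PCMG level smoother):

-pc_type mg \
-mg_levels_ksp_type preonly \
-mg_levels_pc_type hypre \
-mg_levels_pc_hypre_type boomeramg \
-mg_levels_pc_hypre_boomeramg_max_levels 1 \
-mg_levels_pc_hypre_boomeramg_relax_type_all SOR/Jacobi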

Thanks,
Paul