[petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
Mark Adams
mfadams at lbl.gov
Tue Mar 19 16:15:56 CDT 2024
You want: -mat_type aijhipsparse
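
Something like the following, as an untested sketch (the launcher line and input
file name are placeholders; the solver itself, CG + hypre, is presumably still set
in the code as in the run below, and the options could equally go in the
PETSC_OPTIONS environment variable):

  srun -n 8 ./fds_mpich_gnu_frontier case.fds \
      -mat_type aijhipsparse -vec_type hip -log_view -options_left
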
On Tue, Mar 19, 2024 at 5:06 PM Vanella, Marcos (Fed) <
marcos.vanella at nist.gov> wrote:
> Hi Mark, thanks. I'll try your suggestions. So, I would keep -mat_type
> mpiaijkokkos but use -vec_type hip as the runtime options?
> Thanks,
> Marcos
> ------------------------------
> *From:* Mark Adams <mfadams at lbl.gov>
> *Sent:* Tuesday, March 19, 2024 4:57 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* PETSc users list <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] Running CG with HYPRE AMG preconditioner in
> AMD GPUs
>
> [keep on list]
>
> I have little experience with running hypre on GPUs but others might have
> more.
>
> 1M DoFs/node is not a lot, and NVIDIA has a larger L1 cache, more mature
> compilers, etc., so it is not surprising that NVIDIA is faster.
> I suspect the gap would narrow with a larger problem.
>
> Also, why are you using Kokkos? It should not make a difference but you
> could check easily. Just use -vec_type hip with your current code.
>
> You could also test with GAMG, -pc_type gamg
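>
> As an untested sketch, the two comparison runs would just change the run-time
> options, keeping everything else the same:
>
>   -mat_type mpiaijkokkos -vec_type hip                  (hypre PC, HIP vectors)
>   -mat_type mpiaijkokkos -vec_type hip -pc_type gamg    (GAMG instead of hypre)
>
> (-pc_type gamg overrides whatever preconditioner the code sets, assuming the
> code calls KSPSetFromOptions.)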
>
> Mark
>
>
> On Tue, Mar 19, 2024 at 4:12 PM Vanella, Marcos (Fed) <
> marcos.vanella at nist.gov> wrote:
>
> Hi Mark, I ran a canonical test we use to time our code. It is a propane
> fire on a burner within a box, with around 1 million cells.
> I split the problem across 4 GPUs on a single node, on both Polaris and
> Frontier. I compiled PETSc with the gnu compilers, downloading HYPRE, using
> the following configure options:
>
>
> - Polaris:
> $./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
> FCOPTFLAGS="-O3" CUDAOPTFLAGS="-O3" --with-debugging=0
> --download-suitesparse --download-hypre --with-cuda --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-cudac=nvcc --with-cuda-arch=80
> --download-cmake
>
>
>
> - Frontier:
> $./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
> FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
> --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
> ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
> --download-kokkos-kernels --download-suitesparse --download-hypre
> --download-cmake
>
>
> Our code was also compiled with the gnu compilers and the -O3 flag. I used
> the latest (from this week) PETSc repo update. These are the timings for the test case:
>
>
> - 8 meshes + 1Million cells case, 8 MPI processes, 4 GPUS, 2 MPI Procs
> per GPU, 1 sec run time (~580 time steps, ~1160 Poisson solves):
>
>
> System     Poisson Solver   GPU Implementation   Poisson Wall time (sec)   Total Wall time (sec)
> Polaris    CG + HYPRE PC    CUDA                  80                        287
> Frontier   CG + HYPRE PC    Kokkos + HIP         158                        401
>
> It is interesting to see that the Poisson solves take twice as long on
> Frontier as on Polaris.
> Do you have experience running HYPRE AMG on these machines? Is this
> difference between the CUDA implementation and the Kokkos-kernels one to be
> expected?
>
> I can run the case on both computers with the log flags you suggest. That
> might give more information on where the differences are.
>
> Thank you for your time,
> Marcos
>
>
> ------------------------------
> *From:* Mark Adams <mfadams at lbl.gov>
> *Sent:* Tuesday, March 5, 2024 2:41 PM
> *To:* Vanella, Marcos (Fed) <marcos.vanella at nist.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject:* Re: [petsc-users] Running CG with HYPRE AMG preconditioner in
> AMD GPUs
>
> You can run with -log_view_gpu_time to get rid of the nans and get more
> data.
>
> You can run with -ksp_view to get more info on the solver and send that
> output.
>
> -options_left is also good to use so we can see what parameters you used.
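>
> As a sketch, the full set of diagnostic options appended to the run would be
> something like:
>
>   -log_view -log_view_gpu_time -ksp_view -options_left
>
> (all standard PETSc command-line options; -log_view is already in use in the
> run below).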
>
> The last 100 (the GPU %F column) in this row:
>
> KSPSolve 1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04
> 3.1e+04 12 100 100 100 49 12 100 100 100 98 2503 -nan 0 1.80e-05
> 0 0.00e+00 100
>
> tells us that all the flops were logged on GPUs.
>
> You do need at least 100K equations per GPU to see speedup, so don't worry
> about small problems.
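>
> (For reference, assuming roughly one equation per cell, the ~1 million cell
> case mentioned above split across 4 GPUs is about 250K equations per GPU, so
> it is above that threshold.)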
>
> Mark
>
>
>
>
> On Tue, Mar 5, 2024 at 12:52 PM Vanella, Marcos (Fed) via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> Hi all, I compiled the latest PETSc source in Frontier using gcc+kokkos
> and hip options:
>
> ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
> FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
> --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
> ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
> --download-kokkos-kernels --download-suitesparse --download-hypre
> --download-cmake
>
> and have started testing our code solving a Poisson linear system with CG
> + a HYPRE preconditioner. Timings look rather high compared to builds on
> other machines that have NVIDIA cards. They also do not change when using
> more than one GPU for the simple test I am doing.
> Does anyone happen to know whether HYPRE has a HIP GPU implementation of
> BoomerAMG, and whether it gets compiled when configuring PETSc this way?
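>
> (One way to check, assuming the usual --download-hypre layout where hypre's
> headers land in $PETSC_ARCH/include, is to look for the HIP macro in hypre's
> generated config header, e.g.
>
>   grep HYPRE_USING_HIP $PETSC_DIR/$PETSC_ARCH/include/HYPRE_config.h
>
> If that macro is defined, hypre, including BoomerAMG, was built with HIP
> support.)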
>
> Thanks!
>
> Marcos
>
>
> PS: This is what I see in the log file (-log_view) when running the case
> with 2 GPUs on the node:
>
>
> ------------------------------------------------------------------ PETSc
> Performance Summary:
> ------------------------------------------------------------------
>
> /ccs/home/vanellam/Firemodels_fork/fds/Build/mpich_gnu_frontier/fds_mpich_gnu_frontier
> on a arch-linux-frontier-opt-gcc named frontier04119 with 4 processors, by
> vanellam Tue Mar 5 12:42:29 2024
> Using Petsc Development GIT revision: v3.20.5-713-gabdf6bc0fcf GIT Date:
> 2024-03-05 01:04:54 +0000
>
> Max Max/Min Avg Total
> Time (sec): 8.368e+02 1.000 8.368e+02
> Objects: 0.000e+00 0.000 0.000e+00
> Flops: 2.546e+11 0.000 1.270e+11 5.079e+11
> Flops/sec: 3.043e+08 0.000 1.518e+08 6.070e+08
> MPI Msg Count: 1.950e+04 0.000 9.748e+03 3.899e+04
> MPI Msg Len (bytes): 1.560e+09 0.000 7.999e+04 3.119e+09
> MPI Reductions: 6.331e+04 2877.545
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N
> --> 2N flops
> and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages
> --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count
> %Total Avg %Total Count %Total
> 0: Main Stage: 8.3676e+02 100.0% 5.0792e+11 100.0% 3.899e+04
> 100.0% 7.999e+04 100.0% 3.164e+04 50.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this
> phase
> %M - percent messages in this phase %L - percent message lengths
> in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
> time over all processors)
> CpuToGpu Count: total number of CPU to GPU copies per processor
> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
> processor)
> GpuToCpu Count: total number of GPU to CPU copies per processor
> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
> processor)
> GPU %F: percent flops on GPU in this event
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total GPU - CpuToGpu - -
> GpuToCpu - GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size
> Count Size %F
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 1201 0.0 nan nan 0.00e+00 0.0 2.0e+00 4.0e+00
> 6.0e+02 0 0 0 0 1 0 0 0 0 2 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> BuildTwoSidedF 1200 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+02 0 0 0 0 1 0 0 0 0 2 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> MatMult 19494 0.0 nan nan 1.35e+11 0.0 3.9e+04 8.0e+04
> 0.0e+00 7 53 100 100 0 7 53 100 100 0 -nan -nan 0 1.80e-05
> 0 0.00e+00 100
> MatConvert 3 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.5e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyBegin 2 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 2 0.0 nan nan 0.00e+00 0.0 4.0e+00 2.0e+04
> 3.5e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> VecTDot 41382 0.0 nan nan 4.14e+10 0.0 0.0e+00 0.0e+00
> 2.1e+04 0 16 0 0 33 0 16 0 0 65 -nan -nan 0 0.00e+00 0
> 0.00e+00 100
> VecNorm 20691 0.0 nan nan 2.07e+10 0.0 0.0e+00 0.0e+00
> 1.0e+04 0 8 0 0 16 0 8 0 0 33 -nan -nan 0 0.00e+00 0
> 0.00e+00 100
> VecCopy 2394 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> VecSet 21888 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> VecAXPY 38988 0.0 nan nan 3.90e+10 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 15 0 0 0 0 15 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 100
> VecAYPX 18297 0.0 nan nan 1.83e+10 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 7 0 0 0 0 7 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 100
> VecAssemblyBegin 1197 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+02 0 0 0 0 1 0 0 0 0 2 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> VecAssemblyEnd 1197 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> VecScatterBegin 19494 0.0 nan nan 0.00e+00 0.0 3.9e+04 8.0e+04
> 0.0e+00 0 0 100 100 0 0 0 100 100 0 -nan -nan 0 1.80e-05
> 0 0.00e+00 0
> VecScatterEnd 19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> SFSetGraph 1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> SFSetUp 1 0.0 nan nan 0.00e+00 0.0 4.0e+00 2.0e+04
> 5.0e-01 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> SFPack 19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 1.80e-05 0
> 0.00e+00 0
> SFUnpack 19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> KSPSetUp 1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> KSPSolve 1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04
> 3.1e+04 12 100 100 100 49 12 100 100 100 98 2503 -nan 0 1.80e-05
> 0 0.00e+00 100
> PCSetUp 1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.5e+00 0 0 0 0 0 0 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
> PCApply 20691 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 5 0 0 0 0 5 0 0 0 0 -nan -nan 0 0.00e+00 0
> 0.00e+00 0
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Object Type Creations Destructions. Reports information only
> for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 7 3
> Vector 7 1
> Index Set 2 2
> Star Forest Graph 1 0
> Krylov Solver 1 0
> Preconditioner 1 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 3.01e-08
> Average time for MPI_Barrier(): 3.8054e-06
> Average time for zero size MPI_Send(): 7.101e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> -mat_type mpiaijkokkos # (source: command line)
> -vec_type kokkos # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3
> FCOPTFLAGS=-O3 HIPOPTFLAGS=-O3 --with-debugging=0 --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
> --LIBS="-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -lmpi
> -L/opt/cray/pe/mpich/8.1.23/gtl/lib -lmpi_gtl_hsa" --download-kokkos
> --download-kokkos-kernels --download-suitesparse --download-hypre
> --download-cmake
> -----------------------------------------
> Libraries compiled on 2024-03-05 17:04:36 on login08
> Machine characteristics:
> Linux-5.14.21-150400.24.46_12.0.83-cray_shasta_c-x86_64-with-glibc2.3.4
> Using PETSc directory: /autofs/nccs-svm1_home1/vanellam/Software/petsc
> Using PETSc arch: arch-linux-frontier-opt-gcc
> -----------------------------------------
>
> Using C compiler: cc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas
> -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector
> -fvisibility=hidden -O3
> Using Fortran compiler: ftn -fPIC -Wall -ffree-line-length-none
> -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
> -----------------------------------------
>
> Using include paths:
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include/suitesparse
> -I/opt/rocm-5.4.0/include
> -----------------------------------------
>
> Using C linker: cc
> Using Fortran linker: ftn
> Using libraries:
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -lpetsc
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -Wl,-rpath,/opt/rocm-5.4.0/lib -L/opt/rocm-5.4.0/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib
> -L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/gtl/lib
> -L/opt/cray/pe/mpich/8.1.23/gtl/lib -Wl,-rpath,/opt/cray/pe/libsci/
> 22.12.1.1/GNU/9.1/x86_64/lib -L/opt/cray/pe/libsci/
> 22.12.1.1/GNU/9.1/x86_64/lib
> -Wl,-rpath,/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -L/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib -Wl,-rpath,/opt/cray/pe/pmi/6.1.8/lib
> -L/opt/cray/pe/pmi/6.1.8/lib
> -Wl,-rpath,/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64
> -L/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0
> -L/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib64
> -L/opt/cray/pe/gcc/12.2.0/snos/lib64 -Wl,-rpath,/opt/rocm-5.4.0/llvm/lib
> -L/opt/rocm-5.4.0/llvm/lib -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib
> -L/opt/cray/pe/gcc/12.2.0/snos/lib -lHYPRE -lspqr -lumfpack -lklu -lcholmod
> -lamd -lkokkoskernels -lkokkoscontainers -lkokkoscore -lkokkossimd
> -lhipsparse -lhipblas -lhipsolver -lrocsparse -lrocsolver -lrocblas
> -lrocrand -lamdhip64 -lmpi -lmpi_gtl_hsa -ldarshan -lz -ldl -lxpmem
> -lgfortran -lm -lmpifort_gnu_91 -lmpi_gnu_91 -lsci_gnu_82_mpi -lsci_gnu_82
> -ldsmml -lpmi -lpmi2 -lgfortran -lquadmath -lpthread -lm -lgcc_s -lstdc++
> -lquadmath -lmpi -lmpi_gtl_hsa
> -----------------------------------------
>
>