[petsc-users] Trying to understand -log_view when using HIP kernels (ex34)

Barry Smith bsmith at petsc.dev
Fri Jan 19 14:17:48 CST 2024


  Junchao,

    I ran the following on the CI machine. Why does this error occur? With trivial solver options it runs OK.

bsmith at petsc-gpu-02:/scratch/bsmith/petsc/src/ksp/ksp/tutorials$ ./ex34 -da_grid_x 192 -da_grid_y 192 -da_grid_z 192 -dm_mat_type seqaijhipsparse -dm_vec_type seqhip -ksp_max_it 10 -ksp_monitor -ksp_type richardson -ksp_view -log_view -mg_coarse_ksp_max_it 2 -mg_coarse_ksp_type richardson -mg_coarse_pc_type none -mg_levels_ksp_type richardson -mg_levels_pc_type none -options_left -pc_mg_levels 3 -pc_mg_log -pc_type mg
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: hipSPARSE errorcode 3 (HIPSPARSE_STATUS_INVALID_VALUE)
[0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR:   Option left: name:-options_left (no value) source: command line
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.20.3, unknown 
[0]PETSC ERROR: ./ex34 on a  named petsc-gpu-02 by bsmith Fri Jan 19 14:15:20 2024
[0]PETSC ERROR: Configure options --package-prefix-hash=/home/bsmith/petsc-hash-pkgs --with-make-np=24 --with-make-test-np=8 --with-hipc=/opt/rocm-5.4.3/bin/hipcc --with-hip-dir=/opt/rocm-5.4.3 COPTFLAGS="-g -O" FOPTFLAGS="-g -O" CXXOPTFLAGS="-g -O" HIPOPTFLAGS="-g -O" --with-cuda=0 --with-hip=1 --with-precision=double --with-clanguage=c --download-kokkos --download-kokkos-kernels --download-hypre --download-magma --with-magma-fortran-bindings=0 --download-mfem --download-metis --with-strict-petscerrorcode PETSC_ARCH=arch-ci-linux-hip-double
[0]PETSC ERROR: #1 MatMultAddKernel_SeqAIJHIPSPARSE() at /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3131
[0]PETSC ERROR: #2 MatMultAdd_SeqAIJHIPSPARSE() at /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3004
[0]PETSC ERROR: #3 MatMultAdd() at /scratch/bsmith/petsc/src/mat/interface/matrix.c:2770
[0]PETSC ERROR: #4 MatInterpolateAdd() at /scratch/bsmith/petsc/src/mat/interface/matrix.c:8603
[0]PETSC ERROR: #5 PCMGMCycle_Private() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:87
[0]PETSC ERROR: #6 PCMGMCycle_Private() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:83
[0]PETSC ERROR: #7 PCApply_MG_Internal() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:611
[0]PETSC ERROR: #8 PCApply_MG() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:633
[0]PETSC ERROR: #9 PCApply() at /scratch/bsmith/petsc/src/ksp/pc/interface/precon.c:498
[0]PETSC ERROR: #10 KSP_PCApply() at /scratch/bsmith/petsc/include/petsc/private/kspimpl.h:383
[0]PETSC ERROR: #11 KSPSolve_Richardson() at /scratch/bsmith/petsc/src/ksp/ksp/impls/rich/rich.c:106
[0]PETSC ERROR: #12 KSPSolve_Private() at /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:906
[0]PETSC ERROR: #13 KSPSolve() at /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:1079
[0]PETSC ERROR: #14 main() at ex34.c:52
[0]PETSC ERROR: PETSc Option Table entries:

  Dave,

    I am trying to debug the missing 7% now, but I am having trouble running the example, as you can see above.
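
    For reference, the arithmetic behind the 7% can be sketched as below. This is only a sketch: it assumes "GPU %F" is the GPU flop count divided by all logged flops for the event, and the function name and the flop counts are hypothetical, not taken from the attached log.

```python
def gpu_flop_percent(cpu_flops: float, gpu_flops: float) -> float:
    """Percentage of an event's logged flops attributed to the GPU.

    Assumption: the percentage is gpu / (cpu + gpu); this is a reading of
    the "GPU %F" column, not a copy of the PETSc implementation.
    """
    total = cpu_flops + gpu_flops
    if total == 0:
        return 0.0  # assumption: an event with no flops is shown as 0
    return 100.0 * gpu_flops / total


# Hypothetical counts for one smoother event: 7% of the flops logged as CPU.
print(round(gpu_flop_percent(cpu_flops=7.0e7, gpu_flops=9.3e8)))
```

    Under that assumption, a value of 93 means some operation nested inside the event logged its flops as CPU flops, even if every event listed separately shows 100.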



> On Jan 19, 2024, at 3:02 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
> 
> Thank you Barry and Junchao for these explanations. I'll turn on -log_view_gpu_time.
> 
> Do either of you have any thoughts regarding why the percentage of flops reported on the GPU is not 100% for MGSmooth Level {0,1,2} with this solver configuration?
> 
> This number should have nothing to do with timings as it reports the ratio of operations performed on the GPU and CPU, presumably obtained from PetscLogFlops() and PetscLogGpuFlops().
> 
> Cheers,
> Dave
> 
> On Fri, 19 Jan 2024 at 11:39, Junchao Zhang <junchao.zhang at gmail.com> wrote:
>> Try to also add -log_view_gpu_time, https://petsc.org/release/manualpages/Profiling/PetscLogGpuTime/
>> 
>> --Junchao Zhang
>> 
>> 
>> On Fri, Jan 19, 2024 at 11:35 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>>> Hi all,
>>> 
>>> I am trying to understand the logging information associated with the %flops-performed-on-the-gpu reported by -log_view when running 
>>>   src/ksp/ksp/tutorials/ex34
>>> with the following options
>>> -da_grid_x 192
>>> -da_grid_y 192
>>> -da_grid_z 192
>>> -dm_mat_type seqaijhipsparse
>>> -dm_vec_type seqhip
>>> -ksp_max_it 10
>>> -ksp_monitor
>>> -ksp_type richardson
>>> -ksp_view
>>> -log_view
>>> -mg_coarse_ksp_max_it 2
>>> -mg_coarse_ksp_type richardson
>>> -mg_coarse_pc_type none
>>> -mg_levels_ksp_type richardson
>>> -mg_levels_pc_type none
>>> -options_left
>>> -pc_mg_levels 3
>>> -pc_mg_log
>>> -pc_type mg
>>> 
>>> This config is not intended to actually solve the problem; rather, it is a stripped-down set of options designed to show which parts of the smoothers are executed on the GPU.
>>> 
>>> With respect to the log file attached, my first set of questions relates to the data reported under "Event Stage 2: MG Apply".
>>> 
>>> [1] Why is the log littered with nan's?
>>> * I don't understand how and why "GPU Mflop/s" should be reported as nan when a value is given for "GPU %F" (see MatMult for example).
>>> 
>>> * For events executed on the GPU, I assume the column "Time (sec)" relates to "CPU execute time"; this would explain why we see a nan in "Time (sec)" for MatMult.
>>> If my assumption is correct, how should I interpret the column "Flop (Max)", which is showing 1.92e+09?
>>> I would assume that if "Time (sec)" relates to the CPU, then "Flop (Max)" should also relate to the CPU, and GPU flops would be logged in "GPU Mflop/s".
>>> 
>>> [2] More curious is that within "Event Stage 2: MG Apply" KSPSolve, MGSmooth Level 0, MGSmooth Level 1, MGSmooth Level 2 all report "GPU %F" as 93. I believe this value should be 100 as the smoother (and coarse grid solver) are configured as richardson(2)+none and thus should run entirely on the GPU. 
>>> Furthermore, when one inspects all events listed under "Event Stage 2: MG Apply" those events which do flops correctly report "GPU %F" as 100. 
>>> And the events showing "GPU %F" = 0 such as 
>>>   MatHIPSPARSCopyTo, VecCopy, VecSet, PCApply, DCtxSync
>>> don't do any flops (on the CPU or GPU), which is also correct (although shouldn't non-GPU events show nan??)
>>> 
>>> Hence I am wondering what is the explanation for the missing 7% from "GPU %F" for KSPSolve and MGSmooth {0,1,2}??
>>> 
>>> Does anyone understand this -log_view, or can explain to me how to interpret it?
>>> 
>>> It could simply be that:
>>> a) something is messed up with -pc_mg_log
>>> b) something is messed up with the PETSc build
>>> c) I am putting too much faith in -log_view and should profile the code differently.
>>> 
>>> Either way I'd really like to understand what is going on.
>>> 
>>> 
>>> Cheers,
>>> Dave
>>> 
>>> 
>>> 
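
On the nan question in [1], a small sketch may help show how a nan rate and a valid "GPU %F" can coexist. It assumes that the time of a GPU event is reported as nan when it is not measured (my reading of why -log_view_gpu_time was suggested, not something confirmed from the PETSc source here). The helper name is hypothetical; the flop count is the "Flop (Max)" value quoted for MatMult.

```python
import math


def mflops(flops: float, seconds: float) -> float:
    """Rate in Mflop/s; a nan time propagates to a nan rate."""
    return flops / seconds / 1.0e6


unmeasured_time = float("nan")  # stands in for a GPU event time that was not recorded
rate = mflops(1.92e9, unmeasured_time)

print(math.isnan(rate))  # the rate is nan because the time is nan
```

A percentage built only from flop counts never touches the time, so it stays well defined even when the rate does not.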
