[petsc-users] Trying to understand -log_view when using HIP kernels (ex34)
Junchao Zhang
junchao.zhang at gmail.com
Fri Jan 19 17:58:07 CST 2024
I reproduced this HIPSPARSE_STATUS_INVALID_VALUE error, but have not yet
found obvious input argument errors for this hipsparse call.
On Fri, Jan 19, 2024 at 2:18 PM Barry Smith <bsmith at petsc.dev> wrote:
>
> Junchao
>
> I ran the following on the CI machine. Why does this happen? With
> trivial solver options it runs ok.
>
> bsmith at petsc-gpu-02:/scratch/bsmith/petsc/src/ksp/ksp/tutorials$ ./ex34
> -da_grid_x 192 -da_grid_y 192 -da_grid_z 192 -dm_mat_type seqaijhipsparse
> -dm_vec_type seqhip -ksp_max_it 10 -ksp_monitor -ksp_type richardson
> -ksp_view -log_view -mg_coarse_ksp_max_it 2 -mg_coarse_ksp_type richardson
> -mg_coarse_pc_type none -mg_levels_ksp_type richardson -mg_levels_pc_type
> none -options_left -pc_mg_levels 3 -pc_mg_log -pc_type mg
>
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
>
> [0]PETSC ERROR: GPU error
>
> [0]PETSC ERROR: hipSPARSE errorcode 3 (HIPSPARSE_STATUS_INVALID_VALUE)
>
> [0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the
> program crashed before usage or a spelling mistake, etc!
>
> [0]PETSC ERROR: Option left: name:-options_left (no value) source:
> command line
>
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>
> [0]PETSC ERROR: Petsc Release Version 3.20.3, unknown
>
> [0]PETSC ERROR: ./ex34 on a named petsc-gpu-02 by bsmith Fri Jan 19
> 14:15:20 2024
>
> [0]PETSC ERROR: Configure options
> --package-prefix-hash=/home/bsmith/petsc-hash-pkgs --with-make-np=24
> --with-make-test-np=8 --with-hipc=/opt/rocm-5.4.3/bin/hipcc
> --with-hip-dir=/opt/rocm-5.4.3 COPTFLAGS="-g -O" FOPTFLAGS="-g -O"
> CXXOPTFLAGS="-g -O" HIPOPTFLAGS="-g -O" --with-cuda=0 --with-hip=1
> --with-precision=double --with-clanguage=c --download-kokkos
> --download-kokkos-kernels --download-hypre --download-magma
> --with-magma-fortran-bindings=0 --download-mfem --download-metis
> --with-strict-petscerrorcode PETSC_ARCH=arch-ci-linux-hip-double
>
> [0]PETSC ERROR: #1 MatMultAddKernel_SeqAIJHIPSPARSE() at
> /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3131
>
> [0]PETSC ERROR: #2 MatMultAdd_SeqAIJHIPSPARSE() at
> /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3004
>
> [0]PETSC ERROR: #3 MatMultAdd() at
> /scratch/bsmith/petsc/src/mat/interface/matrix.c:2770
>
> [0]PETSC ERROR: #4 MatInterpolateAdd() at
> /scratch/bsmith/petsc/src/mat/interface/matrix.c:8603
>
> [0]PETSC ERROR: #5 PCMGMCycle_Private() at
> /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:87
>
> [0]PETSC ERROR: #6 PCMGMCycle_Private() at
> /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:83
>
> [0]PETSC ERROR: #7 PCApply_MG_Internal() at
> /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:611
>
> [0]PETSC ERROR: #8 PCApply_MG() at
> /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:633
>
> [0]PETSC ERROR: #9 PCApply() at
> /scratch/bsmith/petsc/src/ksp/pc/interface/precon.c:498
>
> [0]PETSC ERROR: #10 KSP_PCApply() at
> /scratch/bsmith/petsc/include/petsc/private/kspimpl.h:383
>
> [0]PETSC ERROR: #11 KSPSolve_Richardson() at
> /scratch/bsmith/petsc/src/ksp/ksp/impls/rich/rich.c:106
>
> [0]PETSC ERROR: #12 KSPSolve_Private() at
> /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:906
>
> [0]PETSC ERROR: #13 KSPSolve() at
> /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:1079
>
> [0]PETSC ERROR: #14 main() at ex34.c:52
>
> [0]PETSC ERROR: PETSc Option Table entries:
>
> Dave,
>
> Trying to debug the 7% now, but having trouble running, as you see
> above.
>
>
>
> On Jan 19, 2024, at 3:02 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
>
> Thank you Barry and Junchao for these explanations. I'll turn on
> -log_view_gpu_time.
>
> Do either of you have any thoughts regarding why the percentage of flops
> being reported on the GPU is not 100% for MGSmooth Level {0,1,2} for this
> solver configuration?
>
> This number should have nothing to do with timings as it reports the ratio
> of operations performed on the GPU and CPU, presumably obtained from
> PetscLogFlops() and PetscLogGpuFlops().
>
> Cheers,
> Dave
>
> On Fri, 19 Jan 2024 at 11:39, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Try to also add -log_view_gpu_time,
>> https://petsc.org/release/manualpages/Profiling/PetscLogGpuTime/
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Jan 19, 2024 at 11:35 AM Dave May <dave.mayhem23 at gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to understand the logging information associated with the
>>> %flops-performed-on-the-gpu reported by -log_view when running
>>> src/ksp/ksp/tutorials/ex34
>>> with the following options
>>> -da_grid_x 192
>>> -da_grid_y 192
>>> -da_grid_z 192
>>> -dm_mat_type seqaijhipsparse
>>> -dm_vec_type seqhip
>>> -ksp_max_it 10
>>> -ksp_monitor
>>> -ksp_type richardson
>>> -ksp_view
>>> -log_view
>>> -mg_coarse_ksp_max_it 2
>>> -mg_coarse_ksp_type richardson
>>> -mg_coarse_pc_type none
>>> -mg_levels_ksp_type richardson
>>> -mg_levels_pc_type none
>>> -options_left
>>> -pc_mg_levels 3
>>> -pc_mg_log
>>> -pc_type mg
>>>
>>> This config is not intended to actually solve the problem, rather it is
>>> a stripped down set of options designed to understand what parts of the
>>> smoothers are being executed on the GPU.
>>>
>>> With respect to the log file attached, my first set of questions relates
>>> to the data reported under "Event Stage 2: MG Apply".
>>>
>>> [1] Why is the log littered with nan's?
>>> * I don't understand how and why "GPU Mflop/s" should be reported as nan
>>> when a value is given for "GPU %F" (see MatMult for example).
>>>
>>> * For events executed on the GPU, I assume the column "Time (sec)"
>>> relates to "CPU execute time", this would explain why we see a nan in "Time
>>> (sec)" for MatMult.
>>> If my assumption is correct, how should I interpret the column "Flop
>>> (Max)" which is showing 1.92e+09?
>>> I would assume if "Time (sec)" relates to the CPU then "Flop (Max)"
>>> should also relate to CPU and GPU flops would be logged in "GPU Mflop/s"
>>>
>>> [2] More curious is that within "Event Stage 2: MG Apply" KSPSolve,
>>> MGSmooth Level 0, MGSmooth Level 1, MGSmooth Level 2 all report "GPU %F" as
>>> 93. I believe this value should be 100 as the smoother (and coarse grid
>>> solver) are configured as richardson(2)+none and thus should run entirely
>>> on the GPU.
>>> Furthermore, when one inspects all events listed under "Event Stage 2:
>>> MG Apply" those events which do flops correctly report "GPU %F" as 100.
>>> And the events showing "GPU %F" = 0 such as
>>> MatHIPSPARSCopyTo, VecCopy, VecSet, PCApply, DCtxSync
>>> don't do any flops (on the CPU or GPU) - which is also correct
>>> (although non GPU events should show nan??)
>>>
>>> Hence I am wondering what is the explanation for the missing 7% from
>>> "GPU %F" for KSPSolve and MGSmooth {0,1,2}??
>>>
>>> Does anyone understand this -log_view, or can explain to me how to
>>> interpret it?
>>>
>>> It could simply be that:
>>> a) something is messed up with -pc_mg_log
>>> b) something is messed up with the PETSc build
>>> c) I am putting too much faith in -log_view and should profile the code
>>> differently.
>>>
>>> Either way I'd really like to understand what is going on.
>>>
>>>
>>> Cheers,
>>> Dave
>>>
>>>
>>>
>>>
>
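For readers following the "GPU %F" question quoted above, here is a minimal sketch of the arithmetic Dave describes, assuming the column is simply the GPU share of all flops logged for an event. The function name, the event flop counts, and the idea of flops charged only to the enclosing event are hypothetical and chosen for illustration; none of the numbers come from the attached log, and this is not PETSc's actual implementation.

```python
def gpu_flop_percent(cpu_flops, gpu_flops):
    """Share of an event's flops that were logged as GPU flops, in percent.

    Assumption: an event that logs no flops at all reports 0, which matches
    the VecSet/VecCopy rows described in the quoted message.
    """
    total = cpu_flops + gpu_flops
    return 100.0 * gpu_flops / total if total > 0 else 0.0


# Hypothetical (cpu_flops, gpu_flops) pairs for events nested in a smoother.
children = {
    "MatMult": (0, 6_000_000_000),
    "VecAXPY": (0, 3_300_000_000),
    "VecSet": (0, 0),
}

# Hypothetical flops logged on the CPU side inside the enclosing event
# but outside every child event listed above.
parent_only_cpu_flops = 700_000_000

for name, (cpu, gpu) in children.items():
    print(f"{name:10s} GPU %F = {gpu_flop_percent(cpu, gpu):.0f}")

parent_cpu = parent_only_cpu_flops + sum(cpu for cpu, _ in children.values())
parent_gpu = sum(gpu for _, gpu in children.values())
parent_percent = gpu_flop_percent(parent_cpu, parent_gpu)
print(f"{'MGSmooth':10s} GPU %F = {parent_percent:.0f}")
```

Under these assumptions every child event that does flops shows 100 while the enclosing event shows 93, so one thing to look for is flops logged as CPU flops directly inside KSPSolve or MGSmooth rather than inside any of the listed child events.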