[petsc-dev] Kokkos/Crusher performance

Justin Chang jychang48 at gmail.com
Wed Jan 26 13:32:05 CST 2022


rocgdb requires "-ggdb" in addition to "-g"

What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was
hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace"
showing what the last successful HIP/HSA call was. I believe it should also
show line numbers in the code.
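
To make that concrete, here is roughly what I have in mind (the flags and
the launch line below are only illustrative; adjust for your configure and
batch setup):

  # rebuild with device debug info so rocgdb can map back to source,
  # e.g. add -ggdb next to -g in the COPTFLAGS/CXXOPTFLAGS you configure with
  export AMD_LOG_LEVEL=2    # 1 or 2 keeps the runtime log manageable
  srun -N2 -n128 bash -c './ex13 [args] > hip.$SLURM_PROCID.log 2>&1'
  tail hip.*.log            # last HIP/HSA activity on each rank before the failure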

On Wed, Jan 26, 2022 at 1:29 PM Mark Adams <mfadams at lbl.gov> wrote:

>
>
> On Wed, Jan 26, 2022 at 1:54 PM Justin Chang <jychang48 at gmail.com> wrote:
>
>> Couple suggestions:
>>
>> 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will
>> tell you everything that's happening at the HIP level (memcpy's, mallocs,
>> kernel execution time, etc)
>>
>
> Hmm, my reproducer uses 2 nodes and 128 processes. I don't think I could do
> much with this flood of data.
>
>
>> 2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that
>> we officially support.
>>
>
> rocgdb just sat there reading symbols forever. I looked at your doc.
> Valgrind seems OK here.
>
>
>> There are some tricks to running this together with MPI; you can just
>> google "mpi with gdb". But you can see how rocgdb works here:
>> https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
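>>
>> One common trick is a tiny wrapper script so that only one rank runs
>> under the debugger; a rough sketch (the rank choice, flags, and paths are
>> placeholders, and you may need a pty/xterm for the interactive session):
>>
>>   #!/bin/bash
>>   # rocgdb_rank0.sh: run rank 0 under rocgdb, all other ranks normally
>>   if [ "${SLURM_PROCID}" = "0" ]; then
>>     exec rocgdb --args "$@"
>>   else
>>     exec "$@"
>>   fi
>>
>> launched as something like "srun -N2 -n2 ./rocgdb_rank0.sh ./ex13 [args]".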
>>
>>
>> On Wed, Jan 26, 2022 at 9:56 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
>>> like a memory corruption issue, and tracking down exactly when the
>>> corruption begins is three quarters of the way to finding the exact cause.
>>>
>>>   Are the crashes reproducible in the same place with identical runs?
>>>
>>>
>>> On Jan 26, 2022, at 10:46 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> I think it is an MPI bug. It works with GPU-aware MPI turned off.
>>> I am sure Summit will be fine.
>>> We have had users fix this error by switching their MPI.
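>>>
>>> (For reference, "off" here just means toggling the PETSc option, roughly:
>>>
>>>   srun ... ./ex13 [args] -use_gpu_aware_mpi 0   # runs clean
>>>   srun ... ./ex13 [args] -use_gpu_aware_mpi 1   # hits the PetscTableFind "Key <= 0" error
>>>
>>> with everything else identical.)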
>>>
>>> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>> I don't know if this is due to bugs in the petsc/kokkos backend. See if
>>>> you can run on 6 nodes (48 MPI ranks). If it fails, then run the same
>>>> problem on Summit with 8 nodes to see if it still fails. If it does, it is
>>>> likely a bug of our own.
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> I am not able to reproduce this with a small problem; 2 nodes, or less
>>>>> refinement, works. This is from the 8-node test, the -dm_refine 5 version.
>>>>> I see that it comes from PtAP.
>>>>> This is on the fine grid. (I was thinking it could be on a reduced
>>>>> grid with idle processors, but no.)
>>>>>
>>>>> [15]PETSC ERROR: Argument out of range
>>>>> [15]PETSC ERROR: Key <= 0
>>>>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>>>> shooting.
>>>>> [15]PETSC ERROR: Petsc Development GIT revision:
>>>>> v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
>>>>> [15]PETSC ERROR:
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>>>>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>>>>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>>>>> --with-fc=ftn --with-fortran-bindings=0
>>>>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
>>>>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>>>>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>>>>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>>>>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>>>>> --download-p4est=1
>>>>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>>>>> PETSC_ARCH=arch-olcf-crusher
>>>>> [15]PETSC ERROR: #1 PetscTableFind() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>>>>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>>>>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>>>>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>>>>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>>>>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices()
>>>>> at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>>>>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>>>>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>>>>> [15]PETSC ERROR: #9 MatProductSymbolic() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>>>>> [15]PETSC ERROR: #10 MatPtAP() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>>>>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>>>>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>>>>> [15]PETSC ERROR: #13 PCSetUp() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>>>>> [15]PETSC ERROR: #14 KSPSetUp() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>>>>> [15]PETSC ERROR: #15 KSPSolve_Private() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>>>>> [15]PETSC ERROR: #16 KSPSolve() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>>>>> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
>>>>> [15]PETSC ERROR: #18 SNESSolve() at
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
>>>>> [15]PETSC ERROR: #19 main() at ex13.c:169
>>>>> [15]PETSC ERROR: PETSc Option Table entries:
>>>>> [15]PETSC ERROR: -benchmark_it 10
>>>>>
>>>>> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>> The GPU-aware MPI is dying going from 1 to 8 nodes, 8 processes per node.
>>>>>> I will make a minimal reproducer, starting with 2 nodes, one process on
>>>>>> each node.
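>>>>>>
>>>>>> Something like this is what I have in mind, with the ex13 options still
>>>>>> to be cut down (the srun flags are just a sketch):
>>>>>>
>>>>>>   srun -N2 -n2 --ntasks-per-node=1 ./ex13 -dm_refine N -benchmark_it 10 \
>>>>>>       -use_gpu_aware_mpi 1
>>>>>>
>>>>>> and then grow toward 8 nodes with 8 ranks per node to match the failing
>>>>>> case.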
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith <bsmith at petsc.dev>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>   So the MPI is killing you in going from 8 to 64. (The GPU flop
>>>>>>> rate scales almost perfectly, but the overall flop rate is only half of
>>>>>>> what it should be at 64).
>>>>>>>
>>>>>>> On Jan 25, 2022, at 9:24 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>
>>>>>>> It looks like we have our instrumentation and job configuration in
>>>>>>> decent shape, so on to scaling with AMG.
>>>>>>> Using multiple nodes I got errors about table entries not found,
>>>>>>> which can be caused by a buggy MPI, and the problem does go away when I
>>>>>>> turn GPU-aware MPI off.
>>>>>>> Jed's analysis, if I have this right, is that at *0.7T* flops we
>>>>>>> are at about 35% of theoretical peak wrt memory bandwidth.
>>>>>>> I run out of memory at the next step in this study (7 levels of
>>>>>>> refinement), with 2M equations per GPU. That seems low to me and we will
>>>>>>> see if we can fix it.
>>>>>>> So this 0.7 Tflops is with only 1/4M equations, so 35% is not
>>>>>>> terrible.
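>>>>>>> (Back of the envelope, with only approximate numbers: a Crusher node
>>>>>>> has 8 GCDs at very roughly 1.6 TB/s of HBM bandwidth each, ~13 TB/s per
>>>>>>> node, and AIJ MatMult moves something like 6-10 bytes per flop, so the
>>>>>>> bandwidth-limited ceiling is on the order of 1.3-2 Tflop/s per node;
>>>>>>> 0.7 Tflop/s against that is in the 35-50% range, consistent with Jed's
>>>>>>> number.)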
>>>>>>> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6
>>>>>>> levels of refinement.
>>>>>>>
>>>>>>> out_001_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 1.2933e+00 1.0 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100 248792   423857   6840 3.85e+02 6792 3.85e+02 100
>>>>>>> out_001_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 5.3667e+00 1.0 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100 571572   *700002*   7920 1.74e+03 7920 1.74e+03 100
>>>>>>> out_008_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 1.9407e+00 1.0 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100 1581096   3034723   7920 6.88e+02 7920 6.88e+02 100
>>>>>>> out_008_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 7.4478e+00 1.0 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100 3798162   5557106   9367 3.02e+03 9359 3.02e+03 100
>>>>>>> out_064_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 2.4551e+00 1.0 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100 11065887   23792978   8684 8.90e+02 8683 8.90e+02 100
>>>>>>> out_064_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 1.1335e+01 1.0 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100 24130606   43326249   11249 4.26e+03 11249 4.26e+03 100
>>>>>>>
>>>>>>> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>> Note that Mark's logs have been switching back and forth between
>>>>>>>>> -use_gpu_aware_mpi settings and changing the number of ranks -- we won't
>>>>>>>>> have that information if we do manual timing hacks. This is going to be a
>>>>>>>>> routine thing we'll need on the mailing list, and we need the provenance
>>>>>>>>> to go with it.
>>>>>>>>>
>>>>>>>>
>>>>>>>> GPU-aware MPI crashes sometimes, so to be safe while debugging I
>>>>>>>> had it off. It works fine here, so it has been on in the last tests.
>>>>>>>> Here is a comparison.
>>>>>>>>
>>>>>>>>
>>>>>>> <tt.tar>
>>>>>>>
>>>>>>>
>>>>>>>
>>>

