[petsc-dev] Kokkos/Crusher perforance

Justin Chang jychang48 at gmail.com
Wed Jan 26 12:54:25 CST 2022


A couple of suggestions (rough command sketches for both follow below):

1. Set the environment variable "export AMD_LOG_LEVEL=3" - this will show you
everything that is happening at the HIP level (memcpys, mallocs, kernel
execution times, etc.).
2. Try rocgdb; AFAIK this is the closest "HIP variant of valgrind" that we
officially support. There are some tricks to running it together with MPI,
for which you can just google "mpi with gdb". You can see how rocgdb works
here:
https://www.olcf.ornl.gov/wp-content/uploads/2021/04/rocgdb_hipmath_ornl_2021_v2.pdf
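
A rough sketch of both on an srun-based launcher (the executable name, rank
counts, and batch options are placeholders borrowed from elsewhere in this
thread, not a tested recipe):

   # HIP-level tracing of memcpys, mallocs, and kernel launches
   export AMD_LOG_LEVEL=3
   srun -p batch -N 1 -n 8 ./ex13 -dm_refine 5

   # one common "mpi with gdb" trick: run each rank under its own debugger,
   # e.g. one xterm per rank (assumes X forwarding is available)
   srun -p batch -N 1 -n 8 xterm -e rocgdb --args ./ex13 -dm_refine 5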


On Wed, Jan 26, 2022 at 9:56 AM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Any way to run with valgrind (or a HIP variant of valgrind)? It looks
> like a memory corruption issue and tracking down exactly when the
> corruption begins is 3/4's of the way to finding the exact cause.
>
>   Are the crashes reproducible in the same place with identical runs?
>
>
> On Jan 26, 2022, at 10:46 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
> I think it is an MPI bug. It works with GPU aware MPI turned off.
> I am sure Summit will be fine.
> We have had users fix this error by switching their MPI.
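>
> For example (just a sketch, not a tested recipe; the node and rank counts
> are the failing case from later in this thread), the GPU-aware path can be
> turned off at runtime with the -use_gpu_aware_mpi option that also comes up
> below:
>
>    # stage MPI buffers through host memory instead of passing GPU pointers
>    srun -N 8 --ntasks-per-node=8 -n 64 ./ex13 -dm_refine 5 -use_gpu_aware_mpi 0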
>
> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> I don't know if this is due to bugs in the petsc/kokkos backend.  See if
>> you can run on 6 nodes (48 MPI ranks).  If it fails, run the same problem
>> on Summit with 8 nodes to see if it still fails. If it does, it is likely a
>> bug of our own.
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> I am not able to reproduce this with a small problem; 2 nodes or less
>>> refinement works. This is from the 8-node test, the -dm_refine 5 version.
>>> I see that it comes from PtAP.
>>> This is on the fine grid. (I was thinking it could be on a reduced grid
>>> with idle processors, but no.)
>>>
>>> [15]PETSC ERROR: Argument out of range
>>> [15]PETSC ERROR: Key <= 0
>>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb
>>>  GIT Date: 2022-01-25 09:20:51 -0500
>>> [15]PETSC ERROR:
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a
>>> arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC
>>> --with-fc=ftn --with-fortran-bindings=0
>>> LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0
>>> --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g
>>> --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00"
>>> --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a
>>> --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0
>>> --download-p4est=1
>>> --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4
>>> PETSC_ARCH=arch-olcf-crusher
>>> [15]PETSC ERROR: #1 PetscTableFind() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>>> [15]PETSC ERROR: #5 MatAssemblyEnd() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>>> [15]PETSC ERROR: #9 MatProductSymbolic() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>>> [15]PETSC ERROR: #10 MatPtAP() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>>> [15]PETSC ERROR: #13 PCSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>>> [15]PETSC ERROR: #14 KSPSetUp() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>>> [15]PETSC ERROR: #15 KSPSolve_Private() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>>> [15]PETSC ERROR: #16 KSPSolve() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>>> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
>>> [15]PETSC ERROR: #18 SNESSolve() at
>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
>>> [15]PETSC ERROR: #19 main() at ex13.c:169
>>> [15]PETSC ERROR: PETSc Option Table entries:
>>> [15]PETSC ERROR: -benchmark_it 10
>>>
>>> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> The GPU-aware MPI is dying going from 1 to 8 nodes, 8 processes per node.
>>>> I will make a minimal reproducer, starting with 2 nodes, one process on
>>>> each node.
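>>>>
>>>> Something like this, perhaps (a hypothetical sketch; the partition,
>>>> account, and time limit are copied from the configure line above, and
>>>> ex13 with -dm_refine 5 is just the test case in this thread):
>>>>
>>>>    # 2 nodes, 1 rank per node, GPU-aware MPI left on to trigger the bug
>>>>    srun -p batch -A csc314_crusher -t 00:10:00 -N 2 -n 2 --ntasks-per-node=1 \
>>>>      ./ex13 -dm_refine 5 -use_gpu_aware_mpi 1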
>>>>
>>>>
>>>> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>   So the MPI is killing you in going from 8 to 64. (The GPU flop rate
>>>>> scales almost perfectly, but the overall flop rate is only half of what it
>>>>> should be at 64).
>>>>>
>>>>> On Jan 25, 2022, at 9:24 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> It looks like we have our instrumentation and job configuration in
>>>>> decent shape, so on to scaling with AMG.
>>>>> Using multiple nodes, I got errors with table entries not found, which
>>>>> can be caused by a buggy MPI, and the problem does go away when I turn
>>>>> GPU-aware MPI off.
>>>>> Jed's analysis, if I have this right, is that at *0.7T* flops we are at
>>>>> about 35% of theoretical peak wrt memory bandwidth.
>>>>> I run out of memory at the next step in this study (7 levels of
>>>>> refinement), with 2M equations per GPU. This seems low to me and we will
>>>>> see if we can fix it.
>>>>> So this 0.7 Tflops is with only 1/4M equations, so 35% is not terrible.
>>>>> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6
>>>>> levels of refinement.
>>>>>
>>>>> out_001_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 1.2933e+00
>>>>> 1.0 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100
>>>>> 248792   423857   6840 3.85e+02 6792 3.85e+02 100
>>>>> out_001_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 5.3667e+00
>>>>> 1.0 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100
>>>>> 571572   *700002*   7920 1.74e+03 7920 1.74e+03 100
>>>>> out_008_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 1.9407e+00
>>>>> 1.0 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100
>>>>> 1581096   3034723   7920 6.88e+02 7920 6.88e+02 100
>>>>> out_008_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 7.4478e+00
>>>>> 1.0 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100
>>>>> 3798162   5557106   9367 3.02e+03 9359 3.02e+03 100
>>>>> out_064_kokkos_Crusher_5_1.txt:KSPSolve              10 1.0 2.4551e+00
>>>>> 1.0 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100
>>>>> 11065887   23792978   8684 8.90e+02 8683 8.90e+02 100
>>>>> out_064_kokkos_Crusher_6_1.txt:KSPSolve              10 1.0 1.1335e+01
>>>>> 1.0 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100
>>>>> 24130606   43326249   11249 4.26e+03 11249 4.26e+03 100
>>>>>
>>>>> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>>
>>>>>>> Note that Mark's logs have been switching back and forth between
>>>>>>> -use_gpu_aware_mpi and changing number of ranks -- we won't have that
>>>>>>> information if we do manual timing hacks. This is going to be a routine
>>>>>>> thing we'll need on the mailing list and we need the provenance to go with
>>>>>>> it.
>>>>>>>
>>>>>>
>>>>>> GPU-aware MPI crashes sometimes, so to be safe while debugging I had it
>>>>>> off. It works fine here, so it has been on in the last tests.
>>>>>> Here is a comparison.
>>>>>>
>>>>>>
>>>>> <tt.tar>
>>>>>
>>>>>
>>>>>
>