[petsc-dev] Kokkos/Crusher performance

Sun Jan 23 23:58:42 CST 2022

On Sun, Jan 23, 2022 at 11:22 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Jan 24, 2022, at 12:16 AM, Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>
>
> On Sun, Jan 23, 2022 at 10:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Junchao,
>>
>>      Without GPU aware MPI, is it moving the entire vector to the CPU and
>> doing the scatter and moving everything back or does it just move up
>> exactly what needs to be sent to the other ranks and move back exactly what
>> it received from other ranks?
>>
> It only moves entries needed, using a kernel to pack/unpack them.
>
>
>     Ok, that pack kernel is Kokkos?  How come the pack times are so
> small compared to the MPI sends? In the logs those times are much
> smaller than the VecScatter times. Is the logging correct for how much
> stuff is sent up and down?
>
Yes, the pack/unpack kernels are kokkos.  I need to check the profiling.
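
To make the "only moves entries needed" point concrete, here is a minimal sketch of the pack/unpack idea in plain Python. It is not the PETSc implementation (the real kernels are Kokkos device kernels inside PetscSF); the function names and index lists below are hypothetical and only illustrate that the staged buffer has one entry per needed index, not one per vector entry.

```python
# Illustrative sketch only: pack() and unpack() are hypothetical names,
# not PETSc or Kokkos APIs.

def pack(local_values, send_indices):
    """Gather only the entries other ranks need into a contiguous buffer."""
    return [local_values[i] for i in send_indices]


def unpack(recv_buffer, ghost_values, recv_indices):
    """Scatter received entries into their slots in the ghost array."""
    for slot, value in zip(recv_indices, recv_buffer):
        ghost_values[slot] = value
    return ghost_values


local = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

# Only two of the six local entries are needed elsewhere, so only two
# values would be copied to the host and handed to MPI.
send_buffer = pack(local, [1, 4])

# On the receiving side the buffer is scattered into the ghost slots.
ghosts = unpack(send_buffer, [0.0, 0.0, 0.0], [2, 0])

print(len(send_buffer), ghosts)
```

Without GPU-aware MPI, the buffer that crosses the device/host boundary in this picture is `send_buffer`, not `local`.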

>
>
>>     It is moving 4.74e+02 * 1e+6 bytes total data up and then down. Is
>> that a reasonable amount?
>>
>>     Why is it moving 800 distinct counts up and 800 distinct counts down
>> when the MatMult is done 400 times, shouldn't it be 400 counts?
>>
>>   Mark,
>>
>>      Can you run both with GPU aware MPI?
>>
>>
>>   Norm, AXPY, pointwisemult roughly the same.
>>
>>
>> On Jan 23, 2022, at 11:24 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Ugh, try again. Still a big difference, but less.  Mat-vec does not
>> change much.
>>
>> On Sun, Jan 23, 2022 at 7:12 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>  You have debugging turned on on Crusher but not Perlmutter
>>>
>>> On Jan 23, 2022, at 6:37 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> * Perlmutter is roughly 5x faster than Crusher on the one node 2M eq
>>> test. (small)
>>> This is with 8 processes.
>>>
>>> * The next largest version of this test, 16M eq total and 8 processes,
>>> fails in memory allocation in the mat-mult setup in the Kokkos Mat.
>>>
>>> * If I try to run with 64 processes on Perlmutter I get this error in
>>> initialization. These nodes have 160 GB of memory.
>>> (I assume this is related to these large memory requirements from
>>> loading packages, etc....)
>>>
>>> Thanks,
>>> Mark
>>>
>>> + srun -n64 -N1 --cpu-bind=cores --ntasks-per-core=1 ../ex13
>>> -dm_plex_box_faces 4,4,4 -petscpartitioner_simple_process_grid 4,4,4
>>> -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
>>> -dm_refine 6 -dm_view -pc_type jacobi -log_view
>>> -ksp_view -use_gpu_aware_mpi false -dm_mat_type aijkokkos
>>> -dm_vec_type kokkos -log_trace
>>> + tee jac_out_001_kokkos_Perlmutter_6_8.txt
>>> [48]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> [48]PETSC ERROR: GPU error
>>> [48]PETSC ERROR: cuda error 2 (cudaErrorMemoryAllocation) : out of memory
>>> [48]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>> shooting.
>>> [48]PETSC ERROR: Petsc Development GIT revision: v3.16.3-683-gbc458ed4d8
>>>  GIT Date: 2022-01-22 12:18:02 -0600
>>> [48]PETSC ERROR: /global/u2/m/madams/petsc/src/snes/tests/data/../ex13
>>> on a arch-perlmutter-opt-gcc-kokkos-cuda named nid001424 by madams Sun Jan
>>> 23 15:19:56 2022
>>> [48]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
>>> -rdynamic -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4"
>>> --with-cc=cc --with-cxx=CC
>>> --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
>>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
>>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>>  --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
>>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
>>> --with-zlib=1 --download-kokkos --download-kokkos-kernels
>>> --with-kokkos-kernels-tpl=0 --with-make-np=8
>>> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>>> [48]PETSC ERROR: #1 initialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:72
>>> [48]PETSC ERROR: #2 initialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:343
>>> [48]PETSC ERROR: #3 PetscDeviceInitializeTypeFromOptions_Private() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:319
>>> [48]PETSC ERROR: #4 PetscDeviceInitializeFromOptions_Internal() at
>>> /global/u2/m/madams/petsc/src/sys/objects/device/interface/device.cxx:449
>>> [48]PETSC ERROR: #5 PetscInitialize_Common() at
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:963
>>> [48]PETSC ERROR: #6 PetscInitialize() at
>>> /global/u2/m/madams/petsc/src/sys/objects/pinit.c:1238
>>>
>>>
>>> On Sun, Jan 23, 2022 at 8:58 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Jan 22, 2022 at 6:22 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>    I cleaned up Mark's last run and put it in a fixed-width font. I
>>>>> realize this may be too difficult but it would be great to have identical
>>>>> runs to compare with on Summit.
>>>>>
>>>>
>>>> I was planning on running this on Perlmutter today, as well as some
>>>> sanity checks like all GPUs are being used. I'll try PetscDeviceView.
>>>>
>>>> Junchao modified the timers and all GPU > CPU now, but he seemed to
>>>> move the timers more outside and Barry wants them tight on the "kernel".
>>>> I think Junchao is going to work on that so I will hold off.
>>>> (I removed the Kokkos wait stuff and seemed to run a little faster
>>>> but I am not sure how deterministic the timers are, and I did a test with
>>>> GAMG and it was fine.)
>>>>
>>>>
>>>>>
>>>>>    As Jed noted Scatter takes a long time but the pack and unpack take
>>>>> no time? Is this not timed if using Kokkos?
>>>>>
>>>>>
>>>>> --- Event Stage 2: KSP Solve only
>>>>>
>>>>> MatMult              400 1.0 8.8003e+00 1.1 1.06e+11 1.0 2.2e+04 8.5e+04 0.0e+00  2 55 61 54  0  70 91100100   95,058   132,242      0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      400 1.0 1.3391e+00 2.6 0.00e+00 0.0 2.2e+04 8.5e+04 0.0e+00  0  0 61 54  0   7  0100100        0         0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        400 1.0 1.3240e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   9  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>>>>> SFPack               400 1.0 1.8276e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>>>>> SFUnpack             400 1.0 6.2653e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0        0         0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> KSPSolve               2 1.0 1.2540e+01 1.0 1.17e+11 1.0 2.2e+04 8.5e+04 1.2e+03  3 60 61 54 60 100100100      73,592   116,796      0 0.00e+00    0 0.00e+00 100
>>>>> VecTDot              802 1.0 1.3551e+00 1.2 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02  0  2  0  0 40  10  3  0      19,627    52,599      0 0.00e+00    0 0.00e+00 100
>>>>> VecNorm              402 1.0 9.0151e-01 2.2 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   5  1  0  0   14,788   125,477      0 0.00e+00    0 0.00e+00 100
>>>>> VecAXPY              800 1.0 8.2617e-01 1.0 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   7  3  0  0   32,112    61,644      0 0.00e+00    0 0.00e+00 100
>>>>> VecAYPX              398 1.0 8.1525e-01 1.6 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1  0  0   16,190    20,689      0 0.00e+00    0 0.00e+00 100
>>>>> VecPointwiseMult     402 1.0 3.5694e-01 1.0 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  1  0  0   18,675    38,633      0 0.00e+00    0 0.00e+00 100
>>>>>
>>>>>
>>>>>
>>>>> On Jan 22, 2022, at 12:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> And I have a new MR if you want to see what I've done so far.
>>>>>
>>>>>
>>>>> <jac_out_001_kokkos_Crusher_6_1_notpl.txt>
>>> <jac_out_001_kokkos_Perlmutter_6_1.txt>
>>>
>>>
>>> <jac_out_001_kokkos_Crusher_6_1_notpl.txt>
>>
>>
>>
>
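
As a rough sanity check on the message volume in the quoted "KSP Solve only" log, the MatMult row can be turned into totals with a few lines of Python. This is only a sketch: it assumes the 2.2e+04 and 8.5e+04 columns are the message count and the average message length in bytes, as in the `-log_view` legend, and it uses the rounded values printed in the log.

```python
# Values copied from the MatMult row of the quoted log (rounded as printed).
calls = 400              # number of MatMult calls
messages = 2.2e4         # assumed: total number of messages
avg_len_bytes = 8.5e4    # assumed: average message length in bytes

total_bytes = messages * avg_len_bytes
bytes_per_matmult = total_bytes / calls
messages_per_matmult = messages / calls

print(f"total message volume : {total_bytes:.3e} bytes")
print(f"volume per MatMult   : {bytes_per_matmult:.3e} bytes")
print(f"messages per MatMult : {messages_per_matmult:.0f}")
```

Under those assumptions this gives about 1.87e9 bytes in total, or roughly 4.7e6 bytes in 55 messages per MatMult, which is the scale to compare against whatever the host/device copy columns report.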