[petsc-dev] Kokkos/Crusher performance
Barry Smith
bsmith at petsc.dev
Fri Jan 21 20:55:21 CST 2022
Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels use ROCm?
I ask because the GPU flop rate for VecNorm is about 4 times that of VecTDot, which I would not expect, and VecAXPY runs at less than 1/4 the rate of VecAYPX, which I also would not expect (the ratios are worked out below the log excerpt).
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
MatMult              400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 54  0  0  0  43 91  0  0  0  98964       0      0 0.00e+00    0 0.00e+00 100
MatView                2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
KSPSolve               2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00  1 60  0  0  0 100 100 0  0  0  46887 220,001      0 0.00e+00    0 0.00e+00 100
VecTDot              802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  20  3  0  0  0   6882  15,426      0 0.00e+00    0 0.00e+00 100
VecNorm              402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1  0  0  0  14281  62,757      0 0.00e+00    0 0.00e+00 100
VecCopy                4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
VecSet                 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
VecAXPY              800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  23  3  0  0  0   5880  14,666      0 0.00e+00    0 0.00e+00 100
VecAYPX              398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   4  1  0  0  0  15284  71,218      0 0.00e+00    0 0.00e+00 100
VecPointwiseMult     402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7534  33,579      0 0.00e+00    0 0.00e+00 100
PCApply              402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7527  33,579      0 0.00e+00    0 0.00e+00 100
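
For reference, the two ratios in question, read off the GPU Mflop/s column above:

    VecNorm vs VecTDot:  62,757 / 15,426 ≈ 4.1
    VecAYPX vs VecAXPY:  71,218 / 14,666 ≈ 4.9

VecNorm reads one vector where VecTDot reads two, so roughly a factor of 2 from memory traffic alone would be understandable, but not 4. VecAXPY (y = a*x + y) and VecAYPX (y = x + b*y) read and write exactly the same amount of data per call, so their rates should be nearly identical.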
> On Jan 21, 2022, at 9:46 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
> But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance there a great deal. It would be good to see a single MPI rank job for comparison, to see the performance without the MPI overhead.
>
> Here are two single-processor runs, each with a whole GPU. It's not clear whether --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).
>
> <jac_out_001_kokkos_Crusher_3_1_gpuawarempi.txt><jac_out_001_kokkos_Crusher_4_1_gpuawarempi.txt>
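
On the --ntasks-per-gpu question above: Slurm on Crusher exposes each of the 8 GCDs as a separate GPU, so that option should be counting GCDs (8 per node), not the 4 physical MI250X packages. A minimal single-rank launch for this kind of comparison, with a placeholder executable name, would look something like:

    # 1 node, 1 MPI rank, 1 GCD, bound to the closest device
    srun -N 1 -n 1 --gpus-per-node=1 --gpu-bind=closest ./app -log_view

The exact binding options may differ depending on the batch script, so treat this as a sketch rather than the command that was actually used for the attached runs.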