[petsc-dev] Kokkos/Crusher performance
Barry Smith
bsmith at petsc.dev
Fri Jan 21 20:55:21 CST 2022
Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels use ROCm?
I ask because the GPU flop rate for VecNorm is about 4 times that of VecTDot, which I would not expect, and VecAXPY runs at less than 1/4 the rate of VecAYPX, which I also would not expect (the ratios are worked out below the log excerpt).
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
MatMult              400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 54  0  0  0  43 91  0  0  0  98964       0      0 0.00e+00    0 0.00e+00 100
MatView                2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
KSPSolve               2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00  1 60  0  0  0 100 100 0  0  0  46887 220,001      0 0.00e+00    0 0.00e+00 100
VecTDot              802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  20  3  0  0  0   6882  15,426      0 0.00e+00    0 0.00e+00 100
VecNorm              402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1  0  0  0  14281  62,757      0 0.00e+00    0 0.00e+00 100
VecCopy                4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
VecSet                 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
VecAXPY              800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0  23  3  0  0  0   5880  14,666      0 0.00e+00    0 0.00e+00 100
VecAYPX              398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   4  1  0  0  0  15284  71,218      0 0.00e+00    0 0.00e+00 100
VecPointwiseMult     402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7534  33,579      0 0.00e+00    0 0.00e+00 100
PCApply              402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  1  0  0  0   7527  33,579      0 0.00e+00    0 0.00e+00 100
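
For reference, the two ratios in question, read off the GPU Mflop/s column above:

    VecNorm vs VecTDot:  62,757 / 15,426 ≈ 4.1
    VecAYPX vs VecAXPY:  71,218 / 14,666 ≈ 4.9

VecNorm reads one vector where VecTDot reads two, so roughly a factor of 2 from memory traffic alone would be understandable, but not 4. VecAXPY (y = a*x + y) and VecAYPX (y = x + b*y) read and write exactly the same amount of data per call, so their rates should be nearly identical.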
> On Jan 21, 2022, at 9:46 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
> But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance there a great deal. It would be good to see a single MPI rank job for comparison, to see the performance without the MPI overhead.
>
> Here are two single-processor runs, each with a whole GPU. It's not clear whether --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).
>
> <jac_out_001_kokkos_Crusher_3_1_gpuawarempi.txt><jac_out_001_kokkos_Crusher_4_1_gpuawarempi.txt>
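
On the --ntasks-per-gpu question above: Slurm on Crusher exposes each of the 8 GCDs as a separate GPU, so that option should be counting GCDs (8 per node), not the 4 physical MI250X packages. A minimal single-rank launch for this kind of comparison, with a placeholder executable name, would look something like:

    # 1 node, 1 MPI rank, 1 GCD, bound to the closest device
    srun -N 1 -n 1 --gpus-per-node=1 --gpu-bind=closest ./app -log_view

The exact binding options may differ depending on the batch script, so treat this as a sketch rather than the command that was actually used for the attached runs.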