<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div><div class="">Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels use ROCm libraries underneath?</div><div class=""><br class=""></div><div class="">I ask because the VecNorm GPU flop rate is 4 times that of VecTDot, which I would not expect, and VecAXPY runs at less than 1/4 the rate of VecAYPX, which I also would not expect.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">MatMult 400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 0 54 0 0 0 43 91 0 0 0 98964 0 0 0.00e+00 0 0.00e+00 100</div><div class="">MatView 2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">KSPSolve 2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00 1 60 0 0 0 100100 0 0 0 46887 220,001 0 0.00e+00 0 0.00e+00 100</div><div class="">VecTDot 802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 20 3 0 0 0 6882 15,426 0 0.00e+00 0 0.00e+00 100</div><div class="">VecNorm 402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 5 1 0 0 0 14281 62,757 0 0.00e+00 0 0.00e+00 100</div><div class="">VecCopy 4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">VecSet 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">VecAXPY 800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 23 3 0 0 0 5880 14,666 0 0.00e+00 0 0.00e+00 100</div><div class="">VecAYPX 398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 4 1 0 0 0 15284 71,218 0 0.00e+00 0 0.00e+00 100</div><div class="">VecPointwiseMult 402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 
7534 33,579 0 0.00e+00 0 0.00e+00 100</div><div class="">PCApply 402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 7527 33,579 0 0.00e+00 0 0.00e+00 100</div><div class=""><br class=""></div><div class=""><br class=""></div><div><br class=""><blockquote type="cite" class=""><div class="">On Jan 21, 2022, at 9:46 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" class="">mfadams@lbl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class=""><br class=""></div><div class=""> But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance there a great deal. It would be good to see a single-MPI-rank job to compare against, to see the performance without the MPI overhead.</div></div></blockquote><div class=""><br class=""></div><div class="">Here are two single-processor runs, each with a whole GPU. It's not clear if --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).</div><div class=""> </div></div></div>
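For what it's worth, the two surprising ratios can be read straight off the GPU Mflop/s column of the log above; a quick sanity check (plain Python, with the values transcribed from the table):

```python
# GPU Mflop/s values copied from the -log_view table above
rates = {
    "VecTDot": 15426,
    "VecNorm": 62757,
    "VecAXPY": 14666,
    "VecAYPX": 71218,
}

norm_vs_dot = rates["VecNorm"] / rates["VecTDot"]    # ratio of the two reduction kernels
axpy_vs_aypx = rates["VecAXPY"] / rates["VecAYPX"]   # ratio of the two update kernels

print(f"VecNorm / VecTDot  = {norm_vs_dot:.2f}")     # ~4.07
print(f"VecAXPY / VecAYPX = {axpy_vs_aypx:.2f}")     # ~0.21
```

Note that VecAXPY and VecAYPX move exactly the same data per call (read two vectors, write one), so a roughly 5x gap between them is hard to explain by memory bandwidth alone, which is why the single-rank comparison is interesting.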
<span id="cid:f_kyp89ufi1"><jac_out_001_kokkos_Crusher_3_1_gpuawarempi.txt></span><span id="cid:f_kyp89uff0"><jac_out_001_kokkos_Crusher_4_1_gpuawarempi.txt></span></div></blockquote></div><br class=""></body></html>