<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div><div class="">Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels use ROCm libraries underneath?</div><div class=""><br class=""></div><div class="">I ask because the VecNorm GPU flop rate is 4 times that of VecTDot, which I would not expect, and VecAXPY runs at less than 1/4 the rate of VecAYPX, which I also would not expect.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">MatMult 400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 0 54 0 0 0 43 91 0 0 0 98964 0 0 0.00e+00 0 0.00e+00 100</div><div class="">MatView 2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">KSPSolve 2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00 1 60 0 0 0 100100 0 0 0 46887 220,001 0 0.00e+00 0 0.00e+00 100</div><div class="">VecTDot 802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 20 3 0 0 0 6882 15,426 0 0.00e+00 0 0.00e+00 100</div><div class="">VecNorm 402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 5 1 0 0 0 14281 62,757 0 0.00e+00 0 0.00e+00 100</div><div class="">VecCopy 4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">VecSet 4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div><div class="">VecAXPY 800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 23 3 0 0 0 5880 14,666 0 0.00e+00 0 0.00e+00 100</div><div class="">VecAYPX 398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 4 1 0 0 0 15284 71,218 0 0.00e+00 0 0.00e+00 100</div><div class="">VecPointwiseMult 402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 
7534 33,579 0 0.00e+00 0 0.00e+00 100</div><div class="">PCApply 402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 7527 33,579 0 0.00e+00 0 0.00e+00 100</div><div class=""><br class=""></div><div class=""><br class=""></div><div><br class=""><blockquote type="cite" class=""><div class="">On Jan 21, 2022, at 9:46 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" class="">mfadams@lbl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class=""><br class=""></div><div class=""> But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance there a great deal. It would be good to see a single-MPI-rank job to compare against, to see the performance without the MPI overhead.</div></div></blockquote><div class=""><br class=""></div><div class="">Here are two single-processor runs, each with a whole GPU. It's not clear if --ntasks-per-gpu=1 refers to the GPU sockets (4 of them) or the GPUs (8).</div><div class=""> </div></div></div>
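For what it's worth, the two surprising ratios can be read straight off the GPU Mflop/s column of the log above; a quick sanity check (plain Python, with the values transcribed from the table):

```python
# GPU Mflop/s values copied from the -log_view table above
rates = {
    "VecTDot": 15426,
    "VecNorm": 62757,
    "VecAXPY": 14666,
    "VecAYPX": 71218,
}

norm_vs_dot = rates["VecNorm"] / rates["VecTDot"]    # ratio of the two reduction kernels
axpy_vs_aypx = rates["VecAXPY"] / rates["VecAYPX"]   # ratio of the two update kernels

print(f"VecNorm / VecTDot  = {norm_vs_dot:.2f}")     # ~4.07
print(f"VecAXPY / VecAYPX = {axpy_vs_aypx:.2f}")     # ~0.21
```

Note that VecAXPY and VecAYPX move exactly the same data per call (read two vectors, write one), so a roughly 5x gap between them is hard to explain by memory bandwidth alone, which is why the single-rank comparison is interesting.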
<span id="cid:f_kyp89ufi1"><jac_out_001_kokkos_Crusher_3_1_gpuawarempi.txt></span><span id="cid:f_kyp89uff0"><jac_out_001_kokkos_Crusher_4_1_gpuawarempi.txt></span></div></blockquote></div><br class=""></body></html>