<div dir="ltr">I should be able to add this profiling now.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 10:48 PM Junchao Zhang &lt;<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 8:08 PM Barry Smith &lt;<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div>  Junchao, Mark,<div><br></div><div>     Some of the logging information is nonsensical: MatMult says all flops are done on the GPU (last column), but the GPU flop rate is zero. <br><div><br></div><div>     It looks like MatMult_SeqAIJKokkos() is missing PetscLogGpuTimeBegin()/End(); in fact, all the operations in aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0 GPU flop rate. Can this be fixed ASAP?</div></div></div></blockquote><div>I will add this profiling temporarily.  I may use Kokkos' own profiling APIs later.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div><br></div><div>     Regarding VecOps, it sure looks like the kernel launches are killing performance. </div><div><br></div><div>           But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance a great deal in there. 
It would be good to see a single-MPI-rank job for comparison, to see performance without the MPI overhead.</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br><div><br><blockquote type="cite"><div>On Jan 21, 2022, at 6:41 PM, Mark Adams &lt;<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>&gt; wrote:</div><br><div><div dir="ltr">I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).<div>This is with a 16M-equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster w/o, the total is about the same, call it noise)<br><div><br></div><div>I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course.</div><div>I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.</div><div><br></div><div>The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive.</div>You can see from the attached log and the times here that the solve is dominated by operations other than mat-vec:</div><div><br><div><span style="font-family:monospace">------------------------------------------------------------------------------------------------------------------------</span><br></div><div><font face="monospace">Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  <b>Total   GPU </b>   - CpuToGpu -   - GpuToCpu - GPU<br>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R <b>Mflop/s Mflop/s</b> Count   Size   Count   Size  
%F<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br></font></div><div><font face="monospace">17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult              400" jac_out_00*5_8_gpuawaremp*<br>MatMult              400 1.0 <b>1.2507e+00</b> 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91100100  0 <b>668874       0</b>      0 0.00e+00    0 0.00e+00 100<br>17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve               2" jac_out_001*_5_8_gpuawaremp*<br>KSPSolve               2 1.0 <b>4.4173e+00</b> 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100100100100100 <b>208923   1094405</b>      0 0.00e+00    0 0.00e+00 100</font><br></div></div><div><font face="monospace"><br></font></div>Notes about the flop counters here: <div>* MatMult flops are not logged as GPU flops, but something is logged nonetheless.<div>* The GPU flop rate is 5x the total flop rate in KSPSolve :\<br><div>* I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at &lt; 1%.</div><div><br></div><div>Anyway, I am not sure how to proceed, but I thought I would share.</div><div>Maybe ask the Kokkos guys if they have looked at Crusher.</div><div><br></div><div>Mark</div><div><br><div><font face="monospace"><br></font></div></div></div></div></div>

<span id="gmail-m_-8994623149461994968gmail-m_-5351277351756418520cid:f_kyoyxfo70">&lt;jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt&gt;</span></div></blockquote></div><br></div></div></div></blockquote></div></div>

</blockquote></div>