<div dir="ltr">I am getting some funny timings and I'm trying to figure it out. <div>I figure the gPU flop rates are bit higher because the timers are inside of the CPU timers, but <b>some are a lot bigger or inverted</b> <div><font face="monospace"><br></font></div><div><font face="monospace">--- Event Stage 2: KSP Solve only<br><br>MatMult 400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00 2 55 62 54 0 68 91100100 0 671849 857147 0 0.00e+00 0 0.00e+00 100<br>MatView 2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>KSPSolve 2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03 2 60 62 54 60 100100100100100 512399 804048 0 0.00e+00 0 0.00e+00 100<br>SFPack 400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecTDot 802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02 0 2 0 0 40 13 3 0 0 67 <b>69996 488328</b> 0 0.00e+00 0 0.00e+00 100<br>VecNorm 402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 6 1 0 0 33 54744 571507 0 0.00e+00 0 0.00e+00 100<br>VecCopy 4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecAXPY 800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 6 3 0 0 0 <b>247787 448304</b> 0 0.00e+00 0 0.00e+00 100<br>VecAYPX 398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 5 1 0 0 0 63107 77030 0 0.00e+00 0 0.00e+00 100<br>VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 138502 262413 0 0.00e+00 0 0.00e+00 100<br>VecScatterBegin 400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00 0 0 62 54 0 5 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecScatterEnd 400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>PCApply 402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 138396 262413 0 0.00e+00 0 0.00e+00 100<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br></font></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End() right?</div></blockquote><div>No, PetscLogGpuTime() does not know the flops of the caller.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 9:47 PM 
On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsmith@petsc.dev> wrote:

Mark,

Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops.

On Jan 21, 2022, at 9:37 PM, Mark Adams <mfadams@lbl.gov> wrote:

Here is one with 2M equations / GPU. Getting better.

On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsmith@petsc.dev> wrote:

Matt is correct; the vectors are way too small.

BTW: Now would be a good time to run some of the Report I benchmarks on Crusher to get a feel for the kernel launch times and performance on VecOps. Also Report 2.

Barry
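The kind of measurement Barry is suggesting can be sketched as below (a rough sketch only, not the actual Report I benchmark code; the default size, iteration count, and options are illustrative). At small n the time per iteration is dominated by kernel-launch latency rather than bandwidth:

    #include <petscvec.h>

    /* Time a burst of VecAXPYs to estimate per-launch overhead.
       Run with e.g. -vec_type kokkos -n 1000 on a GPU build. */
    int main(int argc, char **argv)
    {
      Vec            x, y;
      PetscInt       n = 1000, i, its = 1000;
      PetscReal      nrm;
      PetscLogDouble t0, t1;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
      ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
      ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
      ierr = VecSetFromOptions(x);CHKERRQ(ierr);
      ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
      ierr = VecSet(x, 1.0);CHKERRQ(ierr);
      ierr = VecSet(y, 2.0);CHKERRQ(ierr);
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);        /* warm-up: first launch pays extra cost */
      ierr = PetscTime(&t0);CHKERRQ(ierr);
      for (i = 0; i < its; i++) {
        ierr = VecAXPY(y, 1.0e-6, x);CHKERRQ(ierr);
      }
      ierr = VecNorm(y, NORM_2, &nrm);CHKERRQ(ierr);  /* forces completion of queued kernels */
      ierr = PetscTime(&t1);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "n=%d: %g us per VecAXPY (norm %g)\n",
                         (int)n, 1e6*(t1 - t0)/its, (double)nrm);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = VecDestroy(&y);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }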
On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knepley@gmail.com> wrote:

On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfadams@lbl.gov> wrote:

I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X, or is it MI200?). This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it, the total is about the same; call it noise).

I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course. I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.

The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU solver on the V100s/A100s, is that the vector operations are expensive or crazy expensive. You can see from the attached output, and from the times here, that the solve is dominated by everything but the mat-vec:

------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
 Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
MatMult 400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00 1 55 62 54 0 27 91100100 0 668874 0 0 0.00e+00 0 0.00e+00 100
17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
KSPSolve 2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03 4 60 62 54 61 100100100100100 208923 1094405 0 0.00e+00 0 0.00e+00 100

Notes about the flop counters here:
* MatMult flops are not logged as GPU flops, but something is logged nonetheless.
* The GPU flop rate is 5x the total flop rate in KSPSolve. :\
* I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1% of peak.

Anyway, not sure how to proceed, but I thought I would share. Maybe ask the Kokkos guys if they have looked at Crusher.

Mark

This looks complicated, so just a single remark: my understanding of the benchmarking of vector ops led by Hannah was that you needed to be much bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I would think you would be at 10% of peak or something right off the bat at these sizes. Barry, is that right?

Thanks,

  Matt

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
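For concreteness, the "KSP Solve only" stage in the logs corresponds to wrapping the solve in a registered log stage, roughly like the sketch below (a reconstruction assuming A, b, and x already exist, not the actual ex13 driver):

    #include <petscksp.h>

    /* Sketch: a CG/Jacobi solve logged in its own stage, which appears
       in -log_view output as e.g. "Event Stage 2: KSP Solve only". */
    static PetscErrorCode SolveAndLog(Mat A, Vec b, Vec x)
    {
      KSP            ksp;
      PC             pc;
      PetscLogStage  stage;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
      ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
      ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

      ierr = PetscLogStageRegister("KSP Solve only", &stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
      ierr = PetscLogStagePop();CHKERRQ(ierr);

      ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }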
<span id="gmail-m_-8561488623817931590gmail-m_-9217502836458641567gmail-m_-1042935854083030742cid:f_kyp816vp0"><jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt></span></div></blockquote></div><br></div></div></blockquote></div>