<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 9:00 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 5:00 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div> The GPU flop rate (when 100 percent flops on the GPU) should always be higher than the overall flop rate (the previous column). For large problems they should be similar, for small problems the GPU one may be much higher.<div><br></div><div> If the CPU one is higher (when 100 percent flops on the GPU) something must be wrong with the logging. I looked at the code for the two cases and didn't see anything obvious.</div><div><br></div><div> Junchao and Jacob,</div><div> I think some of the timing code in the Kokkos interface is wrong. </div><div><br></div><div> * The PetscLogGpuTimeBegin/End should be inside the viewer access code not outside it. (The GPU time is an attempt to best time the kernels, not other processing around the use of the kernels, that other stuff is captured in the general LogEventBegin/End.</div></div></blockquote></div></div></blockquote><div>What about potential host to device memory copy before calling a kernel? Should we count it in the kernel time?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>Good point </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div> * The use of WaitForKokkos() is confusing and seems inconsistent. </div></div></blockquote><div>I need to have a look. Until now, I have not paid much attention to kokkos profiling.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div> -For example it is used in VecTDot_SeqKokkos() which I would think has a barrier anyways because it puts a scalar result into update? </div><div> -Plus PetscLogGpuTimeBegin/End is suppose to already have suitable system (that Hong added) to ensure the kernel is complete; reading the manual page and looking at Jacobs cupmcontext.hpp it seems to be there so I don't think WaitForKokkos() is needed in most places (or is Kokkos asynchronous and needs this for correctness?) </div><div>But these won't explain the strange result of overall flop rate being higher than GPU flop rate.</div><div><br></div><div> Barry</div><div><br></div><div><br></div><div><br></div><div><br><div><br><blockquote type="cite"><div>On Jan 22, 2022, at 11:44 AM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr">I am getting some funny timings and I'm trying to figure it out. <div>I figure the gPU flop rates are bit higher because the timers are inside of the CPU timers, but <b>some are a lot bigger or inverted</b> <div><font face="monospace"><br></font></div><div><font face="monospace">--- Event Stage 2: KSP Solve only<br><br>MatMult 400 1.0 1.0094e+01 1.2 1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00 2 55 62 54 0 68 91100100 0 671849 857147 0 0.00e+00 0 0.00e+00 100<br>MatView 2 1.0 4.5257e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>KSPSolve 2 1.0 1.4591e+01 1.1 1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03 2 60 62 54 60 100100100100100 512399 804048 0 0.00e+00 0 0.00e+00 100<br>SFPack 400 1.0 2.4545e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>SFUnpack 400 1.0 9.4637e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecTDot 802 1.0 3.0577e+00 2.1 3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02 0 2 0 0 40 13 3 0 0 67 <b>69996 488328</b> 0 0.00e+00 0 0.00e+00 100<br>VecNorm 402 1.0 1.9597e+00 3.4 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 6 1 0 0 33 54744 571507 0 0.00e+00 0 0.00e+00 100<br>VecCopy 4 1.0 1.7143e-0228.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecSet 4 1.0 3.8051e-0316.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecAXPY 800 1.0 8.6160e-0113.6 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 6 3 0 0 0 <b>247787 448304</b> 0 0.00e+00 0 0.00e+00 100<br>VecAYPX 398 1.0 1.6831e+0031.1 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 5 1 0 0 0 63107 77030 0 0.00e+00 0 0.00e+00 100<br>VecPointwiseMult 402 1.0 3.8729e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 138502 262413 0 0.00e+00 0 0.00e+00 100<br>VecScatterBegin 400 1.0 1.1947e+0035.1 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00 0 0 62 54 0 5 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0<br>VecScatterEnd 400 1.0 6.2969e+00 8.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br>PCApply 402 1.0 3.8758e-01 9.3 8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 2 1 0 0 0 138396 262413 0 0.00e+00 0 0.00e+00 100<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br></font></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End() right?</div></blockquote><div>No, PetscLogGpuTime() does not know the flops of the caller.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div> Mark,</div><div><br></div> Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops.<div><br></div><div><br><div><blockquote type="cite"><div>On Jan 21, 2022, at 9:37 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr">Here is one with 2M / GPU. Getting better.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div> Matt is correct, vectors are way too small.<div><br></div><div> BTW: Now would be a good time to run some of the Report I benchmarks on Crusher to get a feel for the kernel launch times and performance on VecOps.</div><div><br></div><div> Also Report 2.</div><div><br></div><div> Barry</div><div><br><div><br><blockquote type="cite"><div>On Jan 21, 2022, at 7:58 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr">On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I am looking at performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).<div>This is with a 16M equation problem. GPU-aware MPI and non GPU-aware MPI are similar (mat-vec is a little faster w/o, the total is about the same, call it noise)<br><div><br></div><div>I found that MatMult was about 3x faster using 8 cores/GPU, that is all 64 cores on the node, then when using 1 core/GPU. With the same size problem of course.</div><div>I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.</div><div><br></div><div>The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive.</div>You can see (attached) and the times here that the solve is dominated by not-mat-vec:</div><div><br><div><span style="font-family:monospace">------------------------------------------------------------------------------------------------------------------------</span><br></div><div><font face="monospace">Event Count Time (sec) Flop --- Global --- --- Stage ---- <b>Total GPU </b> - CpuToGpu - - GpuToCpu - GPU<br> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R <b>Mflop/s Mflop/s</b> Count Size Count Size %F<br>---------------------------------------------------------------------------------------------------------------------------------------------------------------<br></font></div><div><font face="monospace">17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*<br>MatMult 400 1.0 <b>1.2507e+00</b> 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00 1 55 62 54 0 27 91100100 0 <b>668874 0</b> 0 0.00e+00 0 0.00e+00 100<br>17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*<br>KSPSolve 2 1.0 <b>4.4173e+00</b> 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03 4 60 62 54 61 100100100100100 <b>208923 1094405</b> 0 0.00e+00 0 0.00e+00 100</font><br></div></div><div><font face="monospace"><br></font></div>Notes about flop counters here, <div>* that MatMult flops are not logged as GPU flops but something is logged nonetheless.<div>* The GPU flop rate is 5x the total flop rate in KSPSolve :\<br><div>* I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.</div></div></div></div></blockquote><div><br></div><div>This looks complicated, so just a single remark:</div><div><br></div><div>My understanding of the benchmarking of vector ops led by Hannah was that you needed to be much</div><div>bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I would think you would be</div><div>at 10% of peak or something right off the bat at these sizes. Barry, is that right?</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div>Anway, not sure how to proceed but I thought I would share.</div><div>Maybe ask the Kokkos guys if the have looked at Crusher.</div><div><br></div><div>Mark</div></div></div></div></blockquote></div>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>
</div></blockquote></div><br></div></div></blockquote></div>
<span id="gmail-m_7178493797419230199gmail-m_-5850858973953305955gmail-m_-8561488623817931590gmail-m_-9217502836458641567gmail-m_-1042935854083030742cid:f_kyp816vp0"><jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt></span></div></blockquote></div><br></div></div></blockquote></div>
</blockquote></div></div>
</blockquote></div>
<span id="gmail-m_7178493797419230199gmail-m_-5850858973953305955cid:f_kyq28fj80"><jac_out_001_kokkos_Crusher_5_8_notpl.txt></span><span id="gmail-m_7178493797419230199gmail-m_-5850858973953305955cid:f_kyq28fji1"><jac_out_001_kokkos_Crusher_6_8_notpl.txt></span></div></blockquote></div><br></div></div></blockquote></div></div>
</blockquote></div></div>