<div dir="ltr">This is a 3D fluid dynamics code. It uses Arakawa C-type grids and curvilinear coordinates for the nonhydrostatic Navier-Stokes equations, and we also add realistic stratification (temperature/density) and a subgrid-scale model for turbulence. What we are solving here is just a seamount with velocity forcing from one side, for only 5 pressure solves (iterations).<div><br></div><div>PETSc is used via the DMDAs to set up the grids and arrays and to do (almost) every calculation in a distributed manner; the pressure solver is implicit and carried out with the KSP module. I/O is still serial.</div><div><br></div><div>I am attaching the run outputs, named 60mNP.txt where NP is the number of processors used. These are large files; you can read the -log_view part with tail -n 140 [filename].</div><div><br></div><div>Thanks for your help,</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 22, 2019 at 3:40 PM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr"><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Mar 22, 2019 at 4:55 PM Manuel Valera <<a href="mailto:mvalera-w@sdsu.edu" target="_blank">mvalera-w@sdsu.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">No, it is the same problem run with different numbers of processors. I have data from 1 to 20 nodes, in increments of 20 processors (1 node), and additionally for 1 processor.</div>
</blockquote>
<div><br>
</div>
<div>That means you used strong scaling. Combining VecScatterBegin/End, at 20, 100, and 200 cores it took 2%, 13%, and 18% of the execution time, respectively. That looks very unscalable, and I do not know why. </div>
<div>VecScatterBegin took the same time with 100 and 200 cores. My explanation is that VecScatterBegin just packs data and then calls the non-blocking MPI_Isend, whereas VecScatterEnd has to wait for data to arrive.</div>
<div>Could you tell us more about your problem? For example, is it 2D or 3D, what is the communication pattern, and how many neighbors does each rank have? Also attach the whole -log_view log files so that we can understand the problem better.</div>
<div>Thanks.</div>
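<div>A minimal sketch of where the 2%, 13%, and 18% figures come from: summing the %T columns of VecScatterBegin and VecScatterEnd in the -log_view outputs quoted below (the percentages are copied from those logs).</div>

```python
# Combine the %T columns of VecScatterBegin and VecScatterEnd from the
# quoted -log_view runs to see the scatter share grow under strong scaling.
percent_T = {
    # cores: (VecScatterBegin %T, VecScatterEnd %T), from the logs below
    20:  (1, 1),
    100: (4, 9),
    200: (4, 14),
}

for cores, (begin, end) in sorted(percent_T.items()):
    print(f"{cores:4d} cores: VecScatterBegin+End = {begin + end}% of run time")
```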
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Mar 22, 2019 at 2:48 PM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">Did you change the problem size between runs?
<div><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Mar 22, 2019 at 4:09 PM Manuel Valera <<a href="mailto:mvalera-w@sdsu.edu" target="_blank">mvalera-w@sdsu.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I repeated the timings with the -log_sync option, and now I get, for 200 processors / 20 nodes:</div>
<div><br>
</div>
<div>
<div>------------------------------------------------------------------------------------------------------------------------</div>
<div>Event Count Time (sec) Flop --- Global --- --- Stage --- Total</div>
<div> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s</div>
<div>------------------------------------------------------------------------------------------------------------------------</div>
</div>
<div><br>
</div>
<div>
<div>VecScatterBarrie 3014 1.0 5.6771e+01 3.9 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 5 </b> 0 0 0 0 5 0 0 0 0 0</div>
<div>VecScatterBegin 3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 <b>4.2e+06 1.1e+06 2.8e+01 4</b> 0 63 56 0 4 0 63 56 0 0</div>
<div>VecScatterEnd 2976 1.0 1.1383e+02 1.8 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 14</b> 0 0 0 0 14 0 0 0 0 0</div>
</div>
<div><br>
</div>
<div>With 100 processors / 10 nodes:</div>
<div><br>
</div>
<div>
<div>VecScatterBarrie 3010 1.0 7.4430e+01 5.0 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 7</b> 0 0 0 0 7 0 0 0 0 0</div>
<div>VecScatterBegin 3010 1.0 3.8504e+01 2.4 0.00e+00 0.0 <b>1.6e+06 2.0e+06 2.8e+01 4 </b> 0 71 66 0 4 0 71 66 0 0</div>
<div>VecScatterEnd 2972 1.0 8.5158e+01 1.2 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 9 </b> 0 0 0 0 9 0 0 0 0 0</div>
</div>
<div><br>
</div>
<div>And with 20 processors / 1 node:</div>
<div><br>
</div>
<div>
<div>VecScatterBarrie 2596 1.0 4.0614e+01 7.3 0.00e+00 0.0<b> 0.0e+00 0.0e+00 0.0e+00 4 </b> 0 0 0 0 4 0 0 0 0 0</div>
<div>VecScatterBegin 2596 1.0 1.4970e+01 1.3 0.00e+00 0.0 <b>1.2e+05 4.0e+06 3.0e+01 1 </b> 0 81 61 0 1 0 81 61 0 0</div>
<div>VecScatterEnd 2558 1.0 1.4903e+01 1.3 0.00e+00 0.0<b> 0.0e+00 0.0e+00 0.0e+00 1 </b> 0 0 0 0 1 0 0 0 0 0</div>
</div>
<div><br>
</div>
<div>Can you help me interpret this? What I see is the End portion taking more relative time while Begin stays about the same beyond one node; also, the Barrier and Begin counts are the same every time. But how do I estimate communication times from here?</div>
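<div>One way to start is to pull the relevant columns out of the event lines above. A rough sketch, assuming the column layout shown in the quoted header (Count, ratio, Time max, ratio, Flop max, ratio, Mess, Avg len, Reduct, %T, ...); the layout may differ across PETSc versions, so treat it as illustrative only:</div>

```python
# Parse one -log_view event line (as quoted above) into named fields.
# Column positions are assumed from the quoted header; verify against
# your own PETSc version's output before relying on them.
def parse_event(line):
    parts = line.split()
    return {
        "event":      parts[0],
        "count":      int(parts[1]),
        "time_max":   float(parts[3]),  # max time over ranks (sec)
        "time_ratio": float(parts[4]),  # max/min ratio over ranks
        "messages":   float(parts[7]),  # Mess: total MPI messages
        "avg_len":    float(parts[8]),  # Avg len: bytes per message
    }

line = ("VecScatterBegin 3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 "
        "4.2e+06 1.1e+06 2.8e+01 4 0 63 56 0 4 0 63 56 0 0")
ev = parse_event(line)
print(ev["event"], ev["time_max"], ev["messages"])
```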
<div><br>
</div>
<div>Thanks, </div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 20, 2019 at 3:24 PM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div>Forgot to mention that long VecScatter times might also be due to local memory copies. If the communication pattern has a large local-to-local (self-to-self) scatter, which often happens thanks to locality, then the memory-copy time is counted in VecScatter. You can analyze your code's communication pattern to see if that is the case.</div>
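<div>As a toy illustration of that point (the byte count and bandwidth below are assumed for the example, not measured), a purely local scatter costs a memcpy whose time still lands in the VecScatter events:</div>

```python
# Toy estimate with assumed numbers: local (self-to-self) scatter data
# is moved with memcpy, and that time is charged to VecScatter.
local_bytes = 1.0e9       # assumed bytes copied locally per scatter
mem_bandwidth = 5e9       # assumed effective memcpy bandwidth (bytes/s)
scatters = 3014           # VecScatter count from the logs above

t_per_scatter = local_bytes / mem_bandwidth
print(f"~{t_per_scatter*1e3:.0f} ms of memcpy per scatter, "
      f"~{t_per_scatter*scatters:.0f} s over the run")
```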
<br clear="all">
<div>
<div dir="ltr" class="gmail-m_-175149887065910466m_-4272741259341272259gmail-m_-5984098628761319167gmail-m_8316174214426870688m_-3889203236599324760gmail-m_6425353157528491198gmail-m_2253007601058659192gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera <<a href="mailto:mvalera-w@sdsu.edu" target="_blank">mvalera-w@sdsu.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">Thanks for your answer. So, for example, I have a log for 200 cores across 10 nodes that reads:
<div><br>
</div>
<div>
<div>------------------------------------------------------------------------------------------------------------------------</div>
<div>Event Count Time (sec) Flop --- Global --- --- Stage --- Total</div>
<div> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s</div>
<div>----------------------------------------------------------------------------------------------------------------------</div>
</div>
<div>VecScatterBegin 3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 <b>4.2e+06 1.1e+06 2.8e+01 4</b> 0 63 56 0 4 0 63 56 0 0<br>
</div>
<div>
<div>VecScatterEnd 2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 14</b> 0 0 0 0 14 0 0 0 0 0</div>
</div>
<div><br>
</div>
<div>While for 20 cores on one node I have:</div>
</div>
</div>
</div>
</div>
</blockquote>
<div> What does that mean?</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div></div>
<div>
<div>VecScatterBegin 2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 <b>1.2e+05 4.0e+06 3.0e+01 2</b> 0 81 61 0 2 0 81 61 0 0</div>
<div>VecScatterEnd 2558 1.0 8.0344e+01 7.9 0.00e+00 0.0 <b>0.0e+00 0.0e+00 0.0e+00 3</b> 0 0 0 0 3 0 0 0 0 0</div>
</div>
<div><br>
</div>
<div>Where do I see the max/min ratio here? And why is the End step all 0.0e+00 in both, yet it still grows from 3% to 14% of total time? It seems I would need to run again with the -log_sync option; is this correct?</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div>The max/min ratios are there: e.g., 2.1 and 7.9. The MPI send/recv calls are in VecScatterBegin(); VecScatterEnd() only does MPI_Wait, which is why it shows zero messages. Yes, run with -log_sync and see what happens.</div>
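<div>To make the ratio concrete, a quick sketch using the 20-core VecScatterEnd numbers quoted above (max time 8.0344e+01 s, ratio 7.9): the fastest rank spent roughly max/ratio seconds in the event, so a big ratio means some ranks wait far longer than others.</div>

```python
# Recover the fastest rank's time from the Max and max/min Ratio columns
# of the 20-core VecScatterEnd line quoted above.
t_max = 8.0344e+01   # max VecScatterEnd time over ranks (sec)
ratio = 7.9          # max/min ratio from the log
t_min = t_max / ratio
print(f"fastest rank: ~{t_min:.1f} s, slowest rank: {t_max:.1f} s")
```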
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div></div>
<div>A different question: couldn't I estimate the total communication time as a typical communication time per MPI message times the number of MPI messages reported in the log? Or does it not work like that?</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div>That probably would not work, because you have multiple processes doing send/recv at the same time, and they might saturate the bandwidth. PETSc also overlaps computation and communication.</div>
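<div>For a sense of scale only (with Junchao's caveats above: overlapping and bandwidth saturation make such an estimate unreliable), one can multiply the Mess and Avg len columns from the 200-core log; the 10 GB/s bandwidth below is an assumed figure, not a measurement:</div>

```python
# Naive back-of-the-envelope: total VecScatterBegin traffic from the
# 200-core log quoted above, and the time to move it fully serialized
# through one assumed 10 GB/s link. Real runs move data in parallel and
# overlap it with computation, so this is only an upper-bound sketch.
messages  = 4.2e6    # Mess column
avg_len   = 1.1e6    # Avg len column (bytes)
bandwidth = 10e9     # bytes/sec -- assumed, not measured

total_bytes = messages * avg_len
t_serial = total_bytes / bandwidth
print(f"total traffic: {total_bytes:.2e} bytes; "
      f"serialized at 10 GB/s: {t_serial:.0f} s")
```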
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div></div>
<div>Thanks.</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>See the "Mess AvgLen Reduct" numbers in each log stage. Mess is the total number of messages sent in an event over all processes. AvgLen is the average message length. Reduct is the number of global reductions.</div>
<div>Each event like VecScatterBegin/End has a maximal execution time over all processes, and a max/min ratio. %T is sum(execution time of the event on each process)/sum(execution time of the stage on each process); it indicates how expensive the event is and is a number you should pay attention to.<br>
</div>
<div>If your code is imbalanced (i.e., has a big max/min ratio), then the performance numbers are skewed and become misleading, because some processes are just waiting for others. In that case, besides -log_view, you can add -log_sync, which adds an extra MPI_Barrier for each event so that all processes start it at the same time. With that, the numbers are easier to interpret.<br>
</div>
<div>src/vec/vscat/examples/ex4.c is a tiny example for VecScatter logging. <br>
</div>
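<div>A toy sketch of those definitions, with made-up per-rank timings (not from any real run), showing how the Max, max/min Ratio, and %T columns are derived:</div>

```python
# Illustrate -log_view's Max, Ratio, and %T columns for one event within
# one stage, using hypothetical per-rank times on 4 ranks.
event_times = [2.0, 2.5, 4.0, 8.0]     # assumed event seconds per rank
stage_times = [40.0, 40.0, 40.0, 40.0] # assumed stage seconds per rank

ev_max = max(event_times)                           # Max column
ratio  = ev_max / min(event_times)                  # max/min Ratio column
pct_T  = 100.0 * sum(event_times) / sum(stage_times)  # %T column
print(f"Max {ev_max} s, Ratio {ratio}, %T {pct_T:.1f}")
```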
<div><br>
</div>
<div>--Junchao Zhang<br>
</div>
<br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I am working on timing my model, which we made MPI-scalable using PETSc DMDAs. I want to know more about the output log and how to calculate a total communication time for my runs. So far I see we have "MPI Messages" and "MPI Message Lengths" in the log, along with the VecScatterEnd and VecScatterBegin reports. </div>
<div><br>
</div>
<div>My question is: how do I interpret these numbers to get a rough estimate of how much overhead we have just from MPI communication time in my model runs?</div>
<div><br>
</div>
<div>Thanks,</div>
<div><br>
</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote></div>