<div dir="ltr"><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Hi, PETSc developers,</span></div>  I tested Michael Becker's code. The code calls the same KSPSolve 1000 times in the second stage and needs cubic number of processors to run. I ran with 125 ranks and 1000 ranks, with or without -log_sync option. I attach the log view output files and a scaling loss excel file.<div><div>  I profiled the code with 125 processors. It looks {MatSOR, MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took ~50% of the time,  The other half time was spent on waiting in MPI.  <span style="text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><span style="background-color:rgb(255,255,255)">MatSOR_SeqAIJ took 30%, mostly in </span>PetscSparseDenseMinusDot().</span></div><div><span style="text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">  I tested it on a 36 cores/node machine. I found 32 ranks/node gave better performance (about 10%) than 36 ranks/node in the 125 ranks testing. </span> I guess it is because processors in the former had more balanced memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files). It looks ranks at the two ends have less DP_OPS and TOT_CYC. </div><div><span style="text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">  Does anyone familiar with the algorithm have quick explanations?</span></div><div><div><br></div><div class="gmail_extra"><div><div class="m_-4928476107018496294m_6474898020160081090gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div>

<br><div class="gmail_quote">On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-gie<wbr>ssen.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF">

    <p>Hello again,</p>

    this took me longer than I anticipated, but here we go.<br>

    I did reruns of the cases where only half the processes per node

    were used (without -log_sync):<br>

    <br>

    <font face="Courier New, Courier, monospace">                    

      125 procs,1st           125 procs,2nd          1000

      procs,1st          1000 procs,2nd<br>

                         Max        Ratio        </font><font face="Courier New, Courier, monospace"><font face="Courier New,

        Courier, monospace">Max        Ratio</font></font><font face="Courier New, Courier, monospace"><font face="Courier New,

        Courier, monospace"><font face="Courier New, Courier, monospace">       

        </font><font face="Courier New, Courier, monospace"><font face="Courier New, Courier, monospace">Max        Ratio</font></font></font></font><font face="Courier New, Courier, monospace"><font face="Courier New,

        Courier, monospace"><font face="Courier New, Courier, monospace"><font face="Courier New, Courier, monospace"><font face="Courier

              New, Courier, monospace">        </font><font face="Courier New, Courier, monospace"><font face="Courier

                New, Courier, monospace">Max        Ratio<br>

              </font></font></font></font></font>KSPSolve          

      1.203E+02    1.0        1.210E+02    1.0        1.399E+02   

      1.1        1.365E+02    1.0<br>

      VecTDot            6.376E+00    3.7        6.551E+00    4.0       

      7.885E+00    2.9        7.175E+00    3.4<br>

      VecNorm            4.579E+00    7.1        5.803E+00   10.2       

      8.534E+00    6.9        6.026E+00    4.9<br>

      VecScale           1.070E-01    2.1        1.129E-01    2.2       

      1.301E-01    2.5        1.270E-01    2.4<br>

      VecCopy            1.123E-01    1.3        1.149E-01    1.3       

      1.301E-01    1.6        1.359E-01    1.6<br>

      VecSet             7.063E-01    1.7        6.968E-01    1.7       

      7.432E-01    1.8        7.425E-01    1.8<br>

      VecAXPY            1.166E+00    1.4        1.167E+00    1.4       

      1.221E+00    1.5        1.279E+00    1.6<br>

      VecAYPX            1.317E+00    1.6        1.290E+00    1.6       

      1.536E+00    1.9        1.499E+00    2.0<br>

      VecScatterBegin    6.142E+00    3.2        5.974E+00    2.8       

      6.448E+00    3.0        6.472E+00    2.9<br>

      VecScatterEnd      3.606E+01    4.2        3.551E+01    4.0       

      5.244E+01    2.7        4.995E+01    2.7<br>

      MatMult            3.561E+01    1.6        3.403E+01    1.5       

      3.435E+01    1.4        3.332E+01    1.4<br>

      MatMultAdd         1.124E+01    2.0        1.130E+01    2.1       

      2.093E+01    2.9        1.995E+01    2.7<br>

      MatMultTranspose   1.372E+01    2.5        1.388E+01    2.6       

      1.477E+01    2.2        1.381E+01    2.1<br>

      MatSolve           1.949E-02    0.0        1.653E-02    0.0       

      4.789E-02    0.0        4.466E-02    0.0<br>

      MatSOR             6.610E+01    1.3        6.673E+01    1.3       

      7.111E+01    1.3        7.105E+01    1.3<br>

      MatResidual        2.647E+01    1.7        2.667E+01    1.7       

      2.446E+01    1.4        2.467E+01    1.5<br>

      PCSetUpOnBlocks    5.266E-03    1.4        5.295E-03    1.4       

      5.427E-03    1.5        5.289E-03    1.4<br>

      PCApply            1.031E+02    1.0        1.035E+02    1.0       

      1.180E+02    1.0        1.164E+02    1.0<br>

      <br>

    </font>I also slimmed down my code and basically wrote a simple weak

    scaling test (source files attached) so you can profile it yourself.

    I appreciate the offer Junchao, thank you.<br>

    You can adjust the system size per processor at runtime via

    "-nodes_per_proc 30" and the number of repeated calls to the

    function containing KSPsolve() via "-iterations 1000". The physical

    problem is simply calculating the electric potential from a

    homogeneous charge distribution, done multiple times to accumulate

    time in KSPsolve().<br>

    A job would be started using something like<br>

    <blockquote>mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30

      -mesh_size 1E-4 -iterations 1000 \\<br>

       -ksp_rtol 1E-6 \<br>

       -log_view -log_sync\<br>

       -pc_type gamg -pc_gamg_type classical\<br>

       -ksp_type cg \<br>

       -ksp_norm_type unpreconditioned \<br>

       -mg_levels_ksp_type richardson \<br>

       -mg_levels_ksp_norm_type none \<br>

       -mg_levels_pc_type sor \<br>

       -mg_levels_ksp_max_it 1 \<br>

       -mg_levels_pc_sor_its 1 \<br>

       -mg_levels_esteig_ksp_type cg \<br>

       -mg_levels_esteig_ksp_max_it 10 \<br>

       -gamg_est_ksp_type cg<br>

    </blockquote>

    , ideally started on a cube number of processes for a cubical

    process grid.<br>

    Using 125 processes and 10.000 iterations I get the output in

    "log_view_125_new.txt", which shows the same imbalance for me.<span class="m_-4928476107018496294m_6474898020160081090gmail-HOEnZb"><font color="#888888"><br>

    <p>Michael<br>

    </p></font></span><div><div class="m_-4928476107018496294m_6474898020160081090gmail-h5">

    <br>

    <br>

    <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626moz-cite-prefix">Am 02.06.2018 um 13:40 schrieb Mark

      Adams:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr"><br>

        <div class="gmail_extra"><br>

          <div class="gmail_quote">On Fri, Jun 1, 2018 at 11:20 PM,

            Junchao Zhang <span dir="ltr"><<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>></span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

              <div dir="ltr">Hi,Michael,

                <div>  You can add -log_sync besides -log_view, which

                  adds barriers to certain events but measures barrier

                  time separately from the events. I find this option

                  makes it easier to interpret log_view output.</div>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>That is great (good to know).</div>

            <div><br>

            </div>

            <div>This should give us a better idea if your large

              VecScatter costs are from slow communication or if it

              catching some sort of load imbalance.</div>

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

              <div class="gmail_extra"><span class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626HOEnZb"><font color="#888888"><br clear="all">

                    <div>

                      <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail_signature">

                        <div dir="ltr">--Junchao Zhang</div>

                      </div>

                    </div>

                  </font></span>

                <div>

                  <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626h5">

                    <br>

                    <div class="gmail_quote">On Wed, May 30, 2018 at

                      3:27 AM, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-gie<wbr>ssen.de</a>></span>

                      wrote:<br>

                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                        <div bgcolor="#FFFFFF"> Barry: On its way. Could

                          take a couple days again.<br>

                          <br>

                          Junchao: I unfortunately don't have access to

                          a cluster with a faster network. This one has

                          a mixed 4X QDR-FDR InfiniBand 2:1 blocking

                          fat-tree network, which I realize causes

                          parallel slowdown if the nodes are not

                          connected to the same switch. Each node has 24

                          processors (2x12/socket) and four NUMA domains

                          (two for each socket).<br>

                          The ranks are usually not distributed

                          perfectly even, i.e. for 125 processes, of the

                          six required nodes, five would use 21 cores

                          and one 20.<br>

                          Would using another CPU type make a difference

                          communication-wise? I could switch to faster

                          ones (on the same network), but I always

                          assumed this would only improve performance of

                          the stuff that is unrelated to communication.<span class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-HOEnZb"><font color="#888888"><br>

                              <br>

                              Michael</font></span>

                          <div>

                            <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-h5"><br>

                              <br>

                              <br>

                              <blockquote type="cite">

                                <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442moz-text-html" lang="x-unicode">

                                  <div dir="ltr">

                                    <div>The log files have something

                                      like "Average time for zero size

                                      MPI_Send(): 1.84231e-05". It looks

                                      you ran on a cluster with a very

                                      slow network. A typical machine

                                      should give less than 1/10 of the

                                      latency you have. An easy way to

                                      try is just running the code on a

                                      machine with a faster network and

                                      see what happens.<br>

                                    </div>

                                    <br>

                                    <div>Also, how many cores & numa

                                      domains does a compute node have?

                                      I could not figure out how you

                                      distributed the 125 MPI ranks

                                      evenly.</div>

                                  </div>

                                  <div class="gmail_extra"><br clear="all">

                                    <div>

                                      <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442gmail_signature">

                                        <div dir="ltr">--Junchao Zhang</div>

                                      </div>

                                    </div>

                                    <br>

                                    <div class="gmail_quote">On Tue, May

                                      29, 2018 at 6:18 AM, Michael

                                      Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-gie<wbr>ssen.de</a>></span>

                                      wrote:<br>

                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                                        <div bgcolor="#FFFFFF">

                                          <p>Hello again,</p>

                                          <p>here are the updated

                                            log_view files for 125 and

                                            1000 processors. I ran both

                                            problems twice, the first

                                            time with all processors per

                                            node allocated ("-1.txt"),

                                            the second with only half on

                                            twice the number of nodes

                                            ("-2.txt").<br>

                                          </p>

                                          <span> <br>

                                            <blockquote type="cite">

                                              <blockquote type="cite">

                                                <pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-gie<wbr>ssen.de></a> wrote:

I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES). 

</pre>

                                              </blockquote>

                                              <pre>   Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed?   We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.

</pre>

                                            </blockquote>

                                            <br>

                                          </span> I mean this, right in

                                          the log_view output:<br>

                                          <br>

                                          <blockquote type="cite"><font size="-1">Memory usage is

                                              given in bytes:<br>

                                            </font> <font size="-1"><br>

                                              Object Type Creations

                                              Destructions Memory

                                              Descendants' Mem.<br>

                                              Reports information only

                                              for process 0.<br>

                                            </font> <font size="-1"><br>

                                              --- Event Stage 0: Main

                                              Stage<br>

                                            </font> <font size="-1"><br>

                                              ...<br>

                                            </font> <font size="-1"><br>

                                              --- Event Stage 1: First

                                              Solve<br>

                                            </font> <font size="-1"><br>

                                              ...<br>

                                            </font> <font size="-1"><br>

                                              --- Event Stage 2:

                                              Remaining Solves<br>

                                            </font> <font size="-1"><br>

                                              Vector 23904 23904

                                              1295501184 0. </font></blockquote>

                                          I logged the exact number of

                                          KSP iterations over the 999

                                          timesteps and its exactly

                                          23904/6 = 3984. <span>

                                            <p>Michael<br>

                                            </p>

                                            <p><br>

                                            </p>

                                            <br>

                                            <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-cite-prefix">Am

                                              24.05.2018 um 19:50

                                              schrieb Smith, Barry F.:<br>

                                            </div>

                                          </span>

                                          <div>

                                            <div class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442h5">

                                              <blockquote type="cite">

                                                <pre>  Please send the log file for 1000 with cg as the solver.

   You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell with the two logs you sent us since they are for different solvers.)

</pre>

                                                <blockquote type="cite">

                                                  <pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_-4928476107018496294m_6474898020160081090gmail-m_1637528017210283626m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-gie<wbr>ssen.de></a> wrote:

I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES). 

</pre>

                                                </blockquote>

                                                <pre>   Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed?   We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.

</pre>

                                                <blockquote type="cite">

                                                  <pre>This seems kind of wasteful, is this supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).

Thanks in advance.

Michael

<log_view_125procs.txt><log_vi<wbr>ew_1000procs.txt>

</pre>

                                                </blockquote>

                                              </blockquote>

                                              <br>

                                            </div>

                                          </div>

                                        </div>

                                      </blockquote>

                                    </div>

                                    <br>

                                  </div>

                                </div>

                              </blockquote>

                              <br>

                            </div>

                          </div>

                        </div>

                      </blockquote>

                    </div>

                    <br>

                  </div>

                </div>

              </div>

            </blockquote>

          </div>

          <br>

        </div>

      </div>

    </blockquote>

    <br>

  </div></div></div>

</blockquote></div><br></div></div></div></div>