<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <span dir="ltr"><<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,Michael,<div> You can add -log_sync besides -log_view, which adds barriers to certain events but measures barrier time separately from the events. I find this option makes it easier to interpret log_view output.</div></div></blockquote><div><br></div><div>That is great (good to know).</div><div><br></div><div>This should give us a better idea if your large VecScatter costs are from slow communication or if it catching some sort of load imbalance.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra"><span class="HOEnZb"><font color="#888888"><br clear="all"><div><div class="m_6325123414585814924gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div></font></span><div><div class="h5">
<br><div class="gmail_quote">On Wed, May 30, 2018 at 3:27 AM, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-<wbr>giessen.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF">
Barry: On its way. Could take a couple of days again.

Junchao: I unfortunately don't have access to a cluster with a faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree network, which I realize causes parallel slowdown if the nodes are not connected to the same switch. Each node has 24 processors (2x12/socket) and four NUMA domains (two per socket).

The ranks are usually not distributed perfectly evenly, i.e. for 125 processes, of the six required nodes, five would use 21 cores and one would use 20 (5*21 + 20 = 125).
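As a minimal sketch (plain MPI, not part of Michael's code; the program and output format are made up for illustration), this is one way the actual per-node rank counts could be printed to confirm the 5x21 + 1x20 layout:

#include <mpi.h>
#include <stdio.h>

/* Count how many ranks share each node by splitting COMM_WORLD into
 * per-node (shared-memory) communicators. Run with the full job size. */
int main(int argc, char **argv)
{
  int      world_rank, node_size, node_rank;
  MPI_Comm node_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_size(node_comm, &node_size);
  MPI_Comm_rank(node_comm, &node_rank);

  if (node_rank == 0)  /* one line per node, e.g. "node leader 0: 21 ranks" */
    printf("node leader %d: %d ranks on this node\n", world_rank, node_size);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}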

Would using another CPU type make a difference communication-wise? I could switch to faster ones (on the same network), but I always assumed this would only improve performance of the stuff that is unrelated to communication.

Michael

<blockquote type="cite">
<div class="m_6325123414585814924gmail-m_-6224261030758995442moz-text-html" lang="x-unicode">
<div dir="ltr">
<div>The log files have something like "Average time for zero
size MPI_Send(): 1.84231e-05". It looks you ran on a cluster
with a very slow network. A typical machine should give less
than 1/10 of the latency you have. An easy way to try is
just running the code on a machine with a faster network and
see what happens.<br>
</div>
<br>
<div>Also, how many cores & numa domains does a compute
node have? I could not figure out how you distributed the
125 MPI ranks evenly.</div>
</div>
<div class="gmail_extra"><br clear="all">
<div>
<div class="m_6325123414585814924gmail-m_-6224261030758995442gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
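For context on that latency figure: the number -log_view reports for a zero-size MPI_Send() is essentially a per-message latency. A rough sketch of the kind of measurement it represents, in plain MPI (this is only an illustration, not the code PETSc actually uses for its summary):

#include <mpi.h>
#include <stdio.h>

/* Time a batch of zero-size sends from rank 0 to rank 1 and report the
 * average per message. Run with at least 2 ranks. */
int main(int argc, char **argv)
{
  const int nreps = 1000;
  int       rank, i;
  double    t0, avg;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < nreps; i++) {
    if (rank == 0)      MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
  }
  avg = (MPI_Wtime() - t0) / nreps;
  if (rank == 0) printf("average zero-size send time: %g s\n", avg);

  MPI_Finalize();
  return 0;
}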

On Tue, May 29, 2018 at 6:18 AM, Michael Becker <Michael.Becker@physik.uni-giessen.de> wrote:

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hello again,</p>
<p>here are the updated log_view files for 125 and 1000
processors. I ran both problems twice, the first time
with all processors per node allocated ("-1.txt"), the
second with only half on twice the number of nodes
("-2.txt").<br>
</p>
<span> <br>
<blockquote type="cite">
<blockquote type="cite">
<pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-gie<wbr>ssen.de></a> wrote:
I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
</pre>
</blockquote>
<pre> Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.
</pre>
</blockquote>

I mean this, right in the log_view output:

> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> ...
>
> --- Event Stage 1: First Solve
>
> ...
>
> --- Event Stage 2: Remaining Solves
>
>               Vector     23904          23904   1295501184     0.

I logged the exact number of KSP iterations over the 999 timesteps and it's exactly 23904/6 = 3984.
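For reference, a minimal sketch of the lifecycle Barry describes: KSPSetUp() called once so the work vectors are created once, with the repeated KSPSolve() calls reusing them. The surrounding setup, the function name, and the CG choice are assumptions for illustration, not Michael's actual code.

#include <petscksp.h>

/* Sketch only: Mat A and Vecs b, x are assumed to already exist with the
 * right parallel layout. */
PetscErrorCode solve_timesteps(Mat A, Vec b, Vec x, PetscInt nsteps)
{
  KSP            ksp;
  PetscInt       step;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);           /* work vectors are created here ... */

  for (step = 0; step < nsteps; step++) {
    /* ... update the right-hand side b for this timestep ... */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* ... and reused by every solve */
  }

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);        /* ... and freed only here */
  return 0;
}

If the solves really followed this pattern, the Vector creation count in the "Remaining Solves" stage should stay small, which is why the 23904 creations above look suspicious.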

Michael


On 24.05.2018 at 19:50, Smith, Barry F. wrote:

<blockquote type="cite">
<pre> Please send the log file for 1000 with cg as the solver.
You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell with the two logs you sent us since they are for different solvers.)
</pre>
<blockquote type="cite">
<pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-gie<wbr>ssen.de></a> wrote:
I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
</pre>
</blockquote>
<pre> Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.
</pre>
<blockquote type="cite">
<pre>This seems kind of wasteful, is this supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).
Thanks in advance.
Michael
<log_view_125procs.txt><log_vi<wbr>ew_1000procs.txt>
</pre>
</blockquote>
</blockquote>