<div class="moz-cite-prefix">On 6/15/19 12:46 AM, Hapla Vaclav
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On 14 Jun 2019, at 21:53, Jakub Kruzik <<a
href="mailto:jakub.kruzik@vsb.cz" class=""
moz-do-not-send="true">jakub.kruzik@vsb.cz</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">The problem is that you need to write the
file with an optimal stripe count/size in the first
place. An unaware user who just uses something like cp
will end up with the default stripe count which is
usually 1.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Sure. This is clear I guess. I should add that it can be
a bit challenging to "defeat" the linux page cache. E.g.
writing a file and reading it right away can result in
ridiculously high read rate as it is actually read from RAM
:-) <br>
</div>
</div>
</div>
</blockquote>

As far as I know, Lustre does not use the Linux page cache on the
server side. Since version 2.9 it has a server-side cache, but that is
supposed to be used for small files only. You can try lfs ladvise -a
dontneed <file>, but there is no guarantee that the file will be
cleared from the cache if it is already there. See
http://doc.lustre.org/lustre_manual.xhtml#idm140012896072288
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<div><br class="">
</div>
<div>What I'm doing to cope with both issues, I always</div>
<div>1) remove data.striped.h5</div>
<div>2) set the stripe settings to the
non-existing data.striped.h5, which creates
new data.striped.h5 with zero size</div>
<div>3) copy the file over from original data.h5 stored
somewhere else to that data.striped.h5</div>
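
(An aside, not something from the thread above: if the striped copy is
written by an MPI code instead of being staged with cp, a ROMIO-based
MPI-IO can usually be asked for the striping at creation time through
hints, avoiding the rm/setstripe/cp dance. A rough sketch only; the
stripe count of 24 and the 4 MiB stripe size are just placeholders:)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* Lustre striping hints understood by ROMIO; values are examples only. */
    MPI_Info_set(info, "striping_factor", "24");     /* stripe count       */
    MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size, bytes  */

    /* Striping is fixed at creation, so the hints only take effect with
     * MPI_MODE_CREATE on a file that does not exist yet. */
    MPI_File_open(MPI_COMM_WORLD, "data.striped.h5",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... write the data collectively ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}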
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
For large files, you should just set the stripe count to
the number of OSTs. Your results seem to support this.<br
class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Sure. Would be cool to have some clear limit for "large"
;-) But in these case it's definitely better to overshoot
the number of stripes rather than underestimate.</div>
</div>
</div>
</blockquote>

Agreed. I would say a large file is one big enough that you actually
care how fast you are reading it :)
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
For the small mesh and 64 nodes, you are reading just 2
MiB per process. I think that collective I/O should give
you a significant improvement.<br class="">
</div>
</div>
>
> OK, I'm giving it another shot now that the results with
> non-collective I/O look credible. I'm curious about that
> "significant" ;-)
>
> But even if you are right, it's kind of tricky to say when this
> toggle should be turned on, or even to decide it automatically in
> PETSc...
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>Note that the default number of aggregators is usually equal
to the number of OSTs (or stripe count?). I would try setting
cb_nodes to a multiple of the number of OSTs close to the number
of nodes used.<br>
</div>
</div>
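
In case it helps, here is a rough sketch of how such hints could be
passed down to HDF5 through the MPI-IO file access property list. This
is not how PETSc wires it up internally; the hint names are the
standard ROMIO ones and the value of 48 aggregators is only an example:

#include <hdf5.h>
#include <mpi.h>

hid_t open_with_cb_hints(const char *fname, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");  /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "48");           /* number of aggregator nodes */

    /* Hand the hints to HDF5 via the MPI-IO file access property list. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}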
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Also, it would be interesting to know what performance
you get from a single process reading from a single OST.
I think you should be able to get 0.5-2.5 GiB/s which is
what you are getting from 36 OSTs (~70 MiB/s per OST).<br
class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Wait, if you look at the table, it's a bit outdated
(before Atlanta), sorry for confusion. The new graphs on
slide 18 show the rate of approx. 10.5/3.5 = 3 GiB/s for the
128M mesh.</div>
<div><br class="">
</div>
<div>Here are graphs showing load time for 3 different stripe
counts and several different cpu counts.</div>
<div>128M elements: <a
href="https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/kBC4ZY6bWOAWCMY</a></div>
<div>256M elements: <a
href="https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/F7SvNWuCiBUKiIz</a></div>
<div><br class="">
</div>
<div>For the 256M one I got up to ~4.5 GiB/s.</div>
<div><br class="">
</div>
<div>It's slowing down with growing number of cpus. I wonder
whether it could be further improved, but it's not a big
deal for now.</div>
<br class="">
</div>
</div>
</blockquote>

For 12k processes, you are trying to read less than 2 MiB per process,
and each OST has more than 340 clients. In this case, you should read
on a subset of processes and then distribute the data - effectively
what collective I/O should do, if the settings are correct.
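
A rough sketch of that read-on-a-subset pattern (not PETSc's actual
code path; the group size of 256 ranks per reader and the flat
byte-offset layout are placeholder assumptions):

#include <mpi.h>
#include <stdlib.h>

/* One rank out of every 256 reads its group's contiguous share of the
 * file and scatters it; all other ranks only receive. Error handling
 * and the real data decomposition are omitted. */
void read_on_subset(const char *fname, char *mybuf, int chunk /* bytes per rank */)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int group = 256;               /* ranks served by one reader      */
    int leader = rank - rank % group;    /* lowest global rank in my group  */
    int nlocal = (leader + group <= size) ? group : size - leader;

    /* Split into groups; the leader becomes rank 0 of each sub-communicator. */
    MPI_Comm gcomm;
    MPI_Comm_split(MPI_COMM_WORLD, leader, rank, &gcomm);

    char *filebuf = NULL;
    if (rank == leader) {
        filebuf = malloc((size_t)nlocal * chunk);
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, (MPI_Offset)leader * chunk, filebuf,
                         nlocal * chunk, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    /* Hand each rank in the group its own chunk. */
    MPI_Scatter(filebuf, chunk, MPI_BYTE, mybuf, chunk, MPI_BYTE, 0, gcomm);

    free(filebuf);
    MPI_Comm_free(&gcomm);
}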
<blockquote type="cite"
cite="mid:D8F24703-A4DC-4C94-B1CC-7D2CC7235D32@erdw.ethz.ch">
<div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">
<div>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
BTW, since you also used Salomon for testing, I found
some old tests I did there with pure MPI I/O, and I was
able to get 18.5 GiB/s read for 1 GiB file on 108
processes / 54 nodes, 54 OSTs, 4 MiB stripe.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
OK, but it's probably not a good time to try to reproduce
these just now. The current greeting message:</div>
<div><br class="">
</div>
<div>Planned Salomon /Scratch Maintanance From 2019-06-18 09:00
Till 2019-06-21 13:00<br class="">
(2019-06-11 08:58:35)
<br class="">
<br class="">
We plan to upgrade Lustre stack. We hope to resolve some
performance issues<br class="">
with SCRATCH.</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div>Thanks,</div>
<div>Vaclav</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Best,<br class="">
</div>
</div>
</blockquote>
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Jakub<br class="">
<br class="">
<br class="">
On 6/14/19 12:31 PM, Hapla Vaclav via petsc-dev wrote:<br
class="">
<blockquote type="cite" class="">I take back one thing I
mentioned in my talk in Atlanta. I think I said that
Lustre striping does not really influence the read
performance. With my latest results in hand, I must
point out this is not true. I might have been confused
by some former Piz Daint Lustre performance issues
and/or HDF5 library issues I mentioned.<br class="">
<br class="">
Here are my latest slides from PASC19.<br class="">
<a
href="https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS"
class="" moz-do-not-send="true">https://polybox.ethz.ch/index.php/s/PPZLSyZOKo3UXPS</a><br
class="">
<br class="">
On slide 18, there is some comparison for different
stripe settings. I can now see a speed-up of ~4 for 1
vs 12 stripes (which is actually the number of cores
per node) for the mesh with 128M elements. The times
are very similar for 8 and 64 computation nodes.<br
class="">
<br class="">
Toby, could you maybe forward this message to the
meeting attendees? I don't want to leave anybody
confused.<br class="">
<br class="">
Thanks,<br class="">
Vaclav<br class="">
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
</body>
</html>