<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
<p>And one last follow-up here (thanks to a ton of help off-list
  from Philip!).  The core problem is a bad interaction between the
  Argobots memory allocator, the stack size used by Margo, and
  transparent huge page (THP) handling on some systems.  It can
  result in newly allocated memory taking several milliseconds to
  page in for use, at very inconvenient times.<br>
</p>
    <p>You can work around it on affected systems by using
      prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) to disable automatic THP
      activation.</p>
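<p>For reference, here is a minimal sketch of that workaround
  (assuming a Linux kernel new enough to support PR_SET_THP_DISABLE,
  i.e. 3.15 or later); the call would go early in server startup,
  before Margo and Argobots allocate their ULT stacks:</p>
<pre>
#include &lt;sys/prctl.h&gt;
#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    /* Opt this process out of automatic transparent huge page
     * promotion so newly allocated ULT stacks fault in as regular
     * pages instead of stalling on THP handling. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_THP_DISABLE)");

    /* ... margo_init() and the rest of the service setup ... */
    return 0;
}
</pre>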
<p>We'll pursue a longer-term solution in the form of a custom stack
      allocator for Margo; that effort will be tracked on GitHub at
      <a class="moz-txt-link-freetext" href="https://github.com/mochi-hpc/mochi-margo/issues/166">https://github.com/mochi-hpc/mochi-margo/issues/166</a>.</p>
    <p>thanks,</p>
    <p>-Phil<br>
    </p>
    <div class="moz-cite-prefix">On 11/10/21 7:45 PM, Phil Carns wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:0d9a618a-6775-7100-ca16-6ff22529956a@mcs.anl.gov">
      
      <p>Just to follow up for folks on the list: we've made some
        headway on isolating the source of the delays.  It isn't network
        specific, but rather an issue in the Argobots ULT creation path
        on some platforms.  It's being tracked now at <a class="moz-txt-link-freetext" href="https://github.com/pmodels/argobots/issues/369" moz-do-not-send="true">https://github.com/pmodels/argobots/issues/369</a>
        with a standalone Argobots reproducer.  It should be a lot
        easier to hunt down from this point.<br>
      </p>
      <p>Some details:</p>
      <ul>
        <li>This only happens on particular machines (Frontera@TACC and
          Cooley@ALCF for example) but we don't know why just yet.<br>
        </li>
<li>It only happens if you use dedicated threads (execution
          streams) for handling RPCs, which is a common configuration
          for many Mochi data services.</li>
        <li>The test case also happens to highlight that we have some
          memory allocation/caching tuning work to do even on platforms
          that behave correctly.  We have a lot of options on that
          front, but we'll circle back to it later once the major
          platform-specific performance bug is resolved.</li>
      </ul>
      <p>thanks,</p>
      <p>-Phil<br>
      </p>
      <blockquote type="cite" cite="mid:7B252C38-02B7-40EE-A799-7856565F7C0E@rutgers.edu"> One
        more thing: I tried the same test case with a non-trivial size
        for the bulk pull (8MB) just to shift the timing some. I still
        see the extra time appear on the 4th and 8th timestep. I had
        wondered if the issue was that the timestep period was small
        enough that some background task was not completing in time, but
        that does not seem to be the case:
        <div class=""><br class="">
          <div class="">
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000,
              "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000,
              "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000,
              "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000,
              "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000,
              "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000,
              "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000,
              "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000,
              "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000,
              "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000,
              "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>
            <div><br class="">
              <blockquote type="cite" class="">
                <div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns
                  <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                  wrote:</div>
                <br class="Apple-interchange-newline">
                <div class="">
                  <div class="">
                    <p class=""><br class="">
                    </p>
                    <div class="moz-cite-prefix">On 11/5/21 12:23 PM,
                      Philip Davis wrote:<br class="">
                    </div>
                    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> That’s a good find. I’m reading through
                      the code and documentation, and I’m having a
                      little trouble understanding what the difference
                      between 0 and -1 for that last argument is when
                      the second to last argument is 0. I see in
                      documentation:
                      <div class=""><br class="">
                      </div>
                      <div class=""><span style="caret-color: rgb(64,
                          64, 64); color: rgb(64, 64, 64); font-family:
                          Lato, proxima-nova, "Helvetica
                          Neue", Arial, sans-serif; font-size:
                          16px; background-color: rgb(252, 252, 252);" class="">The third argument indicates whether
                          an Argobots execution stream (ES) should be
                          created to run the Mercury progress loop. If
                          this argument is set to 0, the progress loop
                          is going to run in the context of the main ES
                          (this should be the standard scenario, unless
                          you have a good reason for not using the main
                          ES, such as the main ES using MPI primitives
                          that could block the progress loop). A value
                          of 1 will make Margo create an ES to run the
                          Mercury progress loop. The fourth argument is
                          the number of ES to create and use for
                          executing RPC handlers. A value of 0 will make
                          Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing: border-box; font-family:
                          SFMono-Regular, Menlo, Monaco, Consolas,
                          "Liberation Mono", "Courier
                          New", Courier, monospace; white-space:
                          nowrap; max-width: 100%; background-color:
                          rgb(255, 255, 255); border: 1px solid rgb(225,
                          228, 229); padding: 2px 5px; color: rgb(231,
                          76, 60); overflow-x: auto;"><span class="pre" style="box-sizing: border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color:
                          rgb(64, 64, 64); font-family: Lato,
                          proxima-nova, "Helvetica Neue",
                          Arial, sans-serif; font-size: 16px;
                          background-color: rgb(252, 252, 252);" class="">. A value of -1 will make Margo
                          execute the RPCs in the ES running the
                          progress loop. A positive value will make
                          Margo create new ESs to run the RPCs.</span></div>
                      <div class=""><font class="" size="3" face="Lato,
                          proxima-nova, Helvetica Neue, Arial,
                          sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64);
                            background-color: rgb(252, 252, 252);" class=""><br class="">
                          </span></font></div>
                      <span class="">What is the difference between the
                        ‘main ES’ (last two arguments 0,-1) and the 'ES
                        that called margo_init’  (last two arguments
                        0,0) in the absence of me creating new execution
                        streams? Or maybe I’m not interpreting the
                        documentation correctly?<br class="">
                      </span></blockquote>
                    <p class=""><br class="">
                    </p>
                    <p class="">You are right, those are equivalent when
                      the next to last argument is 0 :)  The main ES and
                      progress thread ES are one and the same in that
                      case, so the RPCs go to the same place either way.</p>
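                    <p class="">For concreteness, a quick sketch of the
                      combinations being discussed (server side; the
                      address string and MARGO_SERVER_MODE here are just
                      a typical example, not the exact call from the
                      test case):</p>
<pre class="">
#include &lt;margo.h&gt;

margo_instance_id mid;

/* 0, 0: progress loop and RPC handlers both run in the caller's (main) ES */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 0);

/* 0, -1: no progress ES is created and RPCs run in the ES driving the
 * progress loop -- which is again the main ES, hence the equivalence */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1);

/* 1, 2: dedicated progress ES plus two dedicated RPC handler ESs */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 1, 2);
</pre>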
                    <p class="">I've narrowed down a little further and
                      found that the stalls occur when there are 2
                      dedicated RPC ESs but not when there is just 1
                      dedicated RPC ES.  That isolates the problem
                      slightly further, in that it's not just some cost
                      associated with relaying ULTs to another thread
                      (that's happening every time in the 1 handler case
                      too), but something that only happens when
                      multiple ESes could potentially service an RPC.</p>
                    <p class="">thanks,</p>
                    <p class="">-Phil<br class="">
                      <span class=""></span></p>
                    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> <span class="">
                        <blockquote type="cite" class="">On Nov 5, 2021,
                          at 10:29 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                          wrote:<br class="">
                          <br class="">
                          Srinivasan Ramesh (U. Oregon) has done some
                          work on fine-grained RPC component timing, but
                          it's not in the mainline margo tree so we'll
                          need to do a little more work to look at it.<br class="">
                          <br class="">
                          In the meantime, on a hunch, I found that I can
                          make the latency consistent on Cooley by
                          altering the margo_init() arguments to
                          be (..., 0, -1) in server.c (meaning that no
                          additional execution streams are used at all;
                          all Mercury progress and all RPC handlers are
                          executed using user-level threads in
                          the process's primary execution stream (OS
                          thread)).<br class="">
                          <br class="">
                          It's expected that there would be some more
                          jitter jumping across OS threads for RPC
                          handling, but it shouldn't be that
                          extreme, regular, or system-specific.<br class="">
                          <br class="">
                          Thanks again for the test case and the Apex
                          instrumentation; this is the sort of thing
                          that's normally really hard to isolate.<br class="">
                          <br class="">
                          thanks,<br class="">
                          <br class="">
                          -Phil<br class="">
                          <br class="">
                          On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
                          <blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class=""> That’s extremely interesting. <br class="">
                            <br class="">
                            Are there any internal timers in Margo that
                            can tell what the delay was between the
                            server’s progress thread queueing the rpc
                            and the handler thread starting to handle
                             it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
                            it seems to me like that is the most likely
                            place for non-deterministic delay to be
                            introduced by argobots in the client ->
                            server direction.<br class="">
                            <br class="">
                            I just ran a quick test where I changed the
                            number of handler threads to 5, and I saw no
                            change in behavior (still 4 and 8, not 5
                            and 10).<br class="">
                            <br class="">
                            <br class="">
                            <blockquote type="cite" class="">On Nov 4,
                              2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                <carns@mcs.anl.gov></a> wrote:<br class="">
                              <br class="">
                              I have some more interesting info to
                              share from trying a few different
                              configurations.<br class="">
                              <br class="">
                              sm (on my laptop) and ofi+gni (on
                              theta) do not exhibit this behavior; they
                              have consistent performance across RPCs.<br class="">
                              <br class="">
                              ofi+verbs (on cooley) shows the same
                              thing you were seeing; the 4th and 8th
                              RPCs are slow.<br class="">
                              <br class="">
                              Based on the above, it would sound like
                              a problem with the libfabric/verbs path.
                               But on Jerome's suggestion I tried some
                               other supported transports on cooley as
                               well.  In particular I ran the same
                               benchmark (the same build in fact; I
                               just compiled in support for multiple
                               transports and cycled through them in a
                               single job script with runtime options)
                               with these combinations:<br class="">
                              <br class="">
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+verbs<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+tcp<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+sockets<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                bmi+tcp<br class="">
                              </div>
                              All of them show the same thing!  4th
                              and 8th RPCs are at least an order of
                              magnitude slower than the other RPCs.
                               That was a surprising
                              result.  The bmi+tcp one isn't even using
                              libfabric at all, even though they are all
                              using the same underlying hardware.<br class="">
                              <br class="">
                              I'm not sure what to make of that
                              yet.  Possibly something with threading or
                              signalling?<br class="">
                              <br class="">
                              thanks,<br class="">
                              <br class="">
                              -Phil<br class="">
                              <br class="">
                              On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
                              <blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class=""> I’m glad you were able to
                                reproduce it on a different system,
                                thanks for letting me know. I’m not sure
                                what the overlaps between Frontera
                                and Cooley are (that aren’t shared by
                                Summit); a quick look shows they are
                                 both Intel, and both FDR, but there are
                                 probably more salient details.<br class="">
                                <br class="">
                                <blockquote type="cite" class="">On Nov
                                  2, 2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                    <carns@mcs.anl.gov></a> wrote:<br class="">
                                  <br class="">
                                  Ok.  Interesting.  I didn't see
                                  anything unusual in the timing on
                                  my laptop with sm (other than it being
                                  a bit slow, but I wasn't tuning
                                  or worrying about core affinity
                                  or anything).  On Cooley, a somewhat
                                  older Linux cluster with InfiniBand, I
                                  see the 4th and 8th RPC delay you were
                                  talking about:<br class="">
                                  <br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                  <br class="">
                                  (assuming the first is high due to
                                  connection establishment)<br class="">
                                  <br class="">
                                  I'll check some
                                  other systems/transports, but I wanted
                                  to go ahead and share that I've been
                                  able to reproduce what you were
                                  seeing.<br class="">
                                  <br class="">
                                  thanks,<br class="">
                                  <br class="">
                                  -Phil<br class="">
                                  <br class="">
                                  On 11/2/21 1:49 PM, Philip Davis
                                  wrote:<br class="">
                                  <blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class=""> Glad that’s working now.<br class="">
                                    <br class="">
                                    It is the put_wait events, and “dur”
                                    is the right field. Those units
                                    are microseconds.<br class="">
                                    <br class="">
                                    <br class="">
                                    <blockquote type="cite" class="">On
                                      Nov 2, 2021, at 1:12 PM, Phil
                                      Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                        <carns@mcs.anl.gov></a> wrote:<br class="">
                                      <br class="">
                                      Thanks Philip, the "=
                                      {0};" initialization of
                                      that struct got me going.<br class="">
                                      <br class="">
                                      I can run the test case now and it
                                      is producing output in the client
                                      and server perf dirs.  Just to
                                      sanity check what to look for,
                                      I think the problem should be
                                      exhibited in the "put_wait" or
                                      maybe "do_put" trace events on the
                                      client?  For example on my laptop
                                      I see this:<br class="">
                                      <br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
                                      do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                      <br class="">
                                      I should look at the "dur" field
                                      right?  What are the units
                                      on that?<br class="">
                                      <br class="">
                                      I'll see if I can run this on a
                                      "real" system shortly.<br class="">
                                      <br class="">
                                      thanks!<br class="">
                                      <br class="">
                                      -Phil<br class="">
                                      <br class="">
                                      On 11/2/21 12:11 PM, Philip Davis
                                      wrote:<br class="">
                                      <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class=""> Hi Phil,<br class="">
                                        <br class="">
                                         Sorry the data structures are
                                         like that; I wanted to preserve
                                         as much of the RPC size and
                                         ordering as possible, in case it
                                         ended up being important.<br class="">
                                        <br class="">
                                        I’m surprised in.odsc.size is
                                        troublesome, as I set
                                        in.odsc.size with the line
                                        `in.odsc.size = sizeof(odsc);`.
                                        I’m not sure what could
                                        be corrupting that value in the
                                        meantime.<br class="">
                                        <br class="">
                                        I don’t set
                                        in.odsc.gdim_size (which was an
                                        oversight, since that’s
                                        non-zero normally), so I’m
                                        less surprised that’s an issue.
                                        I thought I initialized `in`
                                        to zero, but I see I didn’t do
                                        that after all.<br class="">
                                        <br class="">
                                        Maybe change the line
                                        `bulk_gdim_t in;` to
                                        `bulk_gdim_t in = {0};`<br class="">
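                                         <br class="">
                                         In other words (a sketch using
                                         the test case's bulk_gdim_t
                                         input type; the handle and
                                         request variables are just
                                         placeholders):<br class="">
<pre class="">
bulk_gdim_t in = {0};        /* zero every field, including gdim_size */
in.odsc.size = sizeof(odsc); /* set only the fields the test cares about */
/* ... hg_bulk registration, then margo_iforward(handle, &amp;in, &amp;req); ... */
</pre>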
                                        <br class="">
                                        <br class="">
                                        <br class="">
                                        <blockquote type="cite" class="">On
                                          Nov 2, 2021, at 11:48 AM, Phil
                                          Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                          <br class="">
                                          Awesome, thanks Philip.
                                           It came through fine.<br class="">
                                          <br class="">
                                          I started by modifying the job
                                          script slightly to just run it
                                          on my laptop with sm (I
                                          wanted to make sure
                                          I understood the test case,
                                          and how to use apex,
                                           before trying elsewhere).  Does
                                           in.size need to be set
                                           in client.c?  For me there is
                                           a random value in that
                                           field and it is causing the
                                           encoder on the forward
                                           to attempt a
                                           very large allocation.  The
                                           same might be true
                                           of gdim_size if it got
                                           past that step.  I started
                                           to alter them but then I
                                           wasn't sure what
                                           the implications were.<br class="">
                                          <br class="">
                                          (fwiw I needed
                                          to include stdlib.h
                                          in common.h, but I've hit
                                          that a couple of times
                                          recently on codes
                                          that didn't previously generate warnings;
                                          I think something in Ubuntu
                                          has gotten strict about
                                          that recently).<br class="">
                                          <br class="">
                                          thanks,<br class="">
                                          <br class="">
                                          -Phil<br class="">
                                          <br class="">
                                          <br class="">
                                          <br class="">
                                          On 11/1/21 4:51 PM,
                                          Philip Davis wrote:<br class="">
                                          <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class=""> Hi Phil,<br class="">
                                            <br class="">
                                            I’ve attached
                                            the reproducer. I see the
                                            4th and 8th issue on
                                            Frontera, but
                                            not Summit. Hopefully
                                            it will build and run
                                            without too
                                            much modification. Let me
                                            know if there are any issues
                                            with running it (or if the
                                            anl listserv eats the
                                            tarball, which I kind of
                                            expect).<br class="">
                                            <br class="">
                                            Thanks,<br class="">
                                            Philip<br class="">
                                            <br class="">
                                            <br class="">
                                            <blockquote type="cite" class="">On Nov 1,
                                              2021, at 11:14 AM, Phil
                                              Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                              <br class="">
                                              Hi Philip,<br class="">
                                              <br class="">
                                              (FYI I think the first
                                              image didn't come through
                                              in your email, but I
                                              think the others
                                              are sufficient to get
                                              across what you
                                              are seeing)<br class="">
                                              <br class="">
                                              I don't have any idea
                                              what would cause that.
                                               The recently released libfabric 1.13.2 (available
                                              in spack from
                                              the mochi-spack-packages repo)
                                              includes some fixes to the
                                              rxm provider that could
                                              be relevant to Frontera
                                              and Summit, but nothing
                                              that aligns with what you
                                              are observing.<br class="">
                                              <br class="">
                                              If it were later in
                                              the sequence (much later)
                                              I would speculate
                                              that memory allocation/deallocation cycles
                                              were eventually causing
                                              a hiccup.  We've seen
                                              something like that in the
                                              past, and it's a
                                              theory that we could then
                                              test
                                              with alternative allocators like jemalloc.  That's
not memory allocation jitter that early in the run though.<br class="">
                                              <br class="">
                                              Please do share
                                              your reproducer if you
                                              don't mind!  We can try a
                                              few systems here to at
                                              least isolate if it is
                                              something peculiar
                                              to the InfiniBand path or
                                              if there is a more
                                              general problem in Margo.<br class="">
                                              <br class="">
                                              thanks,<br class="">
                                              <br class="">
                                              -Phil<br class="">
                                              <br class="">
                                              On 10/29/21 3:20 PM,
                                              Philip Davis wrote:<br class="">
                                              <blockquote type="cite" class="">Hello,<br class="">
                                                <br class="">
                                                I apologize in advance
                                                for the winding nature
                                                of this email. I’m
                                                not sure how to ask
                                                my question without explaining
                                                the story of my results
                                                some.<br class="">
                                                <br class="">
                                                I’m doing
                                                some characterization of
                                                our server performance
                                                 under load, and I have
                                                 a performance quirk that I
                                                 wanted to run
                                                 by people to see if it
                                                 makes sense. My testing
                                                so far has been
                                                to iteratively send
                                                batches of RPCs
                                                using margo_iforward,
                                                and then measure
                                                the wait time until they
                                                are all complete. On the
                                                server side,
                                                handling the
                                                RPC includes
                                                a margo_bulk_transfer as
                                                a pull initiated
                                                on the server to pull
                                                (for now) 8 bytes.
                                                The payload of the RPC
                                                request is about
                                                500 bytes, and
                                                the response payload is
                                                4 bytes.<br class="">
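                                                 <br class="">
                                                 (For concreteness, the
                                                 server-side handler follows
                                                 the usual Margo pull
                                                 pattern; a rough sketch,
                                                 with the input type, its
                                                 bulk field, and the handler
                                                 name as illustrative
                                                 stand-ins rather than the
                                                 actual code:)<br class="">
<pre class="">
#include &lt;margo.h&gt;

static void put_rpc_ult(hg_handle_t h)
{
    margo_instance_id mid = margo_hg_handle_get_instance(h);
    const struct hg_info *info = margo_get_info(h);

    put_in_t in;                      /* carries the client's bulk handle */
    margo_get_input(h, &amp;in);

    char buf[8];                      /* 8-byte payload pulled from client */
    void *ptr = buf;
    hg_size_t len = sizeof(buf);
    hg_bulk_t local;
    margo_bulk_create(mid, 1, &amp;ptr, &amp;len, HG_BULK_WRITE_ONLY, &amp;local);
    margo_bulk_transfer(mid, HG_BULK_PULL, info-&gt;addr, in.bulk, 0,
                        local, 0, len);

    int32_t ret = 0;                  /* 4-byte response payload */
    margo_respond(h, &amp;ret);

    margo_bulk_free(local);
    margo_free_input(h, &amp;in);
    margo_destroy(h);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
</pre>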
                                                <br class="">
                                                I’ve isolated my results
                                                down to one server rank
                                                and one client
                                                rank, because it’s an
                                                easier starting point to
                                                reason from. Below
                                                is a graph of some of
                                                my initial results.
                                                These results are from
                                                Frontera. The
                                                median times are good (a
                                                single RPC takes on
                                                 the order of 10s
                                                 of microseconds, which
seems fantastic). However, the outliers are fairly high (note the
                                                log scale of
                                                the y-axis). With only
                                                one
                                                RPC per timestep, for
                                                example, there is a
                                                30x spread between the
                                                median and the max.<br class="">
                                                <br class="">
                                                <img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
                                                <br class="">
                                                I was
                                                hoping (expecting) the
                                                first timestep would be
                                                where the long
                                                replies resided,
                                                but that turned out not
                                                to be the case. Below
                                                are traces from the 1
                                                RPC (blue) and 2 RPC
                                                 (orange) per
                                                timestep cases, 5 trials
                                                of 10 timesteps for each
                                                 case (normalized to fix
                                                 the same
                                                 x-axis):<br class="">
                                                <br class="">
&lt;PastedGraphic-6.png&gt;<br class="">
                                                <br class="">
                                                What strikes me is how
                                                consistent these
                                                results are
                                                across trials. For the 1
                                                RPC per timestep
                                                case, the 3rd and 7th
                                                timestep are consistently slow (and
                                                the rest are fast).
                                                For the 2 RPC
                                                per timestep case, the
                                                2nd and 4th timestep are
                                                always slow
                                                 and sometimes the
                                                 10th is.
                                                These results
                                                are repeatable with very
                                                rare variation.<br class="">
                                                <br class="">
                                                For the single RPC case,
                                                I recorded some timers
                                                on the server side, and
                                                attempted to
                                                overlay them with
                                                the client side (there
                                                is some unknown offset,
                                                but probably on the
                                                order of 10s
                                                of microseconds at
                                                worst, given
                                                the pattern):<br class="">
                                                <br class="">
&lt;PastedGraphic-7.png&gt;<br class="">
                                                <br class="">
                                                I blew up the first
                                                few timesteps of one of
                                                the trials:<br class="">
&lt;PastedGraphic-8.png&gt;<br class="">
                                                <br class="">
                                                The different colors
                                                are different segments
                                                of the handler, but
                                                there doesn’t seem to be
anything too interesting going on inside the handler. So it looks like
                                                the time is
                                                being introduced before
                                                the 3rd RPC
                                                 handler starts, based on
                                                 where the
                                                 gap appears on
                                                the server side.<br class="">
                                                <br class="">
                                                To try and isolate
                                                any dataspaces-specific behavior,
                                                I created a pure Margo
                                                test case that
                                                just sends a single rpc
                                                of the same size
                                                 as dataspaces iteratively, where
                                                the server side does an
                                                8-byte bulk
                                                transfer initiated
                                                by the server, and sends
                                                a response. The results
                                                are similar, except that
                                                it is now the 4th and
                                                8th timestep that are
                                                slow (and the
                                                first timestep is VERY
                                                long, presumably because
                                                rxm communication state
                                                is
being established. DataSpaces has an earlier RPC in its init that
                                                was absorbing
                                                this latency).<br class="">
                                                <br class="">
                                                I got
                                                margo profiling results
                                                for this test case:<br class="">
                                                <br class="">
                                                ```<br class="">
                                                3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
                                                0xa2a1,term_rpc<br class="">
                                                0x27b5,put_rpc<br class="">
                                                0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
                                                ```<br class="">
                                                <br class="">
                                                So I guess my question
                                                at this point is, is
                                                there
                                                any sensible reason why
                                                the 4th and 8th RPC sent
                                                would have a
                                                long response
                                                time? I think
                                                I’ve cleared my code on
                                                the client side and
                                                server side, so
                                                it appears to be latency
                                                being introduced
                                                by Margo, LibFabric, Argobots,
                                                or the underlying OS. I
                                                do see long
                                                timesteps occasionally after
                                                this (perhaps
                                                every 20-30 timesteps)
                                                but these are
                                                not consistent.<br class="">
                                                <br class="">
                                                One last detail: this
                                                does not happen
                                                 on Summit. On Summit, I
                                                see about
                                                5-7x worse single-RPC performance (250-350 microseconds per
                                                RPC), but without
                                                the intermittent long timesteps.<br class="">
                                                <br class="">
                                                I can provide the
                                                minimal test case if it
                                                would be helpful. I
                                                am using APEX
                                                for timing results,
                                                and the
                                                following dependencies with
                                                Spack:<br class="">
                                                <br class="">
                                                <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>  <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>  <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>  rdma-core@20<br class="">
                                                <br class="">
                                                Thanks,<br class="">
                                                Philip<br class="">
                                                <br class="">
                                                <br class="">
                                                <br class="">
                                                <br class="">
                                                <div class="">_______________________________________________</div>
                                                <div class="">mochi-devel
                                                  mailing list</div>
                                                <br class="Apple-interchange-newline">
                                                <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                                                <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EiX9%2B4KlSbKB7w2ov4OZtE7jCugYdchESAfX0D%2B%2FK7c%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="hT2kkjAIf/zHJhvtfpngbaSAG6BZQ0v653UkY5DV3lvADy7nxpFwHs9680ZLFYC4A/ZkOaDhVaoVT0dhguoFth03P3pU/ankBuXMiNnP4KyfpClP97DwHHcU9nwM04kJb9eamI6B/jo5QLEuZAQ97t2SUt+s9UTzpXLcXc/6/7I=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                                                <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vyMI4AQ%2BbXY4WVJocNVN%2BYJp6H1tYsP7meHgbjriejE%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="ncm2uVm0Ugenklbk5rqBtY39R8xrytsxPVYOdy2NqnuC3nOSGasq7//KT8rW9cbT43LvCOivxAxgLyWaANsZOv0dwLdcqrgqnldIxtfDSfJzCXFTgmgI++3ODNu7UIcNUeMtmBw8YYusppzZlNxvPHAunqN44IwZIk9lYE5wtW0=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                                              </blockquote>
_______________________________________________<br class="">
                                              mochi-devel mailing list<br class="">
                                              <a class="moz-txt-link-abbreviated
                                                moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411624266%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=oVLJaNkVau19KFuQsfeEpIkYUBvfpHcNBiMwy39eYiE%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="eo+nqLYWCkMMdToapGKW5gCChdEC+hnMkybaq9TTMc+BLlej8fWIrt1WeKWex1ubGz/Kolyb76ecsFMW3CL4Je2n/gnoO704YF3Or7fc73FFazgqUg3EPFQRAx3+zz5WR7a1+QiDa2HcoGRw4/KG7diKB6bVs37W+tX6YsHhBJU=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411634259%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=XjXCEsV2TNbLfPuH7iwMuolSCtaxnJZksVRu7V%2FfGNw%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="euRQ8SoMMl/IdjKCp7PxhwDgnaOLyG1nlSdl2QPAU17a7nYFM9JBPnoh1O/jSGuGv0+KkUsZMdBbbwg0ENFlNOUiYCxv1aPqeW0whx609AFP7y+OEMFUuRm9Ir3OvZlY4jzkE5B2v68VbQSLyROuz/2i8WR42H/ni5Na2Gn9noc=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                                            </blockquote>
                                            <br class="">
                                          </blockquote>
                                        </blockquote>
                                        <br class="">
                                      </blockquote>
                                    </blockquote>
                                    <br class="">
                                  </blockquote>
                                </blockquote>
                                <br class="">
                              </blockquote>
                            </blockquote>
                            <br class="">
                          </blockquote>
                        </blockquote>
                        <br class="">
                      </span></blockquote>
                  </div>
                </div>
              </blockquote>
            </div>
            <br class="">
          </div>
        </div>
      </blockquote>
    </blockquote>
  </body>
</html>