<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
<p>And one last follow-up here (thanks to a ton of help off-list
  from Philip!).  The core problem is a bad interaction between the
  Argobots memory allocator, the stack size used by Margo, and
  transparent huge page (THP) handling on some systems.  It can
  result in newly allocated memory taking several milliseconds to
  page in for use, at very inconvenient times.<br>
</p>
    <p>You can work around it on affected systems by using
      prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) to disable automatic THP
      activation.</p>
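<p>For reference, here is a minimal sketch of that workaround
  (assuming a Linux kernel new enough to support PR_SET_THP_DISABLE,
  i.e. 3.15 or later); the call would go early in server startup,
  before Margo and Argobots allocate their ULT stacks:</p>
<pre>
#include &lt;sys/prctl.h&gt;
#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    /* Opt this process out of automatic transparent huge page
     * promotion so newly allocated ULT stacks fault in as regular
     * pages instead of stalling on THP handling. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_THP_DISABLE)");

    /* ... margo_init() and the rest of the service setup ... */
    return 0;
}
</pre>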
<p>We'll pursue a longer-term solution in the form of a custom stack
      allocator for Margo; that effort will be tracked on GitHub at
      <a class="moz-txt-link-freetext" href="https://github.com/mochi-hpc/mochi-margo/issues/166">https://github.com/mochi-hpc/mochi-margo/issues/166</a>.</p>
    <p>thanks,</p>
    <p>-Phil<br>
    </p>
    <div class="moz-cite-prefix">On 11/10/21 7:45 PM, Phil Carns wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:0d9a618a-6775-7100-ca16-6ff22529956a@mcs.anl.gov">
      
      <p>Just to follow up for folks on the list: we've made some
        headway on isolating the source of the delays.  It isn't network
        specific, but rather an issue in the Argobots ULT creation path
        on some platforms.  It's being tracked now at <a class="moz-txt-link-freetext" href="https://github.com/pmodels/argobots/issues/369" moz-do-not-send="true">https://github.com/pmodels/argobots/issues/369</a>
        with a standalone Argobots reproducer.  It should be a lot
        easier to hunt down from this point.<br>
      </p>
      <p>Some details:</p>
      <ul>
        <li>This only happens on particular machines (Frontera@TACC and
          Cooley@ALCF for example) but we don't know why just yet.<br>
        </li>
<li>It only happens if you use dedicated threads (execution
          streams) for handling RPCs, which is a common configuration
          for many Mochi data services.</li>
        <li>The test case also happens to highlight that we have some
          memory allocation/caching tuning work to do even on platforms
          that behave correctly.  We have a lot of options on that
          front, but we'll circle back to it later once the major
          platform-specific performance bug is resolved.</li>
      </ul>
      <p>thanks,</p>
      <p>-Phil<br>
      </p>
      <blockquote type="cite" cite="mid:7B252C38-02B7-40EE-A799-7856565F7C0E@rutgers.edu"> One
        more thing: I tried the same test case with a non-trivial size
        for the bulk pull (8MB) just to shift the timing some. I still
        see the extra time appear on the 4th and 8th timestep. I had
        wondered if the issue was that the timestep period was small
        enough that some background task was not completing in time, but
        that does not seem to be the case:
        <div class=""><br class="">
          <div class="">
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000,
              "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000,
              "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000,
              "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000,
              "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000,
              "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000,
              "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000,
              "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000,
              "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000,
              "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>
            <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000,
              "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>
            <div><br class="">
              <blockquote type="cite" class="">
                <div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns
                  <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                  wrote:</div>
                <br class="Apple-interchange-newline">
                <div class="">
                  <div class="">
                    <p class=""><br class="">
                    </p>
                    <div class="moz-cite-prefix">On 11/5/21 12:23 PM,
                      Philip Davis wrote:<br class="">
                    </div>
                    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> That’s a good find. I’m reading through
                      the code and documentation, and I’m having a
                      little trouble understanding what the difference
                      between 0 and -1 for that last argument is when
                      the second to last argument is 0. I see in
                      documentation:
                      <div class=""><br class="">
                      </div>
                      <div class=""><span style="caret-color: rgb(64,
                          64, 64); color: rgb(64, 64, 64); font-family:
                          Lato, proxima-nova, "Helvetica
                          Neue", Arial, sans-serif; font-size:
                          16px; background-color: rgb(252, 252, 252);" class="">The third argument indicates whether
                          an Argobots execution stream (ES) should be
                          created to run the Mercury progress loop. If
                          this argument is set to 0, the progress loop
                          is going to run in the context of the main ES
                          (this should be the standard scenario, unless
                          you have a good reason for not using the main
                          ES, such as the main ES using MPI primitives
                          that could block the progress loop). A value
                          of 1 will make Margo create an ES to run the
                          Mercury progress loop. The fourth argument is
                          the number of ES to create and use for
                          executing RPC handlers. A value of 0 will make
                          Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing: border-box; font-family:
                          SFMono-Regular, Menlo, Monaco, Consolas,
                          "Liberation Mono", "Courier
                          New", Courier, monospace; white-space:
                          nowrap; max-width: 100%; background-color:
                          rgb(255, 255, 255); border: 1px solid rgb(225,
                          228, 229); padding: 2px 5px; color: rgb(231,
                          76, 60); overflow-x: auto;"><span class="pre" style="box-sizing: border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color:
                          rgb(64, 64, 64); font-family: Lato,
                          proxima-nova, "Helvetica Neue",
                          Arial, sans-serif; font-size: 16px;
                          background-color: rgb(252, 252, 252);" class="">. A value of -1 will make Margo
                          execute the RPCs in the ES running the
                          progress loop. A positive value will make
                          Margo create new ESs to run the RPCs.</span></div>
                      <div class=""><font class="" size="3" face="Lato,
                          proxima-nova, Helvetica Neue, Arial,
                          sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64);
                            background-color: rgb(252, 252, 252);" class=""><br class="">
                          </span></font></div>
                      <span class="">What is the difference between the
                        ‘main ES’ (last two arguments 0,-1) and the 'ES
                        that called margo_init’  (last two arguments
                        0,0) in the absence of me creating new execution
                        streams? Or maybe I’m not interpreting the
                        documentation correctly?<br class="">
                      </span></blockquote>
                    <p class=""><br class="">
                    </p>
                    <p class="">You are right, those are equivalent when
                      the next to last argument is 0 :)  The main ES and
                      progress thread ES are one and the same in that
                      case, so the RPCs go to the same place either way.</p>
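                    <p class="">For concreteness, a quick sketch of the
                      combinations being discussed (server side; the
                      address string and MARGO_SERVER_MODE here are just
                      a typical example, not the exact call from the
                      test case):</p>
<pre class="">
#include &lt;margo.h&gt;

margo_instance_id mid;

/* 0, 0: progress loop and RPC handlers both run in the caller's (main) ES */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 0);

/* 0, -1: no progress ES is created and RPCs run in the ES driving the
 * progress loop -- which is again the main ES, hence the equivalence */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1);

/* 1, 2: dedicated progress ES plus two dedicated RPC handler ESs */
mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 1, 2);
</pre>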
                    <p class="">I've narrowed down a little further and
                      found that the stalls occur when there are 2
                      dedicated RPC ESs but not when there is just 1
                      dedicated RPC ES.  That isolates the problem
                      slightly further, in that it's not just some cost
                      associated with relaying ULTs to another thread
                      (that's happening every time in the 1 handler case
                      too), but something that only happens when
                      multiple ESes could potentially service an RPC.</p>
                    <p class="">thanks,</p>
                    <p class="">-Phil<br class="">
                      <span class=""></span></p>
                    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> <span class="">
                        <blockquote type="cite" class="">On Nov 5, 2021,
                          at 10:29 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                          wrote:<br class="">
                          <br class="">
                          Srinivasan Ramesh (U. Oregon) has done some
                          work on fine-grained RPC component timing, but
                          it's not in the mainline margo tree so we'll
                          need to do a little more work to look at it.<br class="">
                          <br class="">
                          In the meantime, on a hunch, I found that I can
                          make the latency consistent on Cooley by
                          altering the margo_init() arguments to
                          be (..., 0, -1) in server.c (meaning that no
                          additional execution streams are used at all;
                          all Mercury progress and all RPC handlers are
                          executed using user-level threads in
                          the process's primary execution stream (OS
                          thread)).<br class="">
                          <br class="">
                          It's expected that there would be some more
                          jitter jumping across OS threads for RPC
                          handling, but it shouldn't be that
                          extreme, regular, or system-specific.<br class="">
                          <br class="">
                          Thanks again for the test case and the Apex
                          instrumentation; this is the sort of thing
                          that's normally really hard to isolate.<br class="">
                          <br class="">
                          thanks,<br class="">
                          <br class="">
                          -Phil<br class="">
                          <br class="">
                          On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
                          <blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class=""> That’s extremely interesting. <br class="">
                            <br class="">
                            Are there any internal timers in Margo that
                            can tell what the delay was between the
                            server’s progress thread queueing the rpc
                            and the handler thread starting to handle
                             it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
                            it seems to me like that is the most likely
                            place for non-deterministic delay to be
                            introduced by argobots in the client ->
                            server direction.<br class="">
                            <br class="">
                            I just ran a quick test where I changed the
                            number of handler threads to 5, and I saw no
                            change in behavior (still 4 and 8, not 5
                            and 10).<br class="">
                            <br class="">
                            <br class="">
                            <blockquote type="cite" class="">On Nov 4,
                              2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                <carns@mcs.anl.gov></a> wrote:<br class="">
                              <br class="">
                              I have some more interesting info to
                              share from trying a few different
                              configurations.<br class="">
                              <br class="">
                              sm (on my laptop) and ofi+gni (on
                              theta) do not exhibit this behavior; they
                              have consistent performance across RPCs.<br class="">
                              <br class="">
                              ofi+verbs (on cooley) shows the same
                              thing you were seeing; the 4th and 8th
                              RPCs are slow.<br class="">
                              <br class="">
                              Based on the above, it would sound like
                              a problem with the libfabric/verbs path.
                               But on Jerome's suggestion I tried some
                               other supported transports on cooley as
                               well.  In particular I ran the same
                               benchmark (the same build in fact; I
                               just compiled in support for multiple
                               transports and cycled through them in a
                               single job script with runtime options)
                               with these combinations:<br class="">
                              <br class="">
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+verbs<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+tcp<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                ofi+sockets<br class="">
                              </div>
                              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                                bmi+tcp<br class="">
                              </div>
                              All of them show the same thing!  4th
                              and 8th RPCs are at least an order of
                              magnitude slower than the other RPCs.
                               That was a surprising
                              result.  The bmi+tcp one isn't even using
                              libfabric at all, even though they are all
                              using the same underlying hardware.<br class="">
                              <br class="">
                              I'm not sure what to make of that
                              yet.  Possibly something with threading or
                              signalling?<br class="">
                              <br class="">
                              thanks,<br class="">
                              <br class="">
                              -Phil<br class="">
                              <br class="">
                              On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
                              <blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class=""> I’m glad you were able to
                                reproduce it on a different system,
                                thanks for letting me know. I’m not sure
                                what the overlaps between Frontera
                                and Cooley are (that aren’t shared by
                                Summit); a quick look shows they are
                                 both Intel, and both FDR, but there are
                                 probably more salient details.<br class="">
                                <br class="">
                                <blockquote type="cite" class="">On Nov
                                  2, 2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                    <carns@mcs.anl.gov></a> wrote:<br class="">
                                  <br class="">
                                  Ok.  Interesting.  I didn't see
                                  anything unusual in the timing on
                                  my laptop with sm (other than it being
                                  a bit slow, but I wasn't tuning
                                  or worrying about core affinity
                                  or anything).  On Cooley, a somewhat
                                  older Linux cluster with InfiniBand, I
                                  see the 4th and 8th RPC delay you were
                                  talking about:<br class="">
                                  <br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                  <br class="">
                                  (assuming the first is high due to
                                  connection establishment)<br class="">
                                  <br class="">
                                  I'll check some
                                  other systems/transports, but I wanted
                                  to go ahead and share that I've been
                                  able to reproduce what you were
                                  seeing.<br class="">
                                  <br class="">
                                  thanks,<br class="">
                                  <br class="">
                                  -Phil<br class="">
                                  <br class="">
                                  On 11/2/21 1:49 PM, Philip Davis
                                  wrote:<br class="">
                                  <blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class=""> Glad that’s working now.<br class="">
                                    <br class="">
                                    It is the put_wait events, and “dur”
                                    is the right field. Those units
                                    are microseconds.<br class="">
                                    <br class="">
                                    <br class="">
                                    <blockquote type="cite" class="">On
                                      Nov 2, 2021, at 1:12 PM, Phil
                                      Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                        <carns@mcs.anl.gov></a> wrote:<br class="">
                                      <br class="">
                                      Thanks Philip, the "=
                                      {0};" initialization of
                                      that struct got me going.<br class="">
                                      <br class="">
                                      I can run the test case now and it
                                      is producing output in the client
                                      and server perf dirs.  Just to
                                      sanity check what to look for,
                                      I think the problem should be
                                      exhibited in the "put_wait" or
                                      maybe "do_put" trace events on the
                                      client?  For example on my laptop
                                      I see this:<br class="">
                                      <br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
                                      do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                      <br class="">
                                      I should look at the "dur" field
                                      right?  What are the units
                                      on that?<br class="">
                                      <br class="">
                                      I'll see if I can run this on a
                                      "real" system shortly.<br class="">
                                      <br class="">
                                      thanks!<br class="">
                                      <br class="">
                                      -Phil<br class="">
                                      <br class="">
                                      On 11/2/21 12:11 PM, Philip Davis
                                      wrote:<br class="">
                                      <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class=""> Hi Phil,<br class="">
                                        <br class="">
                                         Sorry the data structures are
                                         like that; I wanted to preserve
                                         as much of the RPC size and
                                         ordering as possible, in case it
                                         ended up being important.<br class="">
                                        <br class="">
                                        I’m surprised in.odsc.size is
                                        troublesome, as I set
                                        in.odsc.size with the line
                                        `in.odsc.size = sizeof(odsc);`.
                                        I’m not sure what could
                                        be corrupting that value in the
                                        meantime.<br class="">
                                        <br class="">
                                        I don’t set
                                        in.odsc.gdim_size (which was an
                                        oversight, since that’s
                                        non-zero normally), so I’m
                                        less surprised that’s an issue.
                                        I thought I initialized `in`
                                        to zero, but I see I didn’t do
                                        that after all.<br class="">
                                        <br class="">
                                        Maybe change the line
                                        `bulk_gdim_t in;` to
                                        `bulk_gdim_t in = {0};`<br class="">
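                                         <br class="">
                                         In other words (a sketch using
                                         the test case's bulk_gdim_t
                                         input type; the handle and
                                         request variables are just
                                         placeholders):<br class="">
<pre class="">
bulk_gdim_t in = {0};        /* zero every field, including gdim_size */
in.odsc.size = sizeof(odsc); /* set only the fields the test cares about */
/* ... hg_bulk registration, then margo_iforward(handle, &amp;in, &amp;req); ... */
</pre>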
                                        <br class="">
                                        <br class="">
                                        <br class="">
                                        <blockquote type="cite" class="">On
                                          Nov 2, 2021, at 11:48 AM, Phil
                                          Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                          <br class="">
                                          Awesome, thanks Philip.
                                           It came through fine.<br class="">
                                          <br class="">
                                          I started by modifying the job
                                          script slightly to just run it
                                          on my laptop with sm (I
                                          wanted to make sure
                                          I understood the test case,
                                          and how to use apex,
                                           before trying elsewhere).  Does
                                           in.size need to be set
                                           in client.c?  For me there is
                                           a random value in that
                                           field and it is causing the
                                           encoder on the forward
                                           to attempt a
                                           very large allocation.  The
                                           same might be true
                                           of gdim_size if it got
                                           past that step.  I started
                                           to alter them but then I
                                           wasn't sure what
                                           the implications were.<br class="">
                                          <br class="">
                                          (fwiw I needed
                                          to include stdlib.h
                                          in common.h, but I've hit
                                          that a couple of times
                                          recently on codes
                                          that didn't previously generate warnings;
                                          I think something in Ubuntu
                                          has gotten strict about
                                          that recently).<br class="">
                                          <br class="">
                                          thanks,<br class="">
                                          <br class="">
                                          -Phil<br class="">
                                          <br class="">
                                          <br class="">
                                          <br class="">
                                          On 11/1/21 4:51 PM,
                                          Philip Davis wrote:<br class="">
                                          <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class=""> Hi Phil,<br class="">
                                            <br class="">
                                            I’ve attached
                                            the reproducer. I see the
                                            4th and 8th issue on
                                            Frontera, but
                                            not Summit. Hopefully
                                            it will build and run
                                            without too
                                            much modification. Let me
                                            know if there are any issues
                                            with running it (or if the
                                            anl listserv eats the
                                            tarball, which I kind of
                                            expect).<br class="">
                                            <br class="">
                                            Thanks,<br class="">
                                            Philip<br class="">
                                            <br class="">
                                            <br class="">
                                            <blockquote type="cite" class="">On Nov 1,
                                              2021, at 11:14 AM, Phil
                                              Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                              <br class="">
                                              Hi Philip,<br class="">
                                              <br class="">
                                              (FYI I think the first
                                              image didn't come through
                                              in your email, but I
                                              think the others
                                              are sufficient to get
                                              across what you
                                              are seeing)<br class="">
                                              <br class="">
                                              I don't have any idea
                                              what would cause that.
                                               The recently released libfabric 1.13.2 (available
                                              in spack from
                                              the mochi-spack-packages repo)
                                              includes some fixes to the
                                              rxm provider that could
                                              be relevant to Frontera
                                              and Summit, but nothing
                                              that aligns with what you
                                              are observing.<br class="">
                                              <br class="">
                                              If it were later in
                                              the sequence (much later)
                                              I would speculate
                                              that memory allocation/deallocation cycles
                                              were eventually causing
                                              a hiccup.  We've seen
                                              something like that in the
                                              past, and it's a
                                              theory that we could then
                                              test
                                              with alternative allocators like jemalloc.  That's
not memory allocation jitter that early in the run though.<br class="">
                                              <br class="">
                                              Please do share
                                              your reproducer if you
                                              don't mind!  We can try a
                                              few systems here to at
                                              least isolate if it is
                                              something peculiar
                                              to the InfiniBand path or
                                              if there is a more
                                              general problem in Margo.<br class="">
                                              <br class="">
                                              thanks,<br class="">
                                              <br class="">
                                              -Phil<br class="">
                                              <br class="">
                                              On 10/29/21 3:20 PM,
                                              Philip Davis wrote:<br class="">
                                              <blockquote type="cite" class="">Hello,<br class="">
                                                <br class="">
                                                I apologize in advance
                                                for the winding nature
                                                of this email. I’m
                                                not sure how to ask
                                                my question without explaining
                                                the story of my results
                                                some.<br class="">
                                                <br class="">
                                                I’m doing
                                                some characterization of
                                                our server performance
                                                 under load, and I have
                                                 a performance quirk that I
                                                 wanted to run
                                                 by people to see if it
                                                 makes sense. My testing
                                                so far has been
                                                to iteratively send
                                                batches of RPCs
                                                using margo_iforward,
                                                and then measure
                                                the wait time until they
                                                are all complete. On the
                                                server side,
                                                handling the
                                                RPC includes
                                                a margo_bulk_transfer as
                                                a pull initiated
                                                on the server to pull
                                                (for now) 8 bytes.
                                                The payload of the RPC
                                                request is about
                                                500 bytes, and
                                                the response payload is
                                                4 bytes.<br class="">
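                                                 <br class="">
                                                 (For concreteness, the
                                                 server-side handler follows
                                                 the usual Margo pull
                                                 pattern; a rough sketch,
                                                 with the input type, its
                                                 bulk field, and the handler
                                                 name as illustrative
                                                 stand-ins rather than the
                                                 actual code:)<br class="">
<pre class="">
#include &lt;margo.h&gt;

static void put_rpc_ult(hg_handle_t h)
{
    margo_instance_id mid = margo_hg_handle_get_instance(h);
    const struct hg_info *info = margo_get_info(h);

    put_in_t in;                      /* carries the client's bulk handle */
    margo_get_input(h, &amp;in);

    char buf[8];                      /* 8-byte payload pulled from client */
    void *ptr = buf;
    hg_size_t len = sizeof(buf);
    hg_bulk_t local;
    margo_bulk_create(mid, 1, &amp;ptr, &amp;len, HG_BULK_WRITE_ONLY, &amp;local);
    margo_bulk_transfer(mid, HG_BULK_PULL, info-&gt;addr, in.bulk, 0,
                        local, 0, len);

    int32_t ret = 0;                  /* 4-byte response payload */
    margo_respond(h, &amp;ret);

    margo_bulk_free(local);
    margo_free_input(h, &amp;in);
    margo_destroy(h);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
</pre>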
                                                <br class="">
                                                I’ve isolated my results
                                                down to one server rank
                                                and one client
                                                rank, because it’s an
                                                easier starting point to
                                                reason from. Below
                                                is a graph of some of
                                                my initial results.
                                                These results are from
                                                Frontera. The
                                                median times are good (a
                                                single RPC takes on
                                                 the order of 10s
                                                 of microseconds, which
seems fantastic). However, the outliers are fairly high (note the
                                                log scale of
                                                the y-axis). With only
                                                one
                                                RPC per timestep, for
                                                example, there is a
                                                30x spread between the
                                                median and the max.<br class="">
                                                <br class="">
                                                <img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
                                                <br class="">
                                                I was
                                                hoping (expecting) the
                                                first timestep would be
                                                where the long
                                                replies resided,
                                                but that turned out not
                                                to be the case. Below
                                                are traces from the 1
                                                RPC (blue) and 2 RPC
                                                 (orange) per
                                                timestep cases, 5 trials
                                                of 10 timesteps for each
                                                 case (normalized to fix
                                                 the same
                                                 x-axis):<br class="">
                                                <br class="">
&lt;PastedGraphic-6.png&gt;<br class="">
                                                <br class="">
                                                What strikes me is how
                                                consistent these
                                                results are
                                                across trials. For the 1
                                                RPC per timestep
                                                case, the 3rd and 7th
                                                timestep are consistently slow (and
                                                the rest are fast).
                                                For the 2 RPC
                                                per timestep case, the
                                                2nd and 4th timestep are
                                                always slow
                                                 and sometimes the
                                                 10th is.
                                                These results
                                                are repeatable with very
                                                rare variation.<br class="">
                                                <br class="">
                                                For the single RPC case,
                                                I recorded some timers
                                                on the server side, and
                                                attempted to
                                                overlay them with
                                                the client side (there
                                                is some unknown offset,
                                                but probably on the
                                                order of 10s
                                                of microseconds at
                                                worst, given
                                                the pattern):<br class="">
                                                <br class="">
&lt;PastedGraphic-7.png&gt;<br class="">
                                                <br class="">
                                                I blew up the first
                                                few timesteps of one of
                                                the trials:<br class="">
&lt;PastedGraphic-8.png&gt;<br class="">
                                                <br class="">
                                                The different colors
                                                are different segments
                                                of the handler, but
                                                there doesn’t seem to be
anything too interesting going on inside the handler. So it looks like
                                                the time is
                                                being introduced before
                                                the 3rd RPC
                                                 handler starts, based on
                                                 where the
                                                 gap appears on
                                                the server side.<br class="">
                                                <br class="">
                                                To try and isolate
                                                any dataspaces-specific behavior,
                                                I created a pure Margo
                                                test case that
                                                just sends a single rpc
                                                of the same size
                                                 as dataspaces iteratively, where
                                                the server side does an
                                                8-byte bulk
                                                transfer initiated
                                                by the server, and sends
                                                a response. The results
                                                are similar, except that
                                                it is now the 4th and
                                                8th timestep that are
                                                slow (and the
                                                first timestep is VERY
                                                long, presumably because
                                                rxm communication state
                                                is
being established. DataSpaces has an earlier RPC in its init that
                                                was absorbing
                                                this latency).<br class="">
                                                <br class="">
                                                I got
                                                margo profiling results
                                                for this test case:<br class="">
                                                <br class="">
                                                ```<br class="">
                                                3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
                                                0xa2a1,term_rpc<br class="">
                                                0x27b5,put_rpc<br class="">
                                                0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
                                                ```<br class="">
                                                <br class="">
                                                So I guess my question
                                                at this point is, is
                                                there
                                                any sensible reason why
                                                the 4th and 8th RPC sent
                                                would have a
                                                long response
                                                time? I think
                                                I’ve cleared my code on
                                                the client side and
                                                server side, so
                                                it appears to be latency
                                                being introduced
                                                by Margo, LibFabric, Argobots,
                                                or the underlying OS. I
                                                do see long
                                                timesteps occasionally after
                                                this (perhaps
                                                every 20-30 timesteps)
                                                but these are
                                                not consistent.<br class="">
                                                <br class="">
                                                One last detail: this
                                                does not happen
                                                 on Summit. On Summit, I
                                                see about
                                                5-7x worse single-RPC performance (250-350 microseconds per
                                                RPC), but without
                                                the intermittent long timesteps.<br class="">
                                                <br class="">
                                                I can provide the
                                                minimal test case if it
                                                would be helpful. I
                                                am using APEX
                                                for timing results,
                                                and the
                                                following dependencies with
                                                Spack:<br class="">
                                                <br class="">
                                                <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>  <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>  <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>  rdma-core@20<br class="">
                                                <br class="">
                                                Thanks,<br class="">
                                                Philip<br class="">
                                                <br class="">
                                                <br class="">
                                                <br class="">
                                                <br class="">
                                                <div class="">_______________________________________________</div>
                                                <div class="">mochi-devel
                                                  mailing list</div>
                                                <br class="Apple-interchange-newline">
                                                <a class="moz-txt-link-abbreviated
                                                  moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                                                <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EiX9%2B4KlSbKB7w2ov4OZtE7jCugYdchESAfX0D%2B%2FK7c%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="hT2kkjAIf/zHJhvtfpngbaSAG6BZQ0v653UkY5DV3lvADy7nxpFwHs9680ZLFYC4A/ZkOaDhVaoVT0dhguoFth03P3pU/ankBuXMiNnP4KyfpClP97DwHHcU9nwM04kJb9eamI6B/jo5QLEuZAQ97t2SUt+s9UTzpXLcXc/6/7I=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                                                <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vyMI4AQ%2BbXY4WVJocNVN%2BYJp6H1tYsP7meHgbjriejE%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="ncm2uVm0Ugenklbk5rqBtY39R8xrytsxPVYOdy2NqnuC3nOSGasq7//KT8rW9cbT43LvCOivxAxgLyWaANsZOv0dwLdcqrgqnldIxtfDSfJzCXFTgmgI++3ODNu7UIcNUeMtmBw8YYusppzZlNxvPHAunqN44IwZIk9lYE5wtW0=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                                              </blockquote>
_______________________________________________<br class="">
                                              mochi-devel mailing list<br class="">
                                              <a class="moz-txt-link-abbreviated
                                                moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411624266%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=oVLJaNkVau19KFuQsfeEpIkYUBvfpHcNBiMwy39eYiE%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="eo+nqLYWCkMMdToapGKW5gCChdEC+hnMkybaq9TTMc+BLlej8fWIrt1WeKWex1ubGz/Kolyb76ecsFMW3CL4Je2n/gnoO704YF3Or7fc73FFazgqUg3EPFQRAx3+zz5WR7a1+QiDa2HcoGRw4/KG7diKB6bVs37W+tX6YsHhBJU=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411634259%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=XjXCEsV2TNbLfPuH7iwMuolSCtaxnJZksVRu7V%2FfGNw%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="euRQ8SoMMl/IdjKCp7PxhwDgnaOLyG1nlSdl2QPAU17a7nYFM9JBPnoh1O/jSGuGv0+KkUsZMdBbbwg0ENFlNOUiYCxv1aPqeW0whx609AFP7y+OEMFUuRm9Ir3OvZlY4jzkE5B2v68VbQSLyROuz/2i8WR42H/ni5Na2Gn9noc=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                                            </blockquote>
                                            <br class="">
                                          </blockquote>
                                        </blockquote>
                                        <br class="">
                                      </blockquote>
                                    </blockquote>
                                    <br class="">
                                  </blockquote>
                                </blockquote>
                                <br class="">
                              </blockquote>
                            </blockquote>
                            <br class="">
                          </blockquote>
                        </blockquote>
                        <br class="">
                      </span></blockquote>
                  </div>
                </div>
              </blockquote>
            </div>
            <br class="">
          </div>
        </div>
      </blockquote>
    </blockquote>
  </body>
</html>