<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Just to follow up for folks on the list: we've made some headway
      on isolating the source of the delays.  It isn't network-specific,
      but rather an issue in the Argobots ULT creation path on some
      platforms.  It's being tracked now at
      <a class="moz-txt-link-freetext" href="https://github.com/pmodels/argobots/issues/369">https://github.com/pmodels/argobots/issues/369</a> with a standalone
      Argobots reproducer.  It should be a lot easier to hunt down from
      this point.<br>
    </p>
    <p>Some details:</p>
    <ul>
      <li>This only happens on particular machines (Frontera@TACC and
        Cooley@ALCF for example) but we don't know why just yet.<br>
      </li>
      <li>It only happens if you use dedicated threads (execution
        streams) for handling RPCs (see the sketch below this list).  This
        is common for a lot of Mochi data services, though.</li>
      <li>The test case also happens to highlight that we have some
        memory allocation/caching tuning work to do even on platforms
        that behave correctly.  We have a lot of options on that front,
        but we'll circle back to it later once the major
        platform-specific performance bug is resolved.</li>
    </ul>
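    <p>For reference, "dedicated threads for handling RPCs" means a server
      initialization roughly like the sketch below.  This is only an
      illustrative snippet (not the exact server.c from the test case); the
      last two margo_init() arguments control whether a separate progress ES
      is created and how many ESs are created for running RPC handlers:</p>
    <pre>#include &lt;margo.h&gt;

int main(void)
{
    /* Illustrative only: 3rd arg = 1 asks for a dedicated ES to run the
     * Mercury progress loop, 4th arg = 2 asks for two dedicated ESs for
     * running RPC handlers.  This is the style of configuration where the
     * stalls show up; (..., 0, -1) or (..., 0, 0), i.e. everything in the
     * primary ES, does not exhibit them. */
    margo_instance_id mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 1, 2);
    if (mid == MARGO_INSTANCE_NULL) return -1;

    /* ... register RPC handlers here ... */

    margo_wait_for_finalize(mid);
    return 0;
}</pre>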
    <p>thanks,</p>
    <p>-Phil<br>
    </p>
    <blockquote type="cite" cite="mid:7B252C38-02B7-40EE-A799-7856565F7C0E@rutgers.edu">
      
      One more thing: I tried the same test case with a non-trivial size
      for the bulk pull (8MB) just to shift the timing some. I still see
      the extra time appear on the 4th and 8th timestep. I had wondered
      if the issue was that the timestep period was small enough that
      some background task was not completing in time, but that does not
      seem to be the case:
      <div class=""><br class="">
        <div class="">
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000,
            "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000,
            "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000,
            "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000,
            "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000,
            "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000,
            "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000,
            "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000,
            "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000,
            "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>
          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000,
            "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>
          <div><br class="">
            <blockquote type="cite" class="">
              <div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                wrote:</div>
              <br class="Apple-interchange-newline">
              <div class="">
                <div class="">
                  <p class=""><br class="">
                  </p>
                  <div class="moz-cite-prefix">On 11/5/21 12:23 PM,
                    Philip Davis wrote:<br class="">
                  </div>
                  <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">
                    That’s a good find. I’m reading through the code and
                    documentation, and I’m having a little trouble
                    understanding what the difference between 0 and -1
                    for that last argument is when the second to last
                    argument is 0. I see in documentation:
                    <div class=""><br class="">
                    </div>
                    <div class=""><span style="caret-color: rgb(64, 64,
                        64); color: rgb(64, 64, 64); font-family: Lato,
                        proxima-nova, "Helvetica Neue", Arial,
                        sans-serif; font-size: 16px; background-color:
                        rgb(252, 252, 252);" class="">The third argument
                        indicates whether an Argobots execution stream
                        (ES) should be created to run the Mercury
                        progress loop. If this argument is set to 0, the
                        progress loop is going to run in the context of
                        the main ES (this should be the standard
                        scenario, unless you have a good reason for not
                        using the main ES, such as the main ES using MPI
                        primitives that could block the progress loop).
                        A value of 1 will make Margo create an ES to run
                        the Mercury progress loop. The fourth argument
                        is the number of ES to create and use for
                        executing RPC handlers. A value of 0 will make
                        Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing: border-box; font-family:
                        SFMono-Regular, Menlo, Monaco, Consolas,
                        "Liberation Mono", "Courier
                        New", Courier, monospace; white-space:
                        nowrap; max-width: 100%; background-color:
                        rgb(255, 255, 255); border: 1px solid rgb(225,
                        228, 229); padding: 2px 5px; color: rgb(231, 76,
                        60); overflow-x: auto;"><span class="pre" style="box-sizing: border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color:
                        rgb(64, 64, 64); font-family: Lato,
                        proxima-nova, "Helvetica Neue", Arial,
                        sans-serif; font-size: 16px; background-color:
                        rgb(252, 252, 252);" class="">. A value of -1
                        will make Margo execute the RPCs in the ES
                        running the progress loop. A positive value will
                        make Margo create new ESs to run the RPCs.</span></div>
                    <div class=""><font class="" size="3" face="Lato,
                        proxima-nova, Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color:
                          rgb(64, 64, 64); background-color: rgb(252,
                          252, 252);" class=""><br class="">
                        </span></font></div>
                    <span class="">What is the difference between the
                      ‘main ES’ (last two arguments 0,-1) and the 'ES
                      that called margo_init’  (last two arguments 0,0)
                      in the absence of me creating new execution
                      streams? Or maybe I’m not interpreting the
                      documentation correctly?<br class="">
                    </span></blockquote>
                  <p class=""><br class="">
                  </p>
                  <p class="">You are right, those are equivalent when
                    the next to last argument is 0 :)  The main ES and
                    progress thread ES are one and the same in that
                    case, so the RPCs go to the same place either way.</p>
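                  <p class="">For example (an illustrative sketch only, with
                    addr_str standing in for whatever address string is
                    passed in), these two calls behave identically:</p>
                  <pre class="">/* With the 3rd argument set to 0, the progress loop runs in the ES that
 * called margo_init (the main ES), so "the ES running the progress loop"
 * and "the ES that called margo_init" are the same thread: */
mid = margo_init(addr_str, MARGO_SERVER_MODE, 0, -1); /* RPCs in progress ES */
mid = margo_init(addr_str, MARGO_SERVER_MODE, 0,  0); /* RPCs in caller's ES */</pre>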
                  <p class="">I've narrowed down a little further and
                    found that the stalls occur when there are 2
                    dedicated RPC ESs but not when there is just 1
                    dedicated RPC ES.  That isolates the problem
                    slightly further, in that it's not just some cost
                    associated with relaying ULTs to another thread
                    (that's happening every time in the 1 handler case
                    too), but something that only happens when multiple
                    ESes could potentially service an RPC.</p>
                  <p class="">thanks,</p>
                  <p class="">-Phil<br class="">
                    <span class=""></span></p>
                  <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">
                    <span class="">
                      <blockquote type="cite" class="">On Nov 5, 2021,
                        at 10:29 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                        wrote:<br class="">
                        <br class="">
                        Srinivasan Ramesh (U. Oregon) has done some work
                        on fine-grained RPC component timing, but it's
                        not in the mainline margo tree so we'll need to
                        do a little more work to look at it.<br class="">
                        <br class="">
                        In the meantime, on a hunch, I found that I can
                        make the latency consistent on Cooley by
                        altering the margo_init() arguments to be (...,
                        0, -1) in server.c (meaning that no additional
                        execution streams are used at all; all Mercury
                        progress and all RPC handlers are executed using
                        user-level threads in the process's
                        primary execution stream (OS thread)).<br class="">
                        <br class="">
                        It's expected that there would be some more
                        jitter jumping across OS threads for RPC
                        handling, but it shouldn't be that
                        extreme, regular, or system-specific.<br class="">
                        <br class="">
                        Thanks again for the test case and the Apex
                        instrumentation; this is the sort of thing
                        that's normally really hard to isolate.<br class="">
                        <br class="">
                        thanks,<br class="">
                        <br class="">
                        -Phil<br class="">
                        <br class="">
                        On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
                        <blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">
                          That’s extremely interesting. <br class="">
                          <br class="">
                          Are there any internal timers in Margo that
                          can tell what the delay was between the
                          server’s progress thread queueing the rpc and
                          the handler thread starting to handle it? If
                          I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
                          it seems to me like that is the most likely
                          place for non-deterministic delay to be
                          introduced by argobots in the client ->
                          server direction.<br class="">
                          <br class="">
                          I just ran a quick test where I changed the
                          number of handler threads to 5, and I saw no
                          change in behavior (still 4 and 8, not 5
                          and 10).<br class="">
                          <br class="">
                          <br class="">
                          <blockquote type="cite" class="">On Nov 4,
                            2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                              <carns@mcs.anl.gov></a> wrote:<br class="">
                            <br class="">
                            I have some more interesting info to
                            share from trying a few different
                            configurations.<br class="">
                            <br class="">
                            sm (on my laptop) and ofi+gni (on theta) do
                            not exhibit this behavior; they have
                            consistent performance across RPCs.<br class="">
                            <br class="">
                            ofi+verbs (on cooley) shows the same
                            thing you were seeing; the 4th and 8th RPCs
                            are slow.<br class="">
                            <br class="">
                            Based on the above, it would sound like
                            a problem with the libfabric/verbs path.
                             But on Jerome's suggestion I tried some
                            other supported transports on cooley as
                            well.  In particular I ran the same
                            benchmark (the same build in fact; I
                            just compiled in support for multiple
                            transports and cycled through them in a
                            single job script with runtime options) with
                            these combinations:<br class="">
                            <br class="">
                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                              ofi+verbs<br class="">
                            </div>
                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                              ofi+tcp<br class="">
                            </div>
                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                              ofi+sockets<br class="">
                            </div>
                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                              bmi+tcp<br class="">
                            </div>
                            All of them show the same thing!  4th
                            and 8th RPCs are at least an order of
                            magnitude slower than the other RPCs.  That
                            was a surprising result.  The bmi+tcp one
                            isn't even using libfabric at all, even
                            though they are all using the same
                            underlying hardware.<br class="">
                            <br class="">
                            I'm not sure what to make of that
                            yet.  Possibly something with threading or
                            signalling?<br class="">
                            <br class="">
                            thanks,<br class="">
                            <br class="">
                            -Phil<br class="">
                            <br class="">
                            On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
                            <blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">
                              I’m glad you were able to reproduce it on
                              a different system, thanks for letting me
                              know. I’m not sure what the overlaps
                              between Frontera and Cooley are (that
                              aren’t shared by Summit); a quick look
                              shows they are both Intel, and both FDR,
                               but there are probably more salient details.<br class="">
                              <br class="">
                              <blockquote type="cite" class="">On Nov 2,
                                2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                  <carns@mcs.anl.gov></a> wrote:<br class="">
                                <br class="">
                                Ok.  Interesting.  I didn't see anything
                                unusual in the timing on my laptop with
                                sm (other than it being a bit slow, but
                                I wasn't tuning or worrying about core
                                affinity or anything).  On Cooley, a
                                somewhat older Linux cluster with
                                InfiniBand, I see the 4th and 8th RPC
                                delay you were talking about:<br class="">
                                <br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                <br class="">
                                (assuming the first is high due to
                                connection establishment)<br class="">
                                <br class="">
                                I'll check some
                                other systems/transports, but I wanted
                                to go ahead and share that I've been
                                able to reproduce what you were seeing.<br class="">
                                <br class="">
                                thanks,<br class="">
                                <br class="">
                                -Phil<br class="">
                                <br class="">
                                On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">
                                <blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">
                                  Glad that’s working now.<br class="">
                                  <br class="">
                                  It is the put_wait events, and “dur”
                                  is the right field. Those units
                                  are microseconds.<br class="">
                                  <br class="">
                                  <br class="">
                                  <blockquote type="cite" class="">On
                                    Nov 2, 2021, at 1:12 PM, Phil Carns
                                    <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
                                      <carns@mcs.anl.gov></a> wrote:<br class="">
                                    <br class="">
                                    Thanks Philip, the "=
                                    {0};" initialization of that struct
                                    got me going.<br class="">
                                    <br class="">
                                    I can run the test case now and it
                                    is producing output in the client
                                    and server perf dirs.  Just to
                                    sanity check what to look for,
                                    I think the problem should be
                                    exhibited in the "put_wait" or
                                    maybe "do_put" trace events on the
                                    client?  For example on my laptop I
                                    see this:<br class="">
                                    <br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
                                    do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                                    <br class="">
                                    I should look at the "dur" field
                                    right?  What are the units on that?<br class="">
                                    <br class="">
                                    I'll see if I can run this on a
                                    "real" system shortly.<br class="">
                                    <br class="">
                                    thanks!<br class="">
                                    <br class="">
                                    -Phil<br class="">
                                    <br class="">
                                    On 11/2/21 12:11 PM, Philip Davis
                                    wrote:<br class="">
                                    <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">
                                      Hi Phil,<br class="">
                                      <br class="">
                                      Sorry the data structures are
                                       like that; I wanted to preserve as
                                       much of the RPC size and ordering
                                       as I could, in case it ended up
                                       being important.<br class="">
                                      <br class="">
                                      I’m surprised in.odsc.size is
                                      troublesome, as I set in.odsc.size
                                      with the line `in.odsc.size
                                      = sizeof(odsc);`. I’m not sure
                                      what could be corrupting that
                                      value in the meantime.<br class="">
                                      <br class="">
                                      I don’t set
                                      in.odsc.gdim_size (which was an
                                      oversight, since that’s
                                      non-zero normally), so I’m
                                      less surprised that’s an issue. I
                                      thought I initialized `in`
                                      to zero, but I see I didn’t do
                                      that after all.<br class="">
                                      <br class="">
                                      Maybe change the line
                                      `bulk_gdim_t in;` to `bulk_gdim_t
                                      in = {0};`<br class="">
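                                       That is, something like:<br class="">
                                       <pre>bulk_gdim_t in = {0};        /* zero-init everything up front   */
in.odsc.size = sizeof(odsc); /* still set explicitly, as before */
/* fields never assigned afterwards (e.g. odsc.gdim_size) stay 0 */</pre>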
                                      <br class="">
                                      <br class="">
                                      <br class="">
                                      <blockquote type="cite" class="">On
                                        Nov 2, 2021, at 11:48 AM, Phil
                                        Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                        <br class="">
                                        Awesome, thanks Philip.  It came
                                        through fine.<br class="">
                                        <br class="">
                                        I started by modifying the job
                                        script slightly to just run it
                                        on my laptop with sm (I
                                        wanted to make sure I understood
                                        the test case, and how
                                        to use apex,
                                         before trying elsewhere).  Does
                                         in.size need to be set in
                                         client.c?  For me there is a
                                         random value in that field and it
                                         is causing the encoder on the
                                         forward to attempt a very large
                                         allocation.  The same might be
                                         true of gdim_size if it got past
                                         that step.  I started to alter
                                         them but then I wasn't sure what
                                         the implications were.<br class="">
                                        <br class="">
                                        (fwiw I needed
                                        to include stdlib.h in common.h,
                                        but I've hit that a couple
                                        of times recently on codes
                                        that didn't previously generate warnings;
                                        I think something in Ubuntu
                                        has gotten strict about
                                        that recently).<br class="">
                                        <br class="">
                                        thanks,<br class="">
                                        <br class="">
                                        -Phil<br class="">
                                        <br class="">
                                        <br class="">
                                        <br class="">
                                        On 11/1/21 4:51 PM, Philip Davis
                                        wrote:<br class="">
                                        <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
                                          Hi Phil,<br class="">
                                          <br class="">
                                          I’ve attached the reproducer.
                                          I see the 4th and 8th issue on
                                          Frontera, but
                                          not Summit. Hopefully it will
                                          build and run without too
                                          much modification. Let me know
                                          if there are any issues
                                          with running it (or if the
                                          anl listserv eats the
                                          tarball, which I kind of
                                          expect).<br class="">
                                          <br class="">
                                          Thanks,<br class="">
                                          Philip<br class="">
                                          <br class="">
                                          <br class="">
                                          <blockquote type="cite" class="">On Nov 1, 2021, at
                                            11:14 AM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
                                            <br class="">
                                            Hi Philip,<br class="">
                                            <br class="">
                                            (FYI I think the first
                                            image didn't come through
                                            in your email, but I
                                            think the others
                                            are sufficient to get
                                            across what you are seeing)<br class="">
                                            <br class="">
                                            I don't have any idea
                                            what would cause that.
                                             The recently released libfabric 1.13.2 (available
                                            in spack from
                                            the mochi-spack-packages repo)
                                            includes some fixes to the
                                            rxm provider that could
                                            be relevant to Frontera
                                            and Summit, but nothing
                                            that aligns with what you
                                            are observing.<br class="">
                                            <br class="">
                                            If it were later in
                                            the sequence (much later) I
                                            would speculate
                                            that memory allocation/deallocation cycles
                                            were eventually causing
                                            a hiccup.  We've seen
                                            something like that in the
                                            past, and it's a theory that
                                            we could then test
                                             with alternative allocators like jemalloc.  That
                                             shouldn't be memory allocation
                                             jitter that early in the run,
                                             though.<br class="">
                                            <br class="">
                                            Please do share
                                            your reproducer if you
                                            don't mind!  We can try a
                                            few systems here to at
                                            least isolate if it is
                                            something peculiar
                                            to the InfiniBand path or
                                            if there is a more
                                            general problem in Margo.<br class="">
                                            <br class="">
                                            thanks,<br class="">
                                            <br class="">
                                            -Phil<br class="">
                                            <br class="">
                                            On 10/29/21 3:20 PM,
                                            Philip Davis wrote:<br class="">
                                            <blockquote type="cite" class="">Hello,<br class="">
                                              <br class="">
                                              I apologize in advance
                                              for the winding nature of
                                              this email. I’m not sure
                                              how to ask
                                              my question without explaining
                                              the story of my results
                                              some.<br class="">
                                              <br class="">
                                              I’m doing
                                              some characterization of
                                              our server performance
                                               under load, and I have a
                                               performance quirk that I
                                               wanted to run by people to
                                               see if it makes sense.
                                              My testing so far has been
                                              to iteratively send
                                              batches of RPCs
                                              using margo_iforward, and
                                              then measure the wait time
                                              until they are
                                              all complete. On the
                                              server side, handling the
                                              RPC includes
                                              a margo_bulk_transfer as a
                                              pull initiated
                                              on the server to pull (for
                                              now) 8 bytes. The payload
                                              of the RPC request
                                              is about 500 bytes, and
                                              the response payload is
                                              4 bytes.<br class="">
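                                               The client loop is roughly
                                               the following (simplified;
                                               do_timestep and the batch
                                               plumbing here are just
                                               illustrative):<br class="">
                                               <pre>/* Sketch: forward a batch of put RPCs without blocking, then wait for all
 * of them; the wait loop at the end is the part being timed.  mid, svr_addr,
 * and put_id come from margo_init() / margo_addr_lookup() / MARGO_REGISTER()
 * done at startup. */
static void do_timestep(margo_instance_id mid, hg_addr_t svr_addr,
                        hg_id_t put_id, void *in, int batch)
{
    hg_handle_t   handle[batch];
    margo_request req[batch];

    for (int i = 0; i &lt; batch; i++) {
        margo_create(mid, svr_addr, put_id, &amp;handle[i]);
        margo_iforward(handle[i], in, &amp;req[i]); /* non-blocking forward */
    }
    for (int i = 0; i &lt; batch; i++) {
        margo_wait(req[i]);                      /* wait time measured here */
        margo_destroy(handle[i]);
    }
}</pre>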
                                              <br class="">
                                              I’ve isolated my results
                                              down to one server rank
                                              and one client
                                              rank, because it’s an
                                              easier starting point to
                                              reason from. Below
                                              is a graph of some of
                                              my initial results.
                                              These results are from
                                              Frontera. The median times
                                              are good (a single
                                              RPC takes on the order of
                                               10s of microseconds, which
                                              seems fantastic). However,
                                              the outliers are fairly
                                              high (note the log scale
                                              of the y-axis). With only
                                              one RPC per timestep, for
                                              example, there is a
                                              30x spread between the
                                              median and the max.<br class="">
                                              <br class="">
                                              <img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
                                              <br class="">
                                              I was
                                              hoping (expecting) the
                                              first timestep would be
                                              where the long
                                              replies resided, but that
                                              turned out not to be the
                                              case. Below are traces
                                              from the 1 RPC (blue) and
                                              2 RPC  (orange) per
                                              timestep cases, 5 trials
                                              of 10 timesteps for each
                                               case (normalized to fit the
                                               same x-axis):<br class="">
                                              <br class="">
<PastedGraphic-6.png><br class="">
                                              <br class="">
                                              What strikes me is how
                                              consistent these
                                              results are across trials.
                                              For the 1 RPC per timestep
                                              case, the 3rd and 7th
                                              timestep are consistently slow (and
                                              the rest are fast).
                                              For the 2 RPC per timestep
                                              case, the 2nd and 4th
                                              timestep are always slow
                                               and sometimes the 10th is.
                                               These results
                                              are repeatable with very
                                              rare variation.<br class="">
                                              <br class="">
                                              For the single RPC case,
                                              I recorded some timers on
                                              the server side, and
                                              attempted to overlay them
                                              with the client
                                              side (there is
                                              some unknown offset,
                                              but probably on the order
                                              of 10s of microseconds at
                                              worst, given the pattern):<br class="">
                                              <br class="">
<PastedGraphic-7.png><br class="">
                                              <br class="">
                                              I blew up the first
                                              few timesteps of one of
                                              the trials:<br class="">
<PastedGraphic-8.png><br class="">
                                              <br class="">
                                              The different colors
                                              are different segments
                                              of the handler, but
                                              there doesn’t seem to be
                                              anything too interesting going
                                              on inside the handler. So
                                              it looks like the time is
                                              being introduced before
                                              the 3rd RPC
                                               handler starts, based on
                                               where the gap appears
                                               on the server side.<br class="">
                                              <br class="">
                                              To try and isolate
                                              any dataspaces-specific behavior,
                                              I created a pure Margo
                                              test case that just sends
                                              a single rpc of the same
                                              size
                                              as dataspaces iteratively, whre
                                              the server side does an
                                              8-byte bulk
                                              transfer initiated by the
                                              server, and sends
                                              a response. The results
                                              are similar, except that
                                              it is now the 4th and
                                              8th timestep that are slow
                                              (and the first timestep
                                              is VERY
                                              long, presumably because
                                              rxm communication state is
being established. DataSpaces has an earlier RPC in its init that
                                              was absorbing
                                              this latency).<br class="">
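                                               The server-side handler for
                                               that test looks roughly like
                                               this (error handling elided
                                               and the input struct
                                               simplified; put_in_t and the
                                               bulk_handle field are
                                               stand-in names):<br class="">
                                               <pre>#include &lt;margo.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id     mid  = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);
    put_in_t  in;                      /* simplified stand-in input type */
    int32_t   out = 0;                 /* 4-byte response payload */
    hg_size_t len = 8;                 /* 8-byte server-initiated pull */
    void     *buf = malloc(len);
    hg_bulk_t local_bulk;

    margo_get_input(handle, &amp;in);
    margo_bulk_create(mid, 1, &amp;buf, &amp;len, HG_BULK_WRITE_ONLY, &amp;local_bulk);

    /* pull the client's exposed region into the local buffer */
    margo_bulk_transfer(mid, HG_BULK_PULL, info-&gt;addr, in.bulk_handle, 0,
                        local_bulk, 0, len);

    margo_respond(handle, &amp;out);

    margo_bulk_free(local_bulk);
    free(buf);
    margo_free_input(handle, &amp;in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)</pre>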
                                              <br class="">
                                              I got
                                              margo profiling results
                                              for this test case:<br class="">
                                              <br class="">
                                              ```<br class="">
                                              3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
                                              0xa2a1,term_rpc<br class="">
                                              0x27b5,put_rpc<br class="">
                                              0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
                                              ```<br class="">
                                              <br class="">
                                              So I guess my question
                                              at this point is, is there
                                              any sensible reason why
                                              the 4th and 8th RPC sent
                                              would have a long response
                                              time? I think I’ve cleared
                                              my code on the client
                                              side and server side, so
                                              it appears to be latency
                                              being introduced
                                              by Margo, LibFabric, Argobots,
                                              or the underlying OS. I do
                                              see long
                                              timesteps occasionally after
                                              this (perhaps
                                              every 20-30 timesteps)
                                              but these are
                                              not consistent.<br class="">
                                              <br class="">
                                              One last detail: this does
                                              not happen on Summit.
                                               On Summit, I see about
                                              5-7x worse single-RPC performance (250-350 microseconds per
                                              RPC), but without
                                              the intermittent long timesteps.<br class="">
                                              <br class="">
                                              I can provide the
                                              minimal test case if it
                                              would be helpful. I
                                              am using APEX
                                              for timing results,
                                              and the
                                              following dependencies with
                                              Spack:<br class="">
                                              <br class="">
                                              <a class="moz-txt-link-abbreviated
                                                moz-txt-link-freetext" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>  rdma-core@20<br class="">
                                              <br class="">
                                              Thanks,<br class="">
                                              Philip<br class="">
                                              <br class="">
                                              <br class="">
                                              <br class="">
                                              <br class="">
                                              <div class="">_______________________________________________</div>
                                              <div class="">mochi-devel
                                                mailing list</div>
                                              <br class="Apple-interchange-newline">
                                              <a class="moz-txt-link-abbreviated
                                                moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EiX9%2B4KlSbKB7w2ov4OZtE7jCugYdchESAfX0D%2B%2FK7c%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="hT2kkjAIf/zHJhvtfpngbaSAG6BZQ0v653UkY5DV3lvADy7nxpFwHs9680ZLFYC4A/ZkOaDhVaoVT0dhguoFth03P3pU/ankBuXMiNnP4KyfpClP97DwHHcU9nwM04kJb9eamI6B/jo5QLEuZAQ97t2SUt+s9UTzpXLcXc/6/7I=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vyMI4AQ%2BbXY4WVJocNVN%2BYJp6H1tYsP7meHgbjriejE%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="ncm2uVm0Ugenklbk5rqBtY39R8xrytsxPVYOdy2NqnuC3nOSGasq7//KT8rW9cbT43LvCOivxAxgLyWaANsZOv0dwLdcqrgqnldIxtfDSfJzCXFTgmgI++3ODNu7UIcNUeMtmBw8YYusppzZlNxvPHAunqN44IwZIk9lYE5wtW0=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                                            </blockquote>
                                          </blockquote>
                                          <br class="">
                                        </blockquote>
                                      </blockquote>
                                      <br class="">
                                    </blockquote>
                                  </blockquote>
                                  <br class="">
                                </blockquote>
                              </blockquote>
                              <br class="">
                            </blockquote>
                          </blockquote>
                          <br class="">
                        </blockquote>
                      </blockquote>
                      <br class="">
                    </span></blockquote>
                </div>
              </div>
            </blockquote>
          </div>
          <br class="">
        </div>
      </div>
    </blockquote>
  </body>
</html>