<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 11/5/21 12:23 PM, Philip Davis
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu">
      
      That’s a good find. I’m reading through the code and
      documentation, and I’m having a little trouble understanding what
      the difference between 0 and -1 for that last argument is when the
      second-to-last argument is 0. I see in the documentation:
      <div class=""><br class="">
      </div>
      <div class=""><span style="caret-color: rgb(64, 64, 64); color:
          rgb(64, 64, 64); font-family: Lato, proxima-nova,
          "Helvetica Neue", Arial, sans-serif; font-size:
          16px; background-color: rgb(252, 252, 252);" class="">The
          third argument indicates whether an Argobots execution stream
          (ES) should be created to run the Mercury progress loop. If
          this argument is set to 0, the progress loop is going to run
          in the context of the main ES (this should be the standard
          scenario, unless you have a good reason for not using the main
          ES, such as the main ES using MPI primitives that could block
          the progress loop). A value of 1 will make Margo create an ES
          to run the Mercury progress loop. The fourth argument is the
          number of ES to create and use for executing RPC handlers. A
          value of 0 will make Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing:
          border-box; font-family: SFMono-Regular, Menlo, Monaco,
          Consolas, "Liberation Mono", "Courier
          New", Courier, monospace; white-space: nowrap; max-width:
          100%; background-color: rgb(255, 255, 255); border: 1px solid
          rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60);
          overflow-x: auto;"><span class="pre" style="box-sizing:
            border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color: rgb(64, 64, 64);
          font-family: Lato, proxima-nova, "Helvetica Neue",
          Arial, sans-serif; font-size: 16px; background-color: rgb(252,
          252, 252);" class="">. A value of -1 will make Margo execute
          the RPCs in the ES running the progress loop. A positive value
          will make Margo create new ESs to run the RPCs.</span></div>
      <div class=""><font class="" size="3" face="Lato, proxima-nova,
          Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64); background-color:
            rgb(252, 252, 252);" class=""><br class="">
          </span></font></div>
      <span class="">What is the difference between the ‘main ES’ (last
        two arguments 0,-1) and the 'ES that called margo_init’  (last
        two arguments 0,0) in the absence of me creating new execution
        streams? Or maybe I’m not interpreting the documentation
        correctly?<br class="">
      </span></blockquote>
    <p><br>
    </p>
    <p>You are right, those are equivalent when the next to last
      argument is 0 :)  The main ES and progress thread ES are one and
      the same in that case, so the RPCs go to the same place either
      way.</p>
    <p>I've narrowed it down a little further and found that the stalls
      occur when there are 2 dedicated RPC ESs but not when there is
      just 1 dedicated RPC ES.  That isolates the problem slightly
      further: it's not just some cost associated with relaying
      ULTs to another thread (that happens every time in the 1-handler
      case too), but something that only happens when multiple
      ESs could potentially service an RPC.</p>
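    <p>For reference, here is a minimal sketch (not taken from the test case;
      the address string, transport, and mode are placeholders) of how the last
      two margo_init() arguments map onto the configurations discussed above:</p>
    <pre>#include <margo.h>

int main(void)
{
    /* (0, -1): run the Mercury progress loop in the main ES and execute
     * RPC handlers in the ES running the progress loop, so everything
     * stays in the process's primary ES.  (0, 0) is equivalent here,
     * because the ES that called margo_init() is the one running the
     * progress loop. */
    margo_instance_id mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1);
    if (mid == MARGO_INSTANCE_NULL)
        return -1;

    /* (1, 2) would instead create a dedicated progress ES plus 2 dedicated
     * RPC handler ESs -- the configuration in which the stalls appear. */

    margo_wait_for_finalize(mid);
    return 0;
}</pre>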
    <p>thanks,</p>
    <p>-Phil<br>
      <span class=""></span></p>
    <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu"><span class="">
        <blockquote type="cite" class="">On Nov 5, 2021, at 10:29 AM,
          Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
          wrote:<br class="">
          <br class="">
          Srinivasan Ramesh (U. Oregon) has done some work on
          fine-grained RPC component timing, but it's not in the
          mainline margo tree so we'll need to do a little more work to
          look at it.<br class="">
          <br class="">
          In the meantime, on a hunch, I found that I can make the
          latency consistent on Cooley by altering the margo_init()
          arguments to be (..., 0, -1) in server.c (meaning that no
          additional execution streams are used at all; all Mercury
          progress and all RPC handlers are executed using user-level
          threads in the process's primary execution stream, i.e. the OS
          thread).<br class="">
          <br class="">
          It's expected that there would be some more jitter jumping
          across OS threads for RPC handling, but it shouldn't be that
          extreme, regular, or system-specific.<br class="">
          <br class="">
          Thanks again for the test case and the Apex instrumentation;
          this is the sort of thing that's normally really hard to
          isolate.<br class="">
          <br class="">
          thanks,<br class="">
          <br class="">
          -Phil<br class="">
          <br class="">
          On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
          <blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">
            That’s extremely interesting. <br class="">
            <br class="">
            Are there any internal timers in Margo that can tell what
            the delay was between the server’s progress thread
            queueing the rpc and the handler thread starting to handle
            it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
            it seems to me like that is the most likely place for
            non-deterministic delay to be introduced by argobots in the
            client -> server direction.<br class="">
            <br class="">
            I just ran a quick test where I changed the number of
            handler threads to 5, and I saw no change in behavior (still
            4 and 8, not 5 and 10).<br class="">
            <br class="">
            <br class="">
            <blockquote type="cite" class="">On Nov 4, 2021, at 9:04 PM,
              Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
              <br class="">
              I have some more interesting info to share from trying a
              few different configurations.<br class="">
              <br class="">
              sm (on my laptop) and ofi+gni (on theta) do not exhibit
              this behavior; they have consistent performance across
              RPCs.<br class="">
              <br class="">
              ofi+verbs (on cooley) shows the same thing you were
              seeing; the 4th and 8th RPCs are slow.<br class="">
              <br class="">
              Based on the above, it sounds like a problem with the
              libfabric/verbs path.  But on Jerome's suggestion I tried
              some other supported transports on cooley as well.  In
              particular I ran the same benchmark (the same build in
              fact; I just compiled in support for multiple transports
              and cycled through them in a single job script
              with runtime options) with these combinations:<br class="">
              <br class="">
              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                ofi+verbs<br class="">
              </div>
              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                ofi+tcp<br class="">
              </div>
              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                ofi+sockets<br class="">
              </div>
              <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
                bmi+tcp<br class="">
              </div>
              All of them show the same thing!  4th and 8th RPCs are at
              least an order of magnitude slower than the other RPCs.
               That was a surprising result.  The bmi+tcp one isn't even
              using libfabric at all, even though they are all using the
              same underlying hardware.<br class="">
              <br class="">
              I'm not sure what to make of that yet.  Possibly something
              with threading or signalling?<br class="">
              <br class="">
              thanks,<br class="">
              <br class="">
              -Phil<br class="">
              <br class="">
              On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
              <blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">
                I’m glad you were able to reproduce it on a different
                system, thanks for letting me know. I’m not sure what
                the overlaps between Frontera and Cooley are (that
                aren’t shared by Summit); a quick look shows they are
                 both Intel, and both FDR, but there are probably more
                 salient details.<br class="">
                <br class="">
                <blockquote type="cite" class="">On Nov 2, 2021, at 2:24
                  PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
                  <br class="">
                  Ok.  Interesting.  I didn't see anything unusual in
                  the timing on my laptop with sm (other than it being
                  a bit slow, but I wasn't tuning or worrying about core
                  affinity or anything).  On Cooley, a somewhat
                  older Linux cluster with InfiniBand, I see the 4th and
                  8th RPC delay you were talking about:<br class="">
                  <br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                  <br class="">
                  (assuming the first is high due to connection
                  establishment)<br class="">
                  <br class="">
                  I'll check some other systems/transports, but I wanted
                  to go ahead and share that I've been able to reproduce
                  what you were seeing.<br class="">
                  <br class="">
                  thanks,<br class="">
                  <br class="">
                  -Phil<br class="">
                  <br class="">
                  On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">
                  <blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">
                    Glad that’s working now.<br class="">
                    <br class="">
                    It is the put_wait events, and “dur” is the right
                    field. Those units are microseconds.<br class="">
                    <br class="">
                    <br class="">
                    <blockquote type="cite" class="">On Nov 2, 2021,
                      at 1:12 PM, Phil Carns
                      <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
                      <br class="">
                      Thanks Philip, the "= {0};" initialization of
                      that struct got me going.<br class="">
                      <br class="">
                      I can run the test case now and it is producing
                      output in the client and server perf dirs.  Just
                      to sanity check what to look for, I think the
                      problem should be exhibited in the "put_wait" or
                      maybe "do_put" trace events on the client?  For
                      example on my laptop I see this:<br class="">
                      <br class="">
                      carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]>
                      grep do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
                      carns-x1-7g ~/w/d/d/m/client.perf> grep
                      put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
                      <br class="">
                      I should look at the "dur" field right?  What are
                      the units on that?<br class="">
                      <br class="">
                      I'll see if I can run this on a "real" system
                      shortly.<br class="">
                      <br class="">
                      thanks!<br class="">
                      <br class="">
                      -Phil<br class="">
                      <br class="">
                      On 11/2/21 12:11 PM, Philip Davis wrote:<br class="">
                      <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">
                        Hi Phil,<br class="">
                        <br class="">
                        Sorry the data structures are like that; I
                        wanted to preserve as much of the RPC size and
                        ordering as possible, in case it ended up being important.<br class="">
                        <br class="">
                        I’m surprised in.odsc.size is troublesome, as
                        I set in.odsc.size with the line `in.odsc.size
                        = sizeof(odsc);`. I’m not sure what could
                        be corrupting that value in the meantime.<br class="">
                        <br class="">
                        I don’t set in.odsc.gdim_size (which was an
                        oversight, since that’s non-zero normally), so
                        I’m less surprised that’s an issue. I thought
                        I initialized `in` to zero, but I see I
                        didn’t do that after all.<br class="">
                        <br class="">
                        Maybe change the line `bulk_gdim_t in;` to
                        `bulk_gdim_t in = {0};`<br class="">
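                        For illustration, a minimal sketch of that fix
                        (only the fields mentioned above are shown, not
                        the full struct):<br class="">
                        <br class="">
                        <pre>/* sketch only: zero-initialize the input struct so unset fields
 * (e.g. odsc.gdim_size) aren't left holding garbage */
bulk_gdim_t in = {0};
in.odsc.size = sizeof(odsc);   /* then fill in the fields that matter */</pre>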
                        <br class="">
                        <br class="">
                        <br class="">
                        <blockquote type="cite" class="">On Nov 2, 2021,
                          at 11:48 AM, Phil
                          Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
                          <br class="">
                          Awesome, thanks Philip.  It came through fine.<br class="">
                          <br class="">
                          I started by modifying the job script slightly
                          to just run it on my laptop with sm (I
                          wanted to make sure I understood the test
                          case, and how to use apex,
                          before trying elsewhere).  Does in.size need
                          to be set in client.c?  For me there is
                          a random value in that field and it is causing
                          the encoder on the forward
                          to attempt a very large allocation.  The same
                          might be true of gdim_size if it got past that
                          step.  I started to alter them but then I
                          wasn't sure what the implications were.<br class="">
                          <br class="">
                          (fwiw I needed to include stdlib.h
                          in common.h, but I've hit that a couple
                          of times recently on codes
                          that didn't previously generate warnings;
                          I think something in Ubuntu has gotten
                          strict about that recently).<br class="">
                          <br class="">
                          thanks,<br class="">
                          <br class="">
                          -Phil<br class="">
                          <br class="">
                          <br class="">
                          <br class="">
                          On 11/1/21 4:51 PM, Philip Davis wrote:<br class="">
                          <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
                            Hi Phil,<br class="">
                            <br class="">
                            I’ve attached the reproducer. I see the
                            4th and 8th issue on Frontera, but
                            not Summit. Hopefully it will build and run
                            without too much modification. Let me know
                            if there are any issues with running it
                            (or if the anl listserv eats the
                            tarball, which I kind of expect).<br class="">
                            <br class="">
                            Thanks,<br class="">
                            Philip<br class="">
                            <br class="">
                            <br class="">
                            <blockquote type="cite" class="">On Nov 1,
                              2021, at 11:14 AM, Phil
                              Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
                              <br class="">
                              Hi Philip,<br class="">
                              <br class="">
                              (FYI I think the first image didn't
                              come through in your email, but I
                              think the others are sufficient to get
                              across what you are seeing)<br class="">
                              <br class="">
                              I don't have any idea what would
                              cause that.
                               The recently released libfabric 1.13.2 (available
                              in spack from
                              the mochi-spack-packages repo) includes
                              some fixes to the rxm provider that could
                              be relevant to Frontera and Summit,
                              but nothing that aligns with what you
                              are observing.<br class="">
                              <br class="">
                              If it were later in the sequence
                              (much later) I would speculate
                              that memory allocation/deallocation cycles
                              were eventually causing a hiccup.
                               We've seen something like that in the
                              past, and it's a theory that we could then
                              test
                               with alternative allocators like jemalloc.  That
                               sort of allocation jitter wouldn't appear this
                               early in the run, though.<br class="">
                              <br class="">
                              Please do share your reproducer if you
                              don't mind!  We can try a few systems
                              here to at least isolate if it is
                              something peculiar to the InfiniBand path
                              or if there is a more general problem
                              in Margo.<br class="">
                              <br class="">
                              thanks,<br class="">
                              <br class="">
                              -Phil<br class="">
                              <br class="">
                              On 10/29/21 3:20 PM, Philip Davis wrote:<br class="">
                              <blockquote type="cite" class="">Hello,<br class="">
                                <br class="">
                                I apologize in advance for the
                                winding nature of this email. I’m
                                not sure how to ask
                                my question without explaining the story
                                of my results some.<br class="">
                                <br class="">
                                I’m doing some characterization of our
                                server performance under load, and I
                                 have a quirk of performance that I
                                 wanted to run by people to see if it
                                 makes sense. My testing so far has been
                                to iteratively send batches of RPCs
                                using margo_iforward, and then measure
                                the wait time until they are
                                all complete. On the server side,
                                handling the RPC includes
                                a margo_bulk_transfer as a
                                pull initiated on the server to pull
                                (for now) 8 bytes. The payload of
                                the RPC request is about 500 bytes, and
                                the response payload is 4 bytes.<br class="">
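                                 A minimal sketch of that client-side
                                 pattern (the handle, RPC id, and
                                 bulk_gdim_t names are placeholders, not
                                 the actual test code):<br class="">
                                 <pre>#include <margo.h>

/* Issue a batch of non-blocking forwards, then time how long it takes to
 * wait for all of them to complete. */
static double send_batch(margo_instance_id mid, hg_addr_t server_addr,
                         hg_id_t rpc_id, int batch)
{
    hg_handle_t   h[batch];
    margo_request req[batch];
    bulk_gdim_t   in = {0};          /* ~500-byte request payload in the test */

    double t0 = ABT_get_wtime();
    for (int i = 0; i < batch; i++) {
        margo_create(mid, server_addr, rpc_id, &h[i]);
        margo_iforward(h[i], &in, &req[i]);  /* send the whole batch */
    }
    for (int i = 0; i < batch; i++) {
        margo_wait(req[i]);                  /* block until each completes */
        margo_destroy(h[i]);
    }
    return ABT_get_wtime() - t0;             /* the measured wait time */
}</pre>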
                                <br class="">
                                I’ve isolated my results down to one
                                server rank and one client rank, because
                                it’s an easier starting point to
                                reason from. Below is a graph of some of
                                my initial results. These results
                                are from Frontera. The median times are
                                 good (a single RPC takes on the order of
                                 10s of microseconds, which
                                 seems fantastic). However, the outliers
                                are fairly high (note the log scale of
                                the y-axis). With only one
                                RPC per timestep, for example, there is
                                a 30x spread between the median and the
                                max.<br class="">
                                <br class="">
                                <img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
                                <br class="">
                                I was hoping (expecting) the
                                first timestep would be where the long
                                replies resided, but that turned out not
                                to be the case. Below are traces
                                from the 1 RPC (blue) and 2 RPC
                                 (orange) per timestep cases, 5 trials
                                of 10 timesteps for each
                                 case (normalized to fix the same
                                 x-axis):<br class="">
                                <br class="">
                                <PastedGraphic-6.png><br class="">
                                <br class="">
                                What strikes me is how consistent these
                                results are across trials. For the 1 RPC
                                 per timestep case, the 3rd and 7th
                                 timesteps are consistently slow (and the
                                 rest are fast). For the 2 RPC
                                 per timestep case, the 2nd and 4th
                                 timesteps are always slow and sometimes
                                 the 10th is.
                                 These results are repeatable with very
                                 rare variation.<br class="">
                                <br class="">
                                For the single RPC case, I recorded
                                some timers on the server side, and
                                attempted to overlay them with
                                the client side (there is
                                some unknown offset, but probably on the
                                order of 10s of microseconds at
                                worst, given the pattern):<br class="">
                                <br class="">
                                <PastedGraphic-7.png><br class="">
                                <br class="">
                                I blew up the first few timesteps of one
                                of the trials:<br class="">
                                <PastedGraphic-8.png><br class="">
                                <br class="">
                                The different colors
                                are different segments of the
                                handler, but there doesn’t seem to be
                                anything too interesting going on inside
                                the handler. So it looks like the time
                                is being introduced before the 3rd RPC
                                 handler starts, based on where the
                                 gap appears on the server side.<br class="">
                                <br class="">
                                To try and isolate
                                any dataspaces-specific behavior,
                                I created a pure Margo test case that
                                 just sends a single RPC of the same size
                                 as dataspaces iteratively, where
                                the server side does an 8-byte bulk
                                transfer initiated by the server, and
                                sends a response. The results
                                are similar, except that it is now the
                                 4th and 8th timesteps that are slow
                                (and the first timestep is VERY
                                long, presumably because
                                rxm communication state is
                                being established. DataSpaces has an
                                earlier RPC in its init that
                                was absorbing this latency).<br class="">
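                                 For reference, a sketch of the kind of
                                 server-side handler described here (the
                                 input/output type names and the bulk
                                 handle field are assumptions, not the
                                 real test code):<br class="">
                                 <pre>#include <margo.h>

/* Pull 8 bytes from the client-exposed bulk region, then send a small
 * response.  bulk_gdim_t/bulk_out_t and the in.handle field are guesses. */
static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);

    bulk_gdim_t in;
    margo_get_input(handle, &in);

    char      buf[8];                 /* the 8-byte payload */
    void     *ptr = buf;
    hg_size_t sz  = sizeof(buf);
    hg_bulk_t local;
    margo_bulk_create(mid, 1, &ptr, &sz, HG_BULK_WRITE_ONLY, &local);

    /* server-initiated pull from the client's exposed memory */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.handle, 0,
                        local, 0, sz);

    bulk_out_t out = {0};             /* 4-byte response payload */
    margo_respond(handle, &out);

    margo_bulk_free(local);
    margo_free_input(handle, &in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)</pre>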
                                <br class="">
                                I got margo profiling results for this
                                test case:<br class="">
                                <br class="">
                                ```<br class="">
                                3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
                                0xa2a1,term_rpc<br class="">
                                0x27b5,put_rpc<br class="">
                                0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
                                0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
                                0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
                                ```<br class="">
                                <br class="">
                                So I guess my question at this point
                                is, is there any sensible reason why
                                the 4th and 8th RPC sent would have a
                                 long response time? I think I’ve ruled out
                                 my code on the client side and
                                 server side, so it appears to be latency
                                being introduced
                                by Margo, LibFabric, Argobots, or the
                                underlying OS. I do see long
                                timesteps occasionally after
                                this (perhaps every 20-30 timesteps)
                                but these are not consistent.<br class="">
                                <br class="">
                                One last detail: this does not happen
                                 on Summit. On Summit, I see about
                                5-7x worse single-RPC performance (250-350 microseconds per
                                RPC), but without
                                the intermittent long timesteps.<br class="">
                                <br class="">
                                I can provide the minimal test case
                                if it would be helpful. I am using APEX
                                for timing results, and the
                                following dependencies with Spack:<br class="">
                                <br class="">
<a class="moz-txt-link-abbreviated" href="mailto:argobots@1.1">argobots@1.1</a>  <a class="moz-txt-link-abbreviated" href="mailto:json-c@0.15">json-c@0.15</a>  <a class="moz-txt-link-abbreviated" href="mailto:libfabric@1.13.1">libfabric@1.13.1</a>  <a class="moz-txt-link-abbreviated" href="mailto:mercury@2.0.1">mercury@2.0.1</a>  <a class="moz-txt-link-abbreviated" href="mailto:mochi-margo@0.9.5">mochi-margo@0.9.5</a>  rdma-core@20<br class="">
                                <br class="">
                                Thanks,<br class="">
                                Philip<br class="">
                                <br class="">
                                <br class="">
                                <br class="">
                                <br class="">
                                <div class="">_______________________________________________</div>
                                <div class="">mochi-devel mailing list</div>
                                <br class="Apple-interchange-newline">
                                <a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                              </blockquote>
_______________________________________________<br class="">
                              mochi-devel mailing list<br class="">
                              <a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                            </blockquote>
                            <br class="">
                          </blockquote>
                        </blockquote>
                        <br class="">
                      </blockquote>
                    </blockquote>
                    <br class="">
                  </blockquote>
                </blockquote>
                <br class="">
              </blockquote>
            </blockquote>
            <br class="">
          </blockquote>
        </blockquote>
        <br class="">
      </span>
    </blockquote>
  </body>
</html>