<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 11/5/21 12:23 PM, Philip Davis
wrote:<br>
</div>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu">
That’s a good find. I’m reading through the code and
documentation, and I’m having a little trouble understanding what
the difference between 0 and -1 for that last argument is when the
second to last argument is 0. I see in the documentation:
<div class=""><br class="">
</div>
<div class=""><span style="caret-color: rgb(64, 64, 64); color:
rgb(64, 64, 64); font-family: Lato, proxima-nova,
"Helvetica Neue", Arial, sans-serif; font-size:
16px; background-color: rgb(252, 252, 252);" class="">The
third argument indicates whether an Argobots execution stream
(ES) should be created to run the Mercury progress loop. If
this argument is set to 0, the progress loop is going to run
in the context of the main ES (this should be the standard
scenario, unless you have a good reason for not using the main
ES, such as the main ES using MPI primitives that could block
the progress loop). A value of 1 will make Margo create an ES
to run the Mercury progress loop. The fourth argument is the
number of ES to create and use for executing RPC handlers. A
value of 0 will make Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing:
border-box; font-family: SFMono-Regular, Menlo, Monaco,
Consolas, "Liberation Mono", "Courier
New", Courier, monospace; white-space: nowrap; max-width:
100%; background-color: rgb(255, 255, 255); border: 1px solid
rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60);
overflow-x: auto;"><span class="pre" style="box-sizing:
border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color: rgb(64, 64, 64);
font-family: Lato, proxima-nova, "Helvetica Neue",
Arial, sans-serif; font-size: 16px; background-color: rgb(252,
252, 252);" class="">. A value of -1 will make Margo execute
the RPCs in the ES running the progress loop. A positive value
will make Margo create new ESs to run the RPCs.</span></div>
<div class=""><font class="" size="3" face="Lato, proxima-nova,
Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64); background-color:
rgb(252, 252, 252);" class=""><br class="">
</span></font></div>
<span class="">What is the difference between the ‘main ES’ (last
two arguments 0,-1) and the 'ES that called margo_init’ (last
two arguments 0,0) in the absence of me creating new execution
streams? Or maybe I’m not interpreting the documentation
correctly?<br class="">
</span></blockquote>
<p><br>
</p>
<p>You are right, those are equivalent when the next to last
argument is 0 :) The main ES and progress thread ES are one and
the same in that case, so the RPCs go to the same place either
way.</p>
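<p>For concreteness, here is a minimal sketch of the two calls in
question (the address string and mode are placeholders; only the last
two arguments matter here, and margo.h is assumed to be included):</p>
<pre>
/* (a) progress loop in the main ES (3rd arg 0), RPC handlers in the ES
 *     running the progress loop (4th arg -1) */
margo_instance_id mid_a = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1);

/* (b) progress loop in the main ES (3rd arg 0), RPC handlers in the ES
 *     that called margo_init() (4th arg 0), which is the same main ES,
 *     so (a) and (b) end up equivalent */
margo_instance_id mid_b = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 0);
</pre>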
<p>I've narrowed it down a little further and found that the stalls
occur when there are 2 dedicated RPC ESs but not when there is
just 1 dedicated RPC ES. That isolates the problem slightly
further: it's not just some cost associated with relaying ULTs to
another thread (that happens every time in the 1-handler case
too), but something that only happens when multiple ESs could
potentially service an RPC.</p>
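<p>For reference, a hypothetical sketch of the two server
configurations being compared here (the address string and the
progress-loop argument are placeholders; the intended difference is
only the number of dedicated RPC handler ESs):</p>
<pre>
/* one dedicated RPC handler ES: no stalls observed */
margo_instance_id mid_one = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 1);

/* two dedicated RPC handler ESs sharing the handler pool: the
 * configuration where the intermittent stalls show up */
margo_instance_id mid_two = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 2);
</pre>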
<p>thanks,</p>
<p>-Phil<br>
<span class=""></span></p>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu"><span class="">
<blockquote type="cite" class="">On Nov 5, 2021, at 10:29 AM,
Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
wrote:<br class="">
<br class="">
Srinivasan Ramesh (U. Oregon) has done some work on
fine-grained RPC component timing, but it's not in the
mainline margo tree so we'll need to do a little more work to
look at it.<br class="">
<br class="">
In the meantime, on a hunch, I found that I can make the
latency consistent on Cooley by altering the margo_init()
arguments to be (..., 0, -1) in server.c (meaning that no
additional execution streams are used at all; all Mercury
progress and all RPC handlers are executed using user-level
threads in the process's primary execution stream (OS thread)).<br class="">
<br class="">
It's expected that there would be some more jitter jumping
across OS threads for RPC handling, but it shouldn't be that
extreme, regular, or system-specific.<br class="">
<br class="">
Thanks again for the test case and the Apex instrumentation;
this is the sort of thing that's normally really hard to
isolate.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">
That’s extremely interesting. <br class="">
<br class="">
Are there any internal timers in Margo that can tell what
the delay was between the server’s progress thread
queueing the RPC and the handler thread starting to handle
it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
it seems to me like that is the most likely place for
non-deterministic delay to be introduced by Argobots in the
client -> server direction.<br class="">
<br class="">
I just ran a quick test where I changed the number of
handler threads to 5, and I saw no change in behavior (still
4 and 8, not 5 and 10).<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 4, 2021, at 9:04 PM,
Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
I have some more interesting info to share from trying a
few different configurations.<br class="">
<br class="">
sm (on my laptop) and ofi+gni (on theta) do not exhibit
this behavior; they have consistent performance across
RPCs.<br class="">
<br class="">
ofi+verbs (on cooley) shows the same thing you were
seeing; the 4th and 8th RPCs are slow.<br class="">
<br class="">
Based on the above, it sounds like a problem with the
libfabric/verbs path. But on Jerome's suggestion I tried
some other supported transports on Cooley as well. In
particular, I ran the same benchmark (the same build in
fact; I just compiled in support for multiple transports
and cycled through them in a single job script
with runtime options) with these combinations:<br class="">
<br class="">
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+verbs<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+tcp<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+sockets<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
bmi+tcp<br class="">
</div>
All of them show the same thing! 4th and 8th RPCs are at
least an order of magnitude slower than the other RPCs.
That was a surprising result. The bmi+tcp one isn't even
using libfabric at all, even though they are all using the
same underlying hardware.<br class="">
<br class="">
I'm not sure what to make of that yet. Possibly something
with threading or signalling?<br class="">
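<br class="">
For reference, a hypothetical sketch of how a single server build can
cycle through those transports, assuming the protocol string is passed
in from the job script (the file name server.c and the last two
margo_init() arguments are placeholders, not the actual benchmark
code):<br class="">
<pre>
#include <margo.h>

int main(int argc, char **argv)
{
    /* e.g. "ofi+verbs://", "ofi+tcp://", "ofi+sockets://", "bmi+tcp://" */
    const char *proto = (argc > 1) ? argv[1] : "ofi+tcp://";

    margo_instance_id mid = margo_init(proto, MARGO_SERVER_MODE, 0, -1);
    if (mid == MARGO_INSTANCE_NULL)
        return 1;

    /* ... register RPCs, print the listening address, etc. ... */

    margo_wait_for_finalize(mid);
    return 0;
}
</pre>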
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">
I’m glad you were able to reproduce it on a different
system, thanks for letting me know. I’m not sure what
the overlaps between Frontera and Cooley are (that
aren’t shared by Summit); a quick look shows they are
both Intel, and both FDR, but there are probably more
salient details.<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021, at 2:24
PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Ok. Interesting. I didn't see anything unusual in
the timing on my laptop with sm (other than it being
a bit slow, but I wasn't tuning or worrying about core
affinity or anything). On Cooley, a somewhat
older Linux cluster with InfiniBand, I see the 4th and
8th RPC delay you were talking about:<br class="">
<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
(assuming the first is high due to connection
establishment)<br class="">
<br class="">
I'll check some other systems/transports, but I wanted
to go ahead and share that I've been able to reproduce
what you were seeing.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">
Glad that’s working now.<br class="">
<br class="">
It is the put_wait events, and “dur” is the right
field. Those units are microseconds.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021,
at 1:12 PM, Phil Carns
<a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Thanks Philip, the "= {0};" initialization of
that struct got me going.<br class="">
<br class="">
I can run the test case now and it is producing
output in the client and server perf dirs. Just
to sanity check what to look for, I think the
problem should be exhibited in the "put_wait" or
maybe "do_put" trace events on the client? For
example on my laptop I see this:<br class="">
<br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]>
grep do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep
put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
I should look at the "dur" field right? What are
the units on that?<br class="">
<br class="">
I'll see if I can run this on a "real" system
shortly.<br class="">
<br class="">
thanks!<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 12:11 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
Sorry the data structures are like that; I
wanted to preserve as much of the RPC size and
ordering as possible, in case it ended up being important.<br class="">
<br class="">
I’m surprised in.odsc.size is troublesome, as
I set in.odsc.size with the line `in.odsc.size
= sizeof(odsc);`. I’m not sure what could
be corrupting that value in the meantime.<br class="">
<br class="">
I don’t set in.odsc.gdim_size (which was an
oversight, since that’s non-zero normally), so
I’m less surprised that’s an issue. I thought
I initialized `in` to zero, but I see I
didn’t do that after all.<br class="">
<br class="">
Maybe change the line `bulk_gdim_t in;` to
`bulk_gdim_t in = {0};`<br class="">
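<br class="">
In other words, a hypothetical fragment (the field names are the ones
mentioned above; everything else stays as in the reproducer):<br class="">
<pre>
bulk_gdim_t in = {0};         /* zero-initializes every field, including gdim_size */
in.odsc.size = sizeof(odsc);  /* set explicitly, as before; the rest stays zero */
</pre>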
<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021,
at 11:48 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Awesome, thanks Philip. It came through fine.<br class="">
<br class="">
I started by modifying the job script slightly
to just run it on my laptop with sm (I
wanted to make sure I understood the test
case, and how to use apex,
before trying elsewhere). Does in.size need
to be set in client.c? For me there is
a random value in that field and it is causing
the
encoder on the forward
to attempt a very large allocation. The same
might be true of gdim_size if it got past that
step. I started to alter them but then I
wasn't sure what the implications were.<br class="">
<br class="">
(fwiw I needed to include stdlib.h
in common.h, but I've hit that a couple
of times recently on codes
that didn't previously generate warnings;
I think something in Ubuntu has gotten
strict about that recently).<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
<br class="">
<br class="">
On 11/1/21 4:51 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
I’ve attached the reproducer. I see the
4th and 8th issue on Frontera, but
not Summit. Hopefully it will build and run
without too much modification. Let me know
if there are any issues with running it
(or if the anl listserv eats the
tarball, which I kind of expect).<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 1,
2021, at 11:14 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Hi Philip,<br class="">
<br class="">
(FYI I think the first image didn't
come through in your email, but I
think the others are sufficient to get
across what you are seeing)<br class="">
<br class="">
I don't have any idea what would
cause that.
The recently released libfabric 1.13.2 (available
in spack from
the mochi-spack-packages repo) includes
some fixes to the rxm provider that could
be relevant to Frontera and Summit,
but nothing that aligns with what you
are observing.<br class="">
<br class="">
If it were later in the sequence
(much later) I would speculate
that memory allocation/deallocation cycles
were eventually causing a hiccup.
We've seen something like that in the
past, and it's a theory that we could then
test
with alternative allocators like jemalloc. But
memory allocation jitter wouldn't show up that
early in the run.<br class="">
<br class="">
Please do share your reproducer if you
don't mind! We can try a few systems
here to at least isolate if it is
something peculiar to the InfiniBand path
or if there is a more general problem
in Margo.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 10/29/21 3:20 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" class="">Hello,<br class="">
<br class="">
I apologize in advance for the
winding nature of this email. I’m
not sure how to ask
my question without explaining the story
of my results some.<br class="">
<br class="">
I’m doing some characterization of our
server performance under load, and I have a
performance quirk that I
wanted to run by people to see if it
makes sense. My testing so far has been
to iteratively send batches of RPCs
using margo_iforward, and then measure
the wait time until they are
all complete. On the server side,
handling the RPC includes
a margo_bulk_transfer, a
pull initiated on the server to fetch
(for now) 8 bytes. The payload of
the RPC request is about 500 bytes, and
the response payload is 4 bytes.<br class="">
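<br class="">
Roughly, the client side of each timestep looks like the following
sketch (hypothetical; names such as NUM_RPCS, put_id, server_addr, and
bulk_out_t are placeholders rather than the actual reproducer code,
and margo.h is assumed to be included):<br class="">
<pre>
hg_handle_t   handles[NUM_RPCS];
margo_request reqs[NUM_RPCS];
bulk_gdim_t   in = {0};     /* ~500-byte request payload */
bulk_out_t    out;          /* 4-byte response payload */

/* issue the whole batch without waiting */
for (int i = 0; i < NUM_RPCS; i++) {
    margo_create(mid, server_addr, put_id, &handles[i]);
    margo_iforward(handles[i], &in, &reqs[i]);
}

/* "put_wait": time until every response has arrived */
for (int i = 0; i < NUM_RPCS; i++) {
    margo_wait(reqs[i]);
    margo_get_output(handles[i], &out);
    margo_free_output(handles[i], &out);
    margo_destroy(handles[i]);
}
</pre>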
<br class="">
I’ve isolated my results down to one
server rank and one client rank, because
it’s an easier starting point to
reason from. Below is a graph of some of
my initial results. These results
are from Frontera. The median times are
good (a single RPC takes on the order of
10s
of microseconds, which
seems fantastic). However, the outliers
are fairly high (note the log scale of
the y-axis). With only one
RPC per timestep, for example, there is
a 30x spread between the median and the
max.<br class="">
<br class="">
<img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
<br class="">
I was hoping (expecting) the
first timestep would be where the long
replies resided, but that turned out not
to be the case. Below are traces
from the 1 RPC (blue) and 2 RPC
(orange) per timestep cases, 5 trials
of 10 timesteps for each
case (normalized to fit the same
x-axis):<br class="">
<br class="">
<PastedGraphic-6.png><br class="">
<br class="">
What strikes me is how consistent these
results are across trials. For the 1 RPC
per timestep case, the 3rd and 7th
timesteps are consistently slow (and the
rest are fast). For the 2 RPC
per timestep case, the 2nd and 4th
timesteps are always slow, and sometimes
the 10th is.
These results are repeatable, with very
rare variation.<br class="">
<br class="">
For the single RPC case, I recorded
some timers on the server side, and
attempted to overlay them with
the client side (there is
some unknown offset, but probably on the
order of 10s of microseconds at
worst, given the pattern):<br class="">
<br class="">
<PastedGraphic-7.png><br class="">
<br class="">
I blew up the first few timesteps of one
of the trials:<br class="">
<PastedGraphic-8.png><br class="">
<br class="">
The different colors
are different segments of the
handler, but there doesn’t seem to be
anything too interesting going on inside
the handler. So it looks like the time
is being introduced before the 3rd RPC
handler starts, based on where the
gap appears on the server side.<br class="">
<br class="">
To try to isolate
any DataSpaces-specific behavior,
I created a pure Margo test case that
just iteratively sends a single RPC of the
same size as DataSpaces, where
the server side does an 8-byte bulk
transfer initiated by the server and
sends a response. The results
are similar, except that it is now the
4th and 8th timesteps that are slow
(and the first timestep is VERY
long, presumably because
rxm communication state is
being established; DataSpaces has an
earlier RPC in its init that
was absorbing this latency).<br class="">
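<br class="">
For reference, the server-side handler in such a test case would look
roughly like this sketch (hypothetical; the input/output types and the
bulk-handle field name are placeholders, not the actual test case):<br class="">
<pre>
static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);
    bulk_gdim_t in;                  /* placeholder input type */
    bulk_out_t out = {0};            /* placeholder 4-byte response */
    char buf[8];                     /* target of the 8-byte pull */
    void *ptr = buf;
    hg_size_t size = sizeof(buf);
    hg_bulk_t local_bulk;

    margo_get_input(handle, &in);
    margo_bulk_create(mid, 1, &ptr, &size, HG_BULK_WRITE_ONLY, &local_bulk);
    /* server-initiated pull from the client's exposed region
     * (in.handle is a placeholder name for the client's bulk handle) */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.handle, 0,
                        local_bulk, 0, size);
    margo_respond(handle, &out);

    margo_bulk_free(local_bulk);
    margo_free_input(handle, &in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
</pre>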
<br class="">
I got margo profiling results for this
test case:<br class="">
<br class="">
```<br class="">
3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
0xa2a1,term_rpc<br class="">
0x27b5,put_rpc<br class="">
0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
```<br class="">
<br class="">
So I guess my question at this point
is: is there any sensible reason why
the 4th and 8th RPCs sent would have a
long response time? I think I’ve ruled out
my code on the client side and
server side, so it appears to be latency
being introduced
by Margo, libfabric, Argobots, or the
underlying OS. I do see long
timesteps occasionally after
this (perhaps every 20-30 timesteps),
but these are not consistent.<br class="">
<br class="">
One last detail: this does not happen
on Summit. On Summit, I see about
5-7x worse single-RPC performance (250-350 microseconds per
RPC), but without
the intermittent long timesteps.<br class="">
<br class="">
I can provide the minimal test case
if it would be helpful. I am using APEX
for timing results, and the
following dependencies with Spack:<br class="">
<br class="">
<a class="moz-txt-link-abbreviated" href="mailto:argobots@1.1">argobots@1.1</a> <a class="moz-txt-link-abbreviated" href="mailto:json-c@0.15">json-c@0.15</a> <a class="moz-txt-link-abbreviated" href="mailto:libfabric@1.13.1">libfabric@1.13.1</a> <a class="moz-txt-link-abbreviated" href="mailto:mercury@2.0.1">mercury@2.0.1</a> <a class="moz-txt-link-abbreviated" href="mailto:mochi-margo@0.9.5">mochi-margo@0.9.5</a> rdma-core@20<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<div class="">_______________________________________________</div>
<div class="">mochi-devel mailing list</div>
<br class="Apple-interchange-newline">
<a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
_______________________________________________<br class="">
mochi-devel mailing list<br class="">
<a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</span>
</blockquote>
</body>
</html>