<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Just to follow up for folks on the list: we've made some headway

      on isolating the source of the delays.  It isn't network specific,

      but rather an issue in the Argobots ULT creation path on some

      platforms.  It's being tracked now at

      <a class="moz-txt-link-freetext" href="https://github.com/pmodels/argobots/issues/369">https://github.com/pmodels/argobots/issues/369</a> with a standalone

      Argobots reproducer.  It should be a lot easier to hunt down from

      this point.<br>

    </p>

    <p>Some details:</p>

    <ul>

      <li>This only happens on particular machines (Frontera@TACC and

        Cooley@ALCF for example) but we don't know why just yet.<br>

      </li>

      <li>It only happens if you use dedicated threads (execution

        streams) for handling RPCs.  This is common for a lot of Mochi

        data services, though</li>

      <li>The test case also happens to highlight that we have some

        memory allocation/caching tuning work to do even on platforms

        that behave correctly.  We have a lot of options on that front,

        but we'll circle back to it later once the major

        platform-specific performance bug is resolved.</li>

    </ul>

    <p>thanks,</p>

    <p>-Phil<br>

    </p>

    <blockquote type="cite" cite="mid:7B252C38-02B7-40EE-A799-7856565F7C0E@rutgers.edu">

      One more thing: I tried the same test case with a non-trivial size

      for the bulk pull (8MB) just to shift the timing some. I still see

      the extra time appear on the 4th and 8th timestep. I had wondered

      if the issue was that the timestep period was small enough that

      some background task was not completing in time, but that does not

      seem to be the case:

      <div class=""><br class="">

        <div class="">

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000,

            "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000,

            "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000,

            "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000,

            "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000,

            "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000,

            "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000,

            "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000,

            "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000,

            "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>

          <div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000,

            "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>

          <div><br class="">

            <blockquote type="cite" class="">

              <div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>

                wrote:</div>

              <br class="Apple-interchange-newline">

              <div class="">

                <div class="">

                  <p class=""><br class="">

                  </p>

                  <div class="moz-cite-prefix">On 11/5/21 12:23 PM,

                    Philip Davis wrote:<br class="">

                  </div>

                  <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">

                    That’s a good find. I’m reading through the code and

                    documentation, and I’m having a little trouble

                    understanding what the difference between 0 and -1

                    for that last argument is when the second to last

                    argument is 0. I see in documentation:

                    <div class=""><br class="">

                    </div>

                    <div class=""><span style="caret-color: rgb(64, 64,

                        64); color: rgb(64, 64, 64); font-family: Lato,

                        proxima-nova, "Helvetica Neue", Arial,

                        sans-serif; font-size: 16px; background-color:

                        rgb(252, 252, 252);" class="">The third argument

                        indicates whether an Argobots execution stream

                        (ES) should be created to run the Mercury

                        progress loop. If this argument is set to 0, the

                        progress loop is going to run in the context of

                        the main ES (this should be the standard

                        scenario, unless you have a good reason for not

                        using the main ES, such as the main ES using MPI

                        primitives that could block the progress loop).

                        A value of 1 will make Margo create an ES to run

                        the Mercury progress loop. The fourth argument

                        is the number of ES to create and use for

                        executing RPC handlers. A value of 0 will make

                        Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing: border-box; font-family:

                        SFMono-Regular, Menlo, Monaco, Consolas,

                        "Liberation Mono", "Courier

                        New", Courier, monospace; white-space:

                        nowrap; max-width: 100%; background-color:

                        rgb(255, 255, 255); border: 1px solid rgb(225,

                        228, 229); padding: 2px 5px; color: rgb(231, 76,

                        60); overflow-x: auto;"><span class="pre" style="box-sizing: border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color:

                        rgb(64, 64, 64); font-family: Lato,

                        proxima-nova, "Helvetica Neue", Arial,

                        sans-serif; font-size: 16px; background-color:

                        rgb(252, 252, 252);" class="">. A value of -1

                        will make Margo execute the RPCs in the ES

                        running the progress loop. A positive value will

                        make Margo create new ESs to run the RPCs.</span></div>

                    <div class=""><font class="" size="3" face="Lato,

                        proxima-nova, Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color:

                          rgb(64, 64, 64); background-color: rgb(252,

                          252, 252);" class=""><br class="">

                        </span></font></div>

                    <span class="">What is the difference between the

                      ‘main ES’ (last two arguments 0,-1) and the 'ES

                      that called margo_init’  (last two arguments 0,0)

                      in the absence of me creating new execution

                      streams? Or maybe I’m not interpreting the

                      documentation correctly?<br class="">

                    </span></blockquote>

                  <p class=""><br class="">

                  </p>

                  <p class="">You are right, those are equivalent when

                    the next to last argument is 0 :)  The main ES and

                    progress thread ES are one and the same in that

                    case, so the RPCs go to the same place either way.</p>

                  <p class="">I've narrowed down a little further and

                    found that the stalls occur when there are 2

                    dedicated RPC ESs but not when there is just 1

                    dedicated RPC ES.  That isolates the problem

                    slightly further, in that it's not just some cost

                    associated with relaying ULTs to another thread

                    (that's happening every time in the 1 handler case

                    too), but something that only happens when multiple

                    ESes could potentially service an RPC.</p>

                  <p class="">thanks,</p>

                  <p class="">-Phil<br class="">

                    <span class=""></span></p>

                  <blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">

                    <span class="">

                      <blockquote type="cite" class="">On Nov 5, 2021,

                        at 10:29 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>

                        wrote:<br class="">

                        <br class="">

                        Srinivasan Ramesh (U. Oregon) has done some work

                        on fine-grained RPC component timing, but it's

                        not in the mainline margo tree so we'll need to

                        do a little more work to look at it.<br class="">

                        <br class="">

                        In the mean time on a hunch I found that I can

                        make the latency consistent on Cooley by

                        altering the margo_init() arguments to be (...,

                        0, -1) in server.c (meaning that no additional

                        execution streams are used at all; all mercury

                        progress and all rpc handlers are executed using

                        user-level threads in the process's

                        primary execution stream (OS thread).<br class="">

                        <br class="">

                        It's expected that there would be some more

                        jitter jumping across OS threads for RPC

                        handling, but it shouldn't be that

                        extreme, regular, or system-specific.<br class="">

                        <br class="">

                        Thanks again for the test case and the Apex

                        instrumentation; this is the sort of thing

                        that's normally really hard to isolate.<br class="">

                        <br class="">

                        thanks,<br class="">

                        <br class="">

                        -Phil<br class="">

                        <br class="">

                        On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">

                        <blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">

                          That’s extremely interesting. <br class="">

                          <br class="">

                          Are there any internal timers in Margo that

                          can tell what the delay was between the

                          server’s progress thread queueing the rpc and

                          the handler thread starting to handle it? If

                          I’m understanding <a href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmochi.readthedocs.io%2Fen%2Flatest%2Fgeneral%2F03_rpc_model.html&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411604270%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=fgQJROtM5XQkNijao33AfJkl1HDFEX42Q2M0%2BR0POoo%3D&reserved=0" originalsrc="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" shash="oKne11hHUhUv1Ux+D9kgS9epPH1xWB0TI25yTBd0L4NOHfRBMy2+aJHs6rm2DNESeRYRUFw6LyClIuNSy4KENw9yKX/ow/F9jAkorj3zgD/SoAcwfh0mZ3vR+WKI5BzRxpAgn2MHn/Lw3EsYBIDPc5qgEWwmE8dcW8ige9YEzXo=" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,

                          it seems to me like that is the most likely

                          place for non-deterministic delay to be

                          introduced by argobots in the client ->

                          server direction.<br class="">

                          <br class="">

                          I just ran a quick test where I changed the

                          number of handler threads to 5, and I saw no

                          change in behavior (still 4 and 8, not 5

                          and 10).<br class="">

                          <br class="">

                          <br class="">

                          <blockquote type="cite" class="">On Nov 4,

                            2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">

                              <carns@mcs.anl.gov></a> wrote:<br class="">

                            <br class="">

                            I have some more interesting info to

                            share from trying a few different

                            configurations.<br class="">

                            <br class="">

                            sm (on my laptop) and ofi+gni (on theta) do

                            not exhibit this behavior; they have

                            consistent performance across RPCs.<br class="">

                            <br class="">

                            ofi+verbs (on cooley) shows the same

                            thing you were seeing; the 4th and 8th RPCs

                            are slow.<br class="">

                            <br class="">

                            Based on the above, it would sound like

                            a problem with the libfabric/verbs path.

                             But on Jerome's suggestion I tried some

                            other supported transports on cooley as

                            well.  In particular I ran the same

                            benchmark (the same build in fact, I

                            just compiled in support for multiple

                            transports and cycling through them in a

                            single job script with runtime options) with

                            these combinations:<br class="">

                            <br class="">

                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•

                              ofi+verbs<br class="">

                            </div>

                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•

                              ofi+tcp<br class="">

                            </div>

                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•

                              ofi+sockets<br class="">

                            </div>

                            <div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•

                              bmi+tcp<br class="">

                            </div>

                            All of them show the same thing!  4th

                            and 8th RPCs are at least an order of

                            magnitude slower than the other RPCs.  That

                            was a surprising result.  The bmi+tcp one

                            isn't even using libfabric at all, even

                            though they are all using the same

                            underlying hardware.<br class="">

                            <br class="">

                            I'm not sure what to make of that

                            yet.  Possibly something with threading or

                            signalling?<br class="">

                            <br class="">

                            thanks,<br class="">

                            <br class="">

                            -Phil<br class="">

                            <br class="">

                            On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">

                            <blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">

                              I’m glad you were able to reproduce it on

                              a different system, thanks for letting me

                              know. I’m not sure what the overlaps

                              between Frontera and Cooley are (that

                              aren’t shared by Summit); a quick look

                              shows they are both Intel, and both FDR,

                              but there’s probably more salient details.<br class="">

                              <br class="">

                              <blockquote type="cite" class="">On Nov 2,

                                2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">

                                  <carns@mcs.anl.gov></a> wrote:<br class="">

                                <br class="">

                                Ok.  Interesting.  I didn't see anything

                                unusual in the timing on my laptop with

                                sm (other than it being a bit slow, but

                                I wasn't tuning or worrying about core

                                affinity or anything).  On Cooley, a

                                somewhat older Linux cluster with

                                InfiniBand, I see the 4th and 8th RPC

                                delay you were talking about:<br class="">

                                <br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">

                                <br class="">

                                (assuming the first is high due to

                                connection establishment)<br class="">

                                <br class="">

                                I'll check some

                                other systems/transports, but I wanted

                                to go ahead and share that I've been

                                able to reproduce what you were seeing.<br class="">

                                <br class="">

                                thanks,<br class="">

                                <br class="">

                                -Phil<br class="">

                                <br class="">

                                On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">

                                <blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">

                                  Glad that’s working now.<br class="">

                                  <br class="">

                                  It is the put_wait events, and “dur”

                                  is the right field. Those units

                                  are microseconds.<br class="">

                                  <br class="">

                                  <br class="">

                                  <blockquote type="cite" class="">On

                                    Nov 2, 2021, at 1:12 PM, Phil Carns

                                    <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">

                                      <carns@mcs.anl.gov></a> wrote:<br class="">

                                    <br class="">

                                    Thanks Philip, the "=

                                    {0};" initialization of that struct

                                    got me going.<br class="">

                                    <br class="">

                                    I can run the test case now and it

                                    is producing output in the client

                                    and server perf dirs.  Just to

                                    sanity check what to look for,

                                    I think the problem should be

                                    exhibited in the "put_wait" or

                                    maybe "do_put" trace events on the

                                    client?  For example on my laptop I

                                    see this:<br class="">

                                    <br class="">

carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep

                                    do_put trace_events.0.json<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">

{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">

carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">

{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">

                                    <br class="">

                                    I should look at the "dur" field

                                    right?  What are the units on that?<br class="">

                                    <br class="">

                                    I'll see if I can run this on a

                                    "real" system shortly.<br class="">

                                    <br class="">

                                    thanks!<br class="">

                                    <br class="">

                                    -Phil<br class="">

                                    <br class="">

                                    On 11/2/21 12:11 PM, Philip Davis

                                    wrote:<br class="">

                                    <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">

                                      Hi Phil,<br class="">

                                      <br class="">

                                      Sorry the data structures are

                                      like that; I wanted to preserve as

                                      much of the RPC size and ordering

                                      in case it ended up

                                      being important.<br class="">

                                      <br class="">

                                      I’m surprised in.odsc.size is

                                      troublesome, as I set in.odsc.size

                                      with the line `in.odsc.size

                                      = sizeof(odsc);`. I’m not sure

                                      what could be corrupting that

                                      value in the meantime.<br class="">

                                      <br class="">

                                      I don’t set

                                      in.odsc.gdim_size (which was an

                                      oversight, since that’s

                                      non-zero normally), so I’m

                                      less surprised that’s an issue. I

                                      thought I initialized `in`

                                      to zero, but I see I didn’t do

                                      that after all.<br class="">

                                      <br class="">

                                      Maybe change the line

                                      `bulk_gdim_t in;` to `bulk_gdim_t

                                      in = {0};`<br class="">

                                      <br class="">

                                      <br class="">

                                      <br class="">

                                      <blockquote type="cite" class="">On

                                        Nov 2, 2021, at 11:48 AM, Phil

                                        Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">

                                        <br class="">

                                        Awesome, thanks Philip.  It came

                                        through fine.<br class="">

                                        <br class="">

                                        I started by modifying the job

                                        script slightly to just run it

                                        on my laptop with sm (I

                                        wanted to make sure I understood

                                        the test case, and how

                                        to use apex,

                                        before trying elsewhere).  Does

                                        in.size needs to be set

                                        in client.c?  For me there is

                                        a random value in that field and

                                        it is causing the              

                                                    encoder on

                                        the forward to attempt a

                                        very large allocation.  The same

                                        might be true of gdim_size if it

                                        got past that step.  I started

                                        to alter them but then I

                                        wasn't sure what

                                        the implications were.<br class="">

                                        <br class="">

                                        (fwiw I needed

                                        to include stdlib.h in common.h,

                                        but I've hit that a couple

                                        of times recently on codes

                                        that didn't previously generate warnings;

                                        I think something in Ubuntu

                                        has gotten strict about

                                        that recently).<br class="">

                                        <br class="">

                                        thanks,<br class="">

                                        <br class="">

                                        -Phil<br class="">

                                        <br class="">

                                        <br class="">

                                        <br class="">

                                        On 11/1/21 4:51 PM, Philip Davis

                                        wrote:<br class="">

                                        <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">

                                          Hi Phil,<br class="">

                                          <br class="">

                                          I’ve attached the reproducer.

                                          I see the 4th and 8th issue on

                                          Frontera, but

                                          not Summit. Hopefully it will

                                          build and run without too

                                          much modification. Let me know

                                          if there are any issues

                                          with running it (or if the

                                          anl listserv eats the

                                          tarball, which I kind of

                                          expect).<br class="">

                                          <br class="">

                                          Thanks,<br class="">

                                          Philip<br class="">

                                          <br class="">

                                          <br class="">

                                          <blockquote type="cite" class="">On Nov 1, 2021, at

                                            11:14 AM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">

                                            <br class="">

                                            Hi Philip,<br class="">

                                            <br class="">

                                            (FYI I think the first

                                            image didn't come through

                                            in your email, but I

                                            think the others

                                            are sufficient to get

                                            across what you are seeing)<br class="">

                                            <br class="">

                                            I don't have any idea

                                            what would cause that.

                                             The recently released libfabric 1.13.2 (available

                                            in spack from

                                            the mochi-spack-packages repo)

                                            includes some fixes to the

                                            rxm provider that could

                                            be relevant to Frontera

                                            and Summit, but nothing

                                            that aligns with what you

                                            are observing.<br class="">

                                            <br class="">

                                            If it were later in

                                            the sequence (much later) I

                                            would speculate

                                            that memory allocation/deallocation cycles

                                            were eventually causing

                                            a hiccup.  We've seen

                                            something like that in the

                                            past, and it's a theory that

                                            we could then test

                                            with alternative allocators like jemalloc.  That's

                                            not memory allocation jitter

                                            that early in the run

                                            though.<br class="">

                                            <br class="">

                                            Please do share

                                            your reproducer if you

                                            don't mind!  We can try a

                                            few systems here to at

                                            least isolate if it is

                                            something peculiar

                                            to the InfiniBand path or

                                            if there is a more

                                            general problem in Margo.<br class="">

                                            <br class="">

                                            thanks,<br class="">

                                            <br class="">

                                            -Phil<br class="">

                                            <br class="">

                                            On 10/29/21 3:20 PM,

                                            Philip Davis wrote:<br class="">

                                            <blockquote type="cite" class="">Hello,<br class="">

                                              <br class="">

                                              I apologize in advance

                                              for the winding nature of

                                              this email. I’m not sure

                                              how to ask

                                              my question without explaining

                                              the story of my results

                                              some.<br class="">

                                              <br class="">

                                              I’m doing

                                              some characterization of

                                              our server performance

                                              under load, and I have

                                              a quirk

                                              of performance that I

                                              wanted to run by people to

                                              see if they make sense.

                                              My testing so far has been

                                              to iteratively send

                                              batches of RPCs

                                              using margo_iforward, and

                                              then measure the wait time

                                              until they are

                                              all complete. On the

                                              server side, handling the

                                              RPC includes

                                              a margo_bulk_transfer as a

                                              pull initiated

                                              on the server to pull (for

                                              now) 8 bytes. The payload

                                              of the RPC request

                                              is about 500 bytes, and

                                              the response payload is

                                              4 bytes.<br class="">

                                              <br class="">

                                              I’ve isolated my results

                                              down to one server rank

                                              and one client

                                              rank, because it’s an

                                              easier starting point to

                                              reason from. Below

                                              is a graph of some of

                                              my initial results.

                                              These results are from

                                              Frontera. The median times

                                              are good (a single

                                              RPC takes on the order of

                                              10s                      

                                                of microseconds, which

                                              seems fantastic). However,

                                              the outliers are fairly

                                              high (note the log scale

                                              of the y-axis). With only

                                              one RPC per timestep, for

                                              example, there is a

                                              30x spread between the

                                              median and the max.<br class="">

                                              <br class="">

                                              <img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">

                                              <br class="">

                                              I was

                                              hoping (expecting) the

                                              first timestep would be

                                              where the long

                                              replies resided, but that

                                              turned out not to be the

                                              case. Below are traces

                                              from the 1 RPC (blue) and

                                              2 RPC  (orange) per

                                              timestep cases, 5 trials

                                              of 10 timesteps for each

                                              case (normalized to fix

                                              the same                  

                                                            x-axis):<br class="">

                                              <br class="">

<PastedGraphic-6.png><br class="">

                                              <br class="">

                                              What strikes me is how

                                              consistent these

                                              results are across trials.

                                              For the 1 RPC per timestep

                                              case, the 3rd and 7th

                                              timestep are consistently slow (and

                                              the rest are fast).

                                              For the 2 RPC per timestep

                                              case, the 2nd and 4th

                                              timestep are always slow

                                              and sometimes the        

                                                10th is. These results

                                              are repeatable with very

                                              rare variation.<br class="">

                                              <br class="">

                                              For the single RPC case,

                                              I recorded some timers on

                                              the server side, and

                                              attempted to overlay them

                                              with the client

                                              side (there is

                                              some unknown offset,

                                              but probably on the order

                                              of 10s of microseconds at

                                              worst, given the pattern):<br class="">

                                              <br class="">

<PastedGraphic-7.png><br class="">

                                              <br class="">

                                              I blew up the first

                                              few timesteps of one of

                                              the trials:<br class="">

<PastedGraphic-8.png><br class="">

                                              <br class="">

                                              The different colors

                                              are different segments

                                              of the handler, but

                                              there doesn’t seem to be

                                              anything too interesting going

                                              on inside the handler. So

                                              it looks like the time is

                                              being introduced before

                                              the 3rd RPC

                                              handler starts, based on

                                              the where the gap appears

                                              on the server side.<br class="">

                                              <br class="">

                                              To try and isolate

                                              any dataspaces-specific behavior,

                                              I created a pure Margo

                                              test case that just sends

                                              a single rpc of the same

                                              size

                                              as dataspaces iteratively, whre

                                              the server side does an

                                              8-byte bulk

                                              transfer initiated by the

                                              server, and sends

                                              a response. The results

                                              are similar, except that

                                              it is now the 4th and

                                              8th timestep that are slow

                                              (and the first timestep

                                              is VERY

                                              long, presumably because

                                              rxm communication state is

being established. DataSpaces has an earlier RPC in its init that

                                              was absorbing

                                              this latency).<br class="">

                                              <br class="">

                                              I got

                                              margo profiling results

                                              for this test case:<br class="">

                                              <br class="">

                                              ```<br class="">

                                              3<br class="">

18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">

                                              0xa2a1,term_rpc<br class="">

                                              0x27b5,put_rpc<br class="">

                                              0xd320,__shutdown__<br class="">

0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">

0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">

0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">

0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">

                                              ```<br class="">

                                              <br class="">

                                              So I guess my question

                                              at this point is, is there

                                              any sensible reason why

                                              the 4th and 8th RPC sent

                                              would have a long response

                                              time? I think I’ve cleared

                                              my code on the client

                                              side and server side, so

                                              it appears to be latency

                                              being introduced

                                              by Margo, LibFabric, Argobots,

                                              or the underlying OS. I do

                                              see long

                                              timesteps occasionally after

                                              this (perhaps

                                              every 20-30 timesteps)

                                              but these are

                                              not consistent.<br class="">

                                              <br class="">

                                              One last detail: this does

                                              not happen on Summit.

                                              On summit, I see about

                                              5-7x worse single-RPC performance (250-350 microseconds per

                                              RPC), but without

                                              the intermittent long timesteps.<br class="">

                                              <br class="">

                                              I can provide the

                                              minimal test case if it

                                              would be helpful. I

                                              am using APEX

                                              for timing results,

                                              and the

                                              following dependencies with

                                              Spack:<br class="">

                                              <br class="">

                                              <a class="moz-txt-link-abbreviated

                                                moz-txt-link-freetext" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>  <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>  rdma-core@20<br class="">

                                              <br class="">

                                              Thanks,<br class="">

                                              Philip<br class="">

                                              <br class="">

                                              <br class="">

                                              <br class="">

                                              <br class="">

                                              <div class="">_______________________________________________</div>

                                              <div class="">mochi-devel

                                                mailing list</div>

                                              <br class="Apple-interchange-newline">

                                              <a class="moz-txt-link-abbreviated

                                                moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">

                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EiX9%2B4KlSbKB7w2ov4OZtE7jCugYdchESAfX0D%2B%2FK7c%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="hT2kkjAIf/zHJhvtfpngbaSAG6BZQ0v653UkY5DV3lvADy7nxpFwHs9680ZLFYC4A/ZkOaDhVaoVT0dhguoFth03P3pU/ankBuXMiNnP4KyfpClP97DwHHcU9nwM04kJb9eamI6B/jo5QLEuZAQ97t2SUt+s9UTzpXLcXc/6/7I=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">

                                              <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vyMI4AQ%2BbXY4WVJocNVN%2BYJp6H1tYsP7meHgbjriejE%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="ncm2uVm0Ugenklbk5rqBtY39R8xrytsxPVYOdy2NqnuC3nOSGasq7//KT8rW9cbT43LvCOivxAxgLyWaANsZOv0dwLdcqrgqnldIxtfDSfJzCXFTgmgI++3ODNu7UIcNUeMtmBw8YYusppzZlNxvPHAunqN44IwZIk9lYE5wtW0=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">

                                            </blockquote>

_______________________________________________<br class="">

                                            mochi-devel mailing list<br class="">

                                            <a class="moz-txt-link-abbreviated

                                              moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">

                                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411624266%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=oVLJaNkVau19KFuQsfeEpIkYUBvfpHcNBiMwy39eYiE%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="eo+nqLYWCkMMdToapGKW5gCChdEC+hnMkybaq9TTMc+BLlej8fWIrt1WeKWex1ubGz/Kolyb76ecsFMW3CL4Je2n/gnoO704YF3Or7fc73FFazgqUg3EPFQRAx3+zz5WR7a1+QiDa2HcoGRw4/KG7diKB6bVs37W+tX6YsHhBJU=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">

                                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411634259%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=XjXCEsV2TNbLfPuH7iwMuolSCtaxnJZksVRu7V%2FfGNw%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="euRQ8SoMMl/IdjKCp7PxhwDgnaOLyG1nlSdl2QPAU17a7nYFM9JBPnoh1O/jSGuGv0+KkUsZMdBbbwg0ENFlNOUiYCxv1aPqeW0whx609AFP7y+OEMFUuRm9Ir3OvZlY4jzkE5B2v68VbQSLyROuz/2i8WR42H/ni5Na2Gn9noc=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">

                                          </blockquote>

                                          <br class="">

                                        </blockquote>

                                      </blockquote>

                                      <br class="">

                                    </blockquote>

                                  </blockquote>

                                  <br class="">

                                </blockquote>

                              </blockquote>

                              <br class="">

                            </blockquote>

                          </blockquote>

                          <br class="">

                        </blockquote>

                      </blockquote>

                      <br class="">

                    </span></blockquote>

                </div>

              </div>

            </blockquote>

          </div>

          <br class="">

        </div>

      </div>

    </blockquote>

  </body>

</html>