<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Thanks Philip, the "= {0};" initialization of that struct got me
      going.</p>
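    <p>For the record, the change on my end was just the zero-initialization
      you suggested, i.e. roughly (field names as in your reproducer
      snippets):</p>
    <pre>/* zero-initialize the RPC input so any fields the client never sets
 * explicitly (e.g. in.odsc.gdim_size) start at 0 instead of stack
 * garbage when the struct gets encoded for the forward */
bulk_gdim_t in = {0};

in.odsc.size = sizeof(odsc);
/* ... fill in the remaining fields as before, then forward the RPC ... */</pre>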
    <p>I can run the test case now and it is producing output in the
      client and server perf dirs.  Just to sanity check what to look
      for, I think the problem should be exhibited in the "put_wait" or
      maybe "do_put" trace events on the client?  For example on my
      laptop I see this:</p>
    <pre>
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},

carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},
</pre>
    <p>I should look at the "dur" field, right?  What are the units on
      that?</p>
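    <p>(My guess is that these follow the usual Chrome trace-event
      conventions, in which case "ts" and "dur" would both be in
      microseconds, but correct me if APEX does something different
      there.)</p>
    <p>In case it's useful, here's a rough little scanner for summarizing
      the "dur" values for a given event name.  It just pattern-matches
      lines rather than parsing the JSON properly, so treat it as a
      sketch:</p>
    <pre>/* dur_stats.c: crude summary of the "dur" fields for one event name in
 * an APEX trace_events file.  Scans lines textually; not a real JSON
 * parser. */
#include &lt;stdio.h>
#include &lt;stdlib.h>
#include &lt;string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s &lt;trace_events.json> &lt;event_name>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096], pattern[256];
    snprintf(pattern, sizeof(pattern), "\"name\":\"%s\"", argv[2]);

    double sum = 0, min = 0, max = 0;
    long count = 0;
    while (fgets(line, sizeof(line), f)) {
        if (!strstr(line, pattern)) continue;
        char *p = strstr(line, "\"dur\":");
        if (!p) continue;
        double dur = strtod(p + 6, NULL);   /* duration value for this event */
        if (count == 0 || dur &lt; min) min = dur;
        if (count == 0 || dur > max) max = dur;
        sum += dur;
        count++;
    }
    fclose(f);

    if (count)
        printf("%s: n=%ld min=%.1f max=%.1f mean=%.1f\n",
               argv[2], count, min, max, sum / count);
    return 0;
}</pre>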
    <p>I'll see if I can run this on a "real" system shortly.<br>
    </p>
    <p>thanks!</p>
    <p>-Phil<br>
    </p>
    <div class="moz-cite-prefix">On 11/2/21 12:11 PM, Philip Davis
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu">
      
      Hi Phil,
      <div class=""><br class="">
      </div>
      <div class="">Sorry the data structures are like that; I wanted to
        preserve as much of the RPC size and ordering in case it ended
        up being important.</div>
      <div class=""><br class="">
      </div>
      <div class="">I’m surprised in.odsc.size is troublesome, as I
        set in.odsc.size with the line `in.odsc.size = sizeof(odsc);`.
        I’m not sure what could be corrupting that value in the
        meantime.</div>
      <div class=""><br class="">
      </div>
      <div class="">I don’t set in.odsc.gdim_size (which was an
        oversight, since that’s non-zero normally), so I’m less
        surprised that’s an issue. I thought I initialized `in` to zero,
        but I see I didn’t do that after all.</div>
      <div class=""><br class="">
      </div>
      <div class="">Maybe change the line `bulk_gdim_t in;` to
        `bulk_gdim_t in = {0};`</div>
      <div class=""><br class="">
      </div>
      <div class=""><br class="">
      </div>
      <div class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">On Nov 2, 2021, at 11:48 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
              wrote:</div>
            <br class="Apple-interchange-newline">
            <div class="">
              <div class="">
                <p class="">Awesome, thanks Philip.  It came through
                  fine.</p>
                <p class="">I started by modifying the job script
                  slightly to just run it on my laptop with sm (I wanted
                  to make sure I understood the test case, and how to
                  use apex, before trying elsewhere).  Does in.size
                  need to be set in client.c?  For me there is a random
                  value in that field and it is causing the encoder on
                  the forward to attempt a very large allocation.  The
                  same might be true of gdim_size if it got past that
                  step.  I started to alter them but then I wasn't sure
                  what the implications were.<br class="">
                </p>
                <p class="">(fwiw I needed to include stdlib.h in
                  common.h, but I've hit that a couple of times recently
                  on codes that didn't previously generate warnings; I
                  think something in Ubuntu has gotten strict about that
                  recently).</p>
                <p class="">thanks,</p>
                <p class="">-Phil<br class="">
                </p>
                <p class=""><br class="">
                </p>
                <div class="moz-cite-prefix">On 11/1/21 4:51 PM, Philip
                  Davis wrote:<br class="">
                </div>
                <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
                  <div class="" style="word-wrap:break-word;
                    line-break:after-white-space">
                    Hi Phil,</div>
                  <div class="" style="word-wrap:break-word;
                    line-break:after-white-space">
                    <div class=""><br class="">
                    </div>
                    <div class="">I’ve attached the reproducer. I see
                      the 4th and 8th issue on Frontera, but not Summit.
                      Hopefully it will build and run without too much
                      modification. Let me know if there are any issues
                      with running it (or if the anl listserv eats the
                      tarball, which I kind of expect).</div>
                    <div class=""><br class="">
                    </div>
                    <div class="">Thanks,</div>
                    <div class="">Philip</div>
                    <div class=""><br class="">
                      <div class=""><br class="">
                        <blockquote type="cite" class="">
                          <div class="">On Nov 1, 2021, at 11:14 AM,
                            Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                            wrote:</div>
                          <br class="x_Apple-interchange-newline">
                          <div class="">
                            <div class="">
                              <p class="">Hi Philip,</p>
                              <p class="">(FYI I think the first image
                                didn't come through in your email, but I
                                think the others are sufficient to get
                                across what you are seeing)</p>
                              <p class="">I don't have any idea what
                                would cause that.  The recently released
                                libfabric 1.13.2 (available in spack
                                from the mochi-spack-packages repo)
                                includes some fixes to the rxm provider
                                that could be relevant to Frontera and
                                Summit, but nothing that aligns with
                                what you are observing.</p>
                              <p class="">If it were later in the
                                sequence (much later) I would speculate
                                that memory allocation/deallocation
                                cycles were eventually causing a
                                hiccup.  We've seen something like that
                                in the past, and it's a theory that we
                                could then test with alternative
                                 allocators like jemalloc.  That's not
                                 likely to be memory allocation jitter
                                 that early in the run, though.<br class="">
                              </p>
                              <p class="">Please do share your
                                reproducer if you don't mind!  We can
                                try a few systems here to at least
                                isolate if it is something peculiar to
                                the InfiniBand path or if there is a
                                more general problem in Margo.</p>
                              <p class="">thanks,</p>
                              <p class="">-Phil<br class="">
                              </p>
                              <div class="x_moz-cite-prefix">On 10/29/21
                                3:20 PM, Philip Davis wrote:<br class="">
                              </div>
                              <blockquote type="cite" class="">
                                <div class="">Hello,</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I apologize in advance for
                                  the winding nature of this email. I’m
                                  not sure how to ask my question
                                  without explaining the story of my
                                  results some.</div>
                                <div class=""><br class="">
                                </div>
                                I’m doing some characterization of our
                                server performance under load, and I
                                have a performance quirk that I wanted
                                to run by people to see if it makes
                                sense. My testing so far has been to
                                iteratively send batches of RPCs using
                                margo_iforward, and then measure the
                                wait time until they are all complete.
                                On the server side, handling the RPC
                                includes a margo_bulk_transfer initiated
                                on the server to pull (for now) 8 bytes.
                                The payload of
                                the RPC request is about 500 bytes, and
                                the response payload is 4 bytes.
                                <div class=""><br class="">
                                </div>
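                                <div class="">In rough outline, each
                                  timestep on the client looks like the
                                  sketch below (the margo/ABT calls are
                                  the real API; put_id and RPCS_PER_STEP
                                  are stand-ins for the actual test-case
                                  values):</div>
                                <pre class="">#include &lt;abt.h>
#include &lt;margo.h>

#define RPCS_PER_STEP 2   /* stand-in: varied per run in the real test */

/* One client timestep: issue a batch of nonblocking forwards, then wait
 * for all of them and return the wait time being measured.  "in" points
 * at the ~500-byte request struct; error checking omitted. */
static double do_timestep(margo_instance_id mid, hg_addr_t server_addr,
                          hg_id_t put_id, void *in)
{
    hg_handle_t   handles[RPCS_PER_STEP];
    margo_request reqs[RPCS_PER_STEP];

    double t0 = ABT_get_wtime();
    for (int i = 0; i &lt; RPCS_PER_STEP; i++) {
        margo_create(mid, server_addr, put_id, &handles[i]);
        margo_iforward(handles[i], in, &reqs[i]);   /* ~500-byte request */
    }
    for (int i = 0; i &lt; RPCS_PER_STEP; i++) {
        margo_wait(reqs[i]);                        /* ~4-byte response */
        margo_destroy(handles[i]);
    }
    return ABT_get_wtime() - t0;   /* the per-timestep wait time plotted below */
}</pre>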
                                <div class="">I’ve isolated my results
                                  down to one server rank and one client
                                  rank, because it’s an easier starting
                                  point to reason from. Below is a graph
                                  of some of my initial results. These
                                  results are from Frontera. The median
                                  times are good (a single RPC takes on
                                  the order of 10s of microseconds,
                                  which seems fantastic). However, the
                                  outliers are fairly high (note the log
                                  scale of the y-axis). With only one
                                  RPC per timestep, for example, there
                                  is a 30x spread between the median and
                                  the max.</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I was hoping (expecting)
                                  the first timestep would be where the
                                  long replies resided, but that turned
                                  out not to be the case. Below are
                                  traces from the 1 RPC (blue) and 2 RPC
                                   (orange) per timestep cases, 5 trials
                                  of 10 timesteps for each case
                                  (normalized to fit the same x-axis):</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><span id="x_cid:part1.PbJ8iWDr.ROdFhNeX@mcs.anl.gov" class=""><PastedGraphic-6.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">What strikes me is how
                                  consistent these results are across
                                  trials. For the 1 RPC per timestep
                                  case, the 3rd and 7th timestep are
                                  consistently<i class=""> </i><span class="" style="font-style:normal">slow
                                    (and the rest are fast). For the 2
                                    RPC per timestep case, the 2nd and
                                    4th timestep are always slow and
                                    sometimes the 10th is. These results
                                    are repeatable with very rare
                                    variation.</span></div>
                                <div class=""><span class="" style="font-style:normal"><br class="">
                                  </span></div>
                                <div class="">For the single RPC case, I
                                  recorded some timers on the server
                                  side, and attempted to overlay them
                                  with the client side (there is some
                                  unknown offset, but probably on the
                                  order of 10s of microseconds at worst,
                                  given the pattern):</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><span id="x_cid:part2.1KhzfDFv.Yq10fEh8@mcs.anl.gov" class=""><PastedGraphic-7.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I blew up the first few
                                  timesteps of one of the trials:</div>
                                <div class=""><span id="x_cid:part3.B6Byp6pi.aa0hKH2Q@mcs.anl.gov" class=""><PastedGraphic-8.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">The different colors are
                                  different segments of the handler, but
                                  there doesn’t seem to be anything too
                                  interesting going on inside the
                                  handler. So it looks like the time is
                                  being introduced before the 3rd RPC
                                  handler starts, based on where the
                                  gap appears on the server side.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">To try and isolate any
                                  dataspaces-specific behavior, I
                                  created a pure Margo test case that
                                  just sends a single rpc of the same
                                  size as dataspaces iteratively, whre
                                  the server side does an 8-byte bulk
                                  transfer initiated by the server, and
                                  sends a response. The results are
                                  similar, except that it is now the 4th
                                  and 8th timestep that are slow (and
                                  the first timestep is VERY long,
                                  presumably because rxm communication
                                  state is being established. DataSpaces
                                  has an earlier RPC in its init that
                                  was absorbing this latency).</div>
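                                <div class="">For reference, the handler
                                  in that pure Margo test case is shaped
                                  roughly like the sketch below (the
                                  margo calls are the real API; the
                                  input type and its bulk_handle field
                                  are stand-ins for the actual test-case
                                  definitions):</div>
                                <pre class="">/* Handler sketch: get the input, pull 8 bytes from the client with a
 * server-initiated RDMA transfer, then send a small response. */
static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);

    put_in_t in;                            /* stand-in input type (~500 bytes) */
    margo_get_input(handle, &in);

    uint64_t val;
    void *buf = &val;
    hg_size_t size = sizeof(val);           /* the 8-byte pull target */
    hg_bulk_t local_bulk;
    margo_bulk_create(mid, 1, &buf, &size, HG_BULK_WRITE_ONLY, &local_bulk);

    /* pull initiated by the server from the client's exposed region */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk_handle, 0,
                        local_bulk, 0, size);

    int32_t out = 0;                        /* ~4-byte response payload */
    margo_respond(handle, &out);

    margo_bulk_free(local_bulk);
    margo_free_input(handle, &in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)</pre>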
                                <div class=""><br class="">
                                </div>
                                <div class="">I got margo profiling
                                  results for this test case:</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">```</div>
                                <div class="">
                                  <div class="">3</div>
                                  <div class="">18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573</div>
                                  <div class="">0xa2a1,term_rpc</div>
                                  <div class="">0x27b5,put_rpc</div>
                                  <div class="">0xd320,__shutdown__</div>
                                  <div class="">0x27b5
,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0</div>
                                  <div class="">0x27b5
                                    ,0;0.041241646,200.000000000, 0;</div>
                                  <div class="">0xa2a1
,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0</div>
                                  <div class="">0xa2a1
                                    ,0;0.000009298,1.000000000, 0;</div>
                                </div>
                                <div class="">```</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">So I guess my question at
                                  this point is, is there any sensible
                                  reason why the 4th and 8th RPC sent
                                  would have a long response time? I
                                  think I’ve cleared my code on the
                                  client side and server side, so it
                                  appears to be latency being introduced
                                  by Margo, LibFabric, Argobots, or the
                                  underlying OS. I do see long timesteps
                                  occasionally after this (perhaps every
                                  20-30 timesteps) but these are not
                                  consistent.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">One last detail: this does
                                  not happen on Summit. On Summit, I see
                                  about 5-7x worse single-RPC
                                  performance (250-350 microseconds per
                                  RPC), but without the intermittent
                                  long timesteps.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I can provide the minimal
                                  test case if it would be helpful. I am
                                  using APEX for timing results, and the
                                  following dependencies with Spack:</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>
                                   rdma-core@20</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">Thanks,</div>
                                <div class="">Philip</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><br class="">
                                </div>
                                <br class="">
                                <fieldset class="x_mimeAttachmentHeader"></fieldset>
                                <pre class="x_moz-quote-pre">_______________________________________________
mochi-devel mailing list
<a class="moz-txt-link-freetext x_moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a>
<a class="x_moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202547739%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=dpAo6xraOBJUBYZl9u%2FTKlgE0SxgsPHA%2BkcLHGUlRzE%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="NyZqvUYuffwdW6QtaT9uKVTegoYHMu39M6JpDeqx7fxGDPocMr2brCxbIba4j9baATNThhsKwln6QXSPijuHGJf2KrU+LichGX9Kdax4uDisRyLEp0lzOKluyTYbdyg1fJMcqJuL52bD6IwhqBMUfKIoq2ZkeHNi7WvAh2YEJpQ=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a>
<a class="x_moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202557723%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AkqAutFJvngA4OtsQa4vLGyAgddum5BXjpwqgxXqjtw%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="LPhxsc+GTgGuYzh8ZXI5vXzHVu/W1QXzh8vKlAOtBcF7lotuqJp6nstqkJ7ARZPDdzbY5jgLmdgMOdYnoDYKrUlWIooD1XUAT1CtRet19gaB+7kcjzeQT7cJxbV64r75fcLo0mO/uE/W+Z8cI3WI0g3BKz89ZW8QejDJ0FwF1eE=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a>
</pre>
                              </blockquote>
                            </div>
_______________________________________________<br class="">
                            mochi-devel mailing list<br class="">
                            <a href="mailto:mochi-devel@lists.mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202567719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AOe9pjQ%2FoA3ZJTbxO9EYcxlBq%2FZu8%2Bs%2BXx%2BM8Mv2v0U%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="A8T+t+K8Gd/85hkgtmy5s26CUuyTKh/aQ9iUyXotXGE1c3HXYuVfABE9Ok2V+X2Kot0ryOFQFvhjy5EzXi/YP4d7DTId4nazL1YhZKNxViU6q43HauJ6WRjzM9kzEcnl88MgzLVs5dXYtVRV7DMMDwDkdhBa9FtAB2lzw9QX5iY=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202567719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=pWkmT6aXdad0gI%2B5nEWwp6MSHnBnn51TijDRd29oQeo%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="XIvXb1ct7GRIthcN3ti6mzEyUz7VfxRy+jSloufhrTDrBlz5XCxykRoI8K0VlrPv0Zb3saDmSkitqnYnHZoa9DLzI6wHuFUb5ZJbXN4G+YkBxiWt7pvkSu9WnrdT7DYJMkFg5OYzHJUR83cOOaTlv3tcieEQNUj3NdQifmk23VM=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                          </div>
                        </blockquote>
                      </div>
                      <br class="">
                    </div>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
        <br class="">
      </div>
    </blockquote>
  </body>
</html>