<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Thanks Philip, the "= {0};" initialization of that struct got me
      going.</p>
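    <p>For the record, the change on my end was just the zero-initialization
      you suggested, i.e. roughly (field names as in your reproducer
      snippets):</p>
    <pre>/* zero-initialize the RPC input so any fields the client never sets
 * explicitly (e.g. in.odsc.gdim_size) start at 0 instead of stack
 * garbage when the struct gets encoded for the forward */
bulk_gdim_t in = {0};

in.odsc.size = sizeof(odsc);
/* ... fill in the remaining fields as before, then forward the RPC ... */</pre>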
    <p>I can run the test case now and it is producing output in the
      client and server perf dirs.  Just to sanity check what to look
      for, I think the problem should be exhibited in the "put_wait" or
      maybe "do_put" trace events on the client?  For example on my
      laptop I see this:</p>
    <pre>
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},

carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},
</pre>
    <p>I should look at the "dur" field, right?  What are the units on
      that?</p>
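    <p>(My guess is that these follow the usual Chrome trace-event
      conventions, in which case "ts" and "dur" would both be in
      microseconds, but correct me if APEX does something different
      there.)</p>
    <p>In case it's useful, here's a rough little scanner for summarizing
      the "dur" values for a given event name.  It just pattern-matches
      lines rather than parsing the JSON properly, so treat it as a
      sketch:</p>
    <pre>/* dur_stats.c: crude summary of the "dur" fields for one event name in
 * an APEX trace_events file.  Scans lines textually; not a real JSON
 * parser. */
#include &lt;stdio.h>
#include &lt;stdlib.h>
#include &lt;string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s &lt;trace_events.json> &lt;event_name>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096], pattern[256];
    snprintf(pattern, sizeof(pattern), "\"name\":\"%s\"", argv[2]);

    double sum = 0, min = 0, max = 0;
    long count = 0;
    while (fgets(line, sizeof(line), f)) {
        if (!strstr(line, pattern)) continue;
        char *p = strstr(line, "\"dur\":");
        if (!p) continue;
        double dur = strtod(p + 6, NULL);   /* duration value for this event */
        if (count == 0 || dur &lt; min) min = dur;
        if (count == 0 || dur > max) max = dur;
        sum += dur;
        count++;
    }
    fclose(f);

    if (count)
        printf("%s: n=%ld min=%.1f max=%.1f mean=%.1f\n",
               argv[2], count, min, max, sum / count);
    return 0;
}</pre>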
    <p>I'll see if I can run this on a "real" system shortly.<br>
    </p>
    <p>thanks!</p>
    <p>-Phil<br>
    </p>
    <div class="moz-cite-prefix">On 11/2/21 12:11 PM, Philip Davis
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu">
      
      Hi Phil,
      <div class=""><br class="">
      </div>
      <div class="">Sorry the data structures are like that; I wanted to
        preserve as much of the RPC size and ordering in case it ended
        up being important.</div>
      <div class=""><br class="">
      </div>
      <div class="">I’m surprised in.odsc.size is troublesome, as I
        set in.odsc.size with the line `in.odsc.size = sizeof(odsc);`.
        I’m not sure what could be corrupting that value in the
        meantime.</div>
      <div class=""><br class="">
      </div>
      <div class="">I don’t set in.odsc.gdim_size (which was an
        oversight, since that’s non-zero normally), so I’m less
        surprised that’s an issue. I thought I initialized `in` to zero,
        but I see I didn’t do that after all.</div>
      <div class=""><br class="">
      </div>
      <div class="">Maybe change the line `bulk_gdim_t in;` to
        `bulk_gdim_t in = {0};`</div>
      <div class=""><br class="">
      </div>
      <div class=""><br class="">
      </div>
      <div class="">
        <div><br class="">
          <blockquote type="cite" class="">
            <div class="">On Nov 2, 2021, at 11:48 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
              wrote:</div>
            <br class="Apple-interchange-newline">
            <div class="">
              <div class="">
                <p class="">Awesome, thanks Philip.  It came through
                  fine.</p>
                <p class="">I started by modifying the job script
                  slightly to just run it on my laptop with sm (I wanted
                  to make sure I understood the test case, and how to
                  use apex, before trying elsewhere).  Does in.size
                  need to be set in client.c?  For me there is a random
                  value in that field and it is causing the encoder on
                  the forward to attempt a very large allocation.  The
                  same might be true of gdim_size if it got past that
                  step.  I started to alter them but then I wasn't sure
                  what the implications were.<br class="">
                </p>
                <p class="">(fwiw I needed to include stdlib.h in
                  common.h, but I've hit that a couple of times recently
                  on codes that didn't previously generate warnings; I
                  think something in Ubuntu has gotten strict about that
                  recently).</p>
                <p class="">thanks,</p>
                <p class="">-Phil<br class="">
                </p>
                <p class=""><br class="">
                </p>
                <div class="moz-cite-prefix">On 11/1/21 4:51 PM, Philip
                  Davis wrote:<br class="">
                </div>
                <blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
                  <div class="" style="word-wrap:break-word;
                    line-break:after-white-space">
                    Hi Phil,</div>
                  <div class="" style="word-wrap:break-word;
                    line-break:after-white-space">
                    <div class=""><br class="">
                    </div>
                    <div class="">I’ve attached the reproducer. I see
                      the 4th and 8th issue on Frontera, but not Summit.
                      Hopefully it will build and run without too much
                      modification. Let me know if there are any issues
                      with running it (or if the anl listserv eats the
                      tarball, which I kind of expect).</div>
                    <div class=""><br class="">
                    </div>
                    <div class="">Thanks,</div>
                    <div class="">Philip</div>
                    <div class=""><br class="">
                      <div class=""><br class="">
                        <blockquote type="cite" class="">
                          <div class="">On Nov 1, 2021, at 11:14 AM,
                            Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
                            wrote:</div>
                          <br class="x_Apple-interchange-newline">
                          <div class="">
                            <div class="">
                              <p class="">Hi Philip,</p>
                              <p class="">(FYI I think the first image
                                didn't come through in your email, but I
                                think the others are sufficient to get
                                across what you are seeing)</p>
                              <p class="">I don't have any idea what
                                would cause that.  The recently released
                                libfabric 1.13.2 (available in spack
                                from the mochi-spack-packages repo)
                                includes some fixes to the rxm provider
                                that could be relevant to Frontera and
                                Summit, but nothing that aligns with
                                what you are observing.</p>
                              <p class="">If it were later in the
                                sequence (much later) I would speculate
                                that memory allocation/deallocation
                                cycles were eventually causing a
                                hiccup.  We've seen something like that
                                in the past, and it's a theory that we
                                could then test with alternative
                                 allocators like jemalloc.  That's not
                                 likely to be memory allocation jitter
                                 that early in the run, though.<br class="">
                              </p>
                              <p class="">Please do share your
                                reproducer if you don't mind!  We can
                                try a few systems here to at least
                                isolate if it is something peculiar to
                                the InfiniBand path or if there is a
                                more general problem in Margo.</p>
                              <p class="">thanks,</p>
                              <p class="">-Phil<br class="">
                              </p>
                              <div class="x_moz-cite-prefix">On 10/29/21
                                3:20 PM, Philip Davis wrote:<br class="">
                              </div>
                              <blockquote type="cite" class="">
                                <div class="">Hello,</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I apologize in advance for
                                  the winding nature of this email. I’m
                                  not sure how to ask my question
                                  without explaining the story of my
                                  results some.</div>
                                <div class=""><br class="">
                                </div>
                                I’m doing some characterization of our
                                server performance under load, and I
                                have a performance quirk that I wanted
                                to run by people to see if it makes
                                sense. My testing so far has been to
                                iteratively send batches of RPCs using
                                margo_iforward, and then measure the
                                wait time until they are all complete.
                                On the server side, handling the RPC
                                includes a margo_bulk_transfer initiated
                                on the server to pull (for now) 8 bytes.
                                The payload of
                                the RPC request is about 500 bytes, and
                                the response payload is 4 bytes.
                                <div class=""><br class="">
                                </div>
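                                <div class="">In rough outline, each
                                  timestep on the client looks like the
                                  sketch below (the margo/ABT calls are
                                  the real API; put_id and RPCS_PER_STEP
                                  are stand-ins for the actual test-case
                                  values):</div>
                                <pre class="">#include &lt;abt.h>
#include &lt;margo.h>

#define RPCS_PER_STEP 2   /* stand-in: varied per run in the real test */

/* One client timestep: issue a batch of nonblocking forwards, then wait
 * for all of them and return the wait time being measured.  "in" points
 * at the ~500-byte request struct; error checking omitted. */
static double do_timestep(margo_instance_id mid, hg_addr_t server_addr,
                          hg_id_t put_id, void *in)
{
    hg_handle_t   handles[RPCS_PER_STEP];
    margo_request reqs[RPCS_PER_STEP];

    double t0 = ABT_get_wtime();
    for (int i = 0; i &lt; RPCS_PER_STEP; i++) {
        margo_create(mid, server_addr, put_id, &handles[i]);
        margo_iforward(handles[i], in, &reqs[i]);   /* ~500-byte request */
    }
    for (int i = 0; i &lt; RPCS_PER_STEP; i++) {
        margo_wait(reqs[i]);                        /* ~4-byte response */
        margo_destroy(handles[i]);
    }
    return ABT_get_wtime() - t0;   /* the per-timestep wait time plotted below */
}</pre>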
                                <div class="">I’ve isolated my results
                                  down to one server rank and one client
                                  rank, because it’s an easier starting
                                  point to reason from. Below is a graph
                                  of some of my initial results. These
                                  results are from Frontera. The median
                                  times are good (a single RPC takes on
                                  the order of 10s of microseconds,
                                  which seems fantastic). However, the
                                  outliers are fairly high (note the log
                                  scale of the y-axis). With only one
                                  RPC per timestep, for example, there
                                  is a 30x spread between the median and
                                  the max.</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I was hoping (expecting)
                                  the first timestep would be where the
                                  long replies resided, but that turned
                                  out not to be the case. Below are
                                  traces from the 1 RPC (blue) and 2 RPC
                                   (orange) per timestep cases, 5 trials
                                  of 10 timesteps for each case
                                  (normalized to fit the same x-axis):</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><span id="x_cid:part1.PbJ8iWDr.ROdFhNeX@mcs.anl.gov" class=""><PastedGraphic-6.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">What strikes me is how
                                  consistent these results are across
                                  trials. For the 1 RPC per timestep
                                  case, the 3rd and 7th timestep are
                                  consistently<i class=""> </i><span class="" style="font-style:normal">slow
                                    (and the rest are fast). For the 2
                                    RPC per timestep case, the 2nd and
                                    4th timestep are always slow and
                                    sometimes the 10th is. These results
                                    are repeatable with very rare
                                    variation.</span></div>
                                <div class=""><span class="" style="font-style:normal"><br class="">
                                  </span></div>
                                <div class="">For the single RPC case, I
                                  recorded some timers on the server
                                  side, and attempted to overlay them
                                  with the client side (there is some
                                  unknown offset, but probably on the
                                  order of 10s of microseconds at worst,
                                  given the pattern):</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><span id="x_cid:part2.1KhzfDFv.Yq10fEh8@mcs.anl.gov" class=""><PastedGraphic-7.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I blew up the first few
                                  timesteps of one of the trials:</div>
                                <div class=""><span id="x_cid:part3.B6Byp6pi.aa0hKH2Q@mcs.anl.gov" class=""><PastedGraphic-8.png></span></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">The different colors are
                                  different segments of the handler, but
                                  there doesn’t seem to be anything too
                                  interesting going on inside the
                                  handler. So it looks like the time is
                                  being introduced before the 3rd RPC
                                  handler starts, based on where the
                                  gap appears on the server side.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">To try and isolate any
                                  dataspaces-specific behavior, I
                                  created a pure Margo test case that
                                  just sends a single rpc of the same
                                  size as dataspaces iteratively, whre
                                  the server side does an 8-byte bulk
                                  transfer initiated by the server, and
                                  sends a response. The results are
                                  similar, except that it is now the 4th
                                  and 8th timestep that are slow (and
                                  the first timestep is VERY long,
                                  presumably because rxm communication
                                  state is being established. DataSpaces
                                  has an earlier RPC in its init that
                                  was absorbing this latency).</div>
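                                <div class="">For reference, the handler
                                  in that pure Margo test case is shaped
                                  roughly like the sketch below (the
                                  margo calls are the real API; the
                                  input type and its bulk_handle field
                                  are stand-ins for the actual test-case
                                  definitions):</div>
                                <pre class="">/* Handler sketch: get the input, pull 8 bytes from the client with a
 * server-initiated RDMA transfer, then send a small response. */
static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);

    put_in_t in;                            /* stand-in input type (~500 bytes) */
    margo_get_input(handle, &in);

    uint64_t val;
    void *buf = &val;
    hg_size_t size = sizeof(val);           /* the 8-byte pull target */
    hg_bulk_t local_bulk;
    margo_bulk_create(mid, 1, &buf, &size, HG_BULK_WRITE_ONLY, &local_bulk);

    /* pull initiated by the server from the client's exposed region */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk_handle, 0,
                        local_bulk, 0, size);

    int32_t out = 0;                        /* ~4-byte response payload */
    margo_respond(handle, &out);

    margo_bulk_free(local_bulk);
    margo_free_input(handle, &in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)</pre>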
                                <div class=""><br class="">
                                </div>
                                <div class="">I got margo profiling
                                  results for this test case:</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">```</div>
                                <div class="">
                                  <div class="">3</div>
                                  <div class="">18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573</div>
                                  <div class="">0xa2a1,term_rpc</div>
                                  <div class="">0x27b5,put_rpc</div>
                                  <div class="">0xd320,__shutdown__</div>
                                  <div class="">0x27b5
,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0</div>
                                  <div class="">0x27b5
                                    ,0;0.041241646,200.000000000, 0;</div>
                                  <div class="">0xa2a1
,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0</div>
                                  <div class="">0xa2a1
                                    ,0;0.000009298,1.000000000, 0;</div>
                                </div>
                                <div class="">```</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">So I guess my question at
                                  this point is, is there any sensible
                                  reason why the 4th and 8th RPC sent
                                  would have a long response time? I
                                  think I’ve cleared my code on the
                                  client side and server side, so it
                                  appears to be latency being introduced
                                  by Margo, LibFabric, Argobots, or the
                                  underlying OS. I do see long timesteps
                                  occasionally after this (perhaps every
                                  20-30 timesteps) but these are not
                                  consistent.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">One last detail: this does
                                  not happen on Summit. On Summit, I see
                                  about 5-7x worse single-RPC
                                  performance (250-350 microseconds per
                                  RPC), but without the intermittent
                                  long timesteps.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">I can provide the minimal
                                  test case if it would be helpful. I am
                                  using APEX for timing results, and the
                                  following dependencies with Spack:</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a>
                                   <a class="moz-txt-link-freetext
                                    x_moz-txt-link-abbreviated" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a>
                                   rdma-core@20</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">Thanks,</div>
                                <div class="">Philip</div>
                                <div class=""><br class="">
                                </div>
                                <div class=""><br class="">
                                </div>
                                <br class="">
                                <fieldset class="x_mimeAttachmentHeader"></fieldset>
                                <pre class="x_moz-quote-pre">_______________________________________________
mochi-devel mailing list
<a class="moz-txt-link-freetext x_moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a>
<a class="x_moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202547739%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=dpAo6xraOBJUBYZl9u%2FTKlgE0SxgsPHA%2BkcLHGUlRzE%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="NyZqvUYuffwdW6QtaT9uKVTegoYHMu39M6JpDeqx7fxGDPocMr2brCxbIba4j9baATNThhsKwln6QXSPijuHGJf2KrU+LichGX9Kdax4uDisRyLEp0lzOKluyTYbdyg1fJMcqJuL52bD6IwhqBMUfKIoq2ZkeHNi7WvAh2YEJpQ=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a>
<a class="x_moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202557723%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AkqAutFJvngA4OtsQa4vLGyAgddum5BXjpwqgxXqjtw%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="LPhxsc+GTgGuYzh8ZXI5vXzHVu/W1QXzh8vKlAOtBcF7lotuqJp6nstqkJ7ARZPDdzbY5jgLmdgMOdYnoDYKrUlWIooD1XUAT1CtRet19gaB+7kcjzeQT7cJxbV64r75fcLo0mO/uE/W+Z8cI3WI0g3BKz89ZW8QejDJ0FwF1eE=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a>
</pre>
                              </blockquote>
                            </div>
_______________________________________________<br class="">
                            mochi-devel mailing list<br class="">
                            <a href="mailto:mochi-devel@lists.mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202567719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AOe9pjQ%2FoA3ZJTbxO9EYcxlBq%2FZu8%2Bs%2BXx%2BM8Mv2v0U%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="A8T+t+K8Gd/85hkgtmy5s26CUuyTKh/aQ9iUyXotXGE1c3HXYuVfABE9Ok2V+X2Kot0ryOFQFvhjy5EzXi/YP4d7DTId4nazL1YhZKNxViU6q43HauJ6WRjzM9kzEcnl88MgzLVs5dXYtVRV7DMMDwDkdhBa9FtAB2lzw9QX5iY=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
                            <a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C7f152c3f72534603d01008d99e184d07%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637714650202567719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=pWkmT6aXdad0gI%2B5nEWwp6MSHnBnn51TijDRd29oQeo%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="XIvXb1ct7GRIthcN3ti6mzEyUz7VfxRy+jSloufhrTDrBlz5XCxykRoI8K0VlrPv0Zb3saDmSkitqnYnHZoa9DLzI6wHuFUb5ZJbXN4G+YkBxiWt7pvkSu9WnrdT7DYJMkFg5OYzHJUR83cOOaTlv3tcieEQNUj3NdQifmk23VM=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
                          </div>
                        </blockquote>
                      </div>
                      <br class="">
                    </div>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
        <br class="">
      </div>
    </blockquote>
  </body>
</html>