<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
One more thing: I tried the same test case with a non-trivial size for the bulk pull (8 MB) just to shift the timing somewhat. I still see the extra time appear on the 4th and 8th timesteps. I had wondered if the issue was that the timestep period was small enough
 that some background task was not completing in time, but that does not seem to be the case:
<div class=""><br class="">
<div class="">
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000, "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000, "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000, "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000, "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000, "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000, "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000, "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000, "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000, "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000, "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>
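<div class=""><br class="">
</div>
<div class="">To make the pattern easier to check than by eyeballing the "dur" column, here is a minimal Python sketch that parses the put_wait lines above and flags timesteps that are well above the median. The helper names (put_wait_durations, slow_timesteps) and the 1.25x cutoff are my own assumptions for illustration and are not part of APEX or Margo; the first timestep is left out of the median because it likely includes connection setup.</div>
<pre class="">
```python
import json
from statistics import median

# put_wait events copied from the trace above (one JSON object per line,
# each followed by a trailing comma as written by the tracer).
TRACE = """
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000, "dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000, "dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000, "dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000, "dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000, "dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000, "dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000, "dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000, "dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000, "dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},
{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000, "dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},
"""


def put_wait_durations(text):
    """Return the "dur" values (microseconds) of put_wait events, in order."""
    durs = []
    for line in text.splitlines():
        line = line.strip().rstrip(",")  # drop the trailing comma
        if not line:
            continue
        event = json.loads(line)
        if event["name"] == "put_wait":
            durs.append(event["dur"])
    return durs


def slow_timesteps(durs, factor=1.25):
    """Return 1-based timesteps slower than factor * median.

    The first timestep is skipped (assumed to include connection setup).
    The 1.25 cutoff is an arbitrary illustrative choice.
    """
    typical = median(durs[1:])
    return [i + 1 for i, d in enumerate(durs) if i > 0 and d > factor * typical]


if __name__ == "__main__":
    print(slow_timesteps(put_wait_durations(TRACE)))
```
</pre>
<div class="">On the trace above this reports timesteps 4 and 8, which matches what I see by inspection.</div>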
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns &lt;<a href="mailto:carns@mcs.anl.gov" class="">carns@mcs.anl.gov</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">
<p class=""><br class="">
</p>
<div class="moz-cite-prefix">On 11/5/21 12:23 PM, Philip Davis wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">
That’s a good find. I’m reading through the code and documentation, and I’m having a little trouble understanding what the difference is between 0 and -1 for that last argument when the second-to-last argument is 0. I see in the documentation:
<div class=""><br class="">
</div>
<div class=""><span class="">The third argument indicates whether an Argobots execution stream (ES) should be created to run the Mercury progress loop. If this argument is set to 0, the progress loop is going to run in the context of the main ES (this should be the standard scenario, unless you have a good reason for not using the main ES, such as the main ES using MPI primitives that could block the progress loop). A value of 1 will make Margo create an ES to run the Mercury progress loop. The fourth argument is the number of ES to create and use for executing RPC handlers. A value of 0 will make Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate"><span class="pre">margo_init</span></code><span class="">. A value of -1 will make Margo execute the RPCs in the ES running the progress loop. A positive value will make Margo create new ESs to run the RPCs.</span></div>
<div class=""><font class="" size="3" face="Lato, proxima-nova,
          Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64); background-color:
            rgb(252, 252, 252);" class=""><br class="">
</span></font></div>
<span class="">What is the difference between the ‘main ES’ (last two arguments 0,-1) and the ‘ES that called margo_init’ (last two arguments 0,0) in the absence of me creating new execution streams? Or maybe I’m not interpreting the documentation correctly?<br class="">
</span></blockquote>
<p class=""><br class="">
</p>
<p class="">You are right, those are equivalent when the next to last argument is 0 :)  The main ES and progress thread ES are one and the same in that case, so the RPCs go to the same place either way.</p>
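<p class="">As a way to see that equivalence, here is a small Python sketch that simply restates the documentation quoted above as a lookup. It is a toy model and not Margo code; the function name es_placement and the string labels are made up for illustration.</p>
<pre class="">
```python
def es_placement(use_progress_thread, rpc_thread_count):
    """Toy model of the last two margo_init arguments (not Margo code).

    Returns (progress_es, handler_es): labels for where the Mercury
    progress loop and the RPC handlers run, per the quoted documentation.
    """
    # Third argument: 0 runs the progress loop in the main ES (the ES that
    # called margo_init); 1 makes Margo create a dedicated progress ES.
    progress_es = "progress ES" if use_progress_thread else "main ES"

    # Fourth argument: positive creates new ESs for handlers, -1 reuses the
    # ES running the progress loop, 0 uses the ES that called margo_init.
    if rpc_thread_count > 0:
        handler_es = "%d dedicated RPC ES(s)" % rpc_thread_count
    elif rpc_thread_count == -1:
        handler_es = progress_es
    else:
        handler_es = "main ES"
    return progress_es, handler_es


if __name__ == "__main__":
    for args in [(0, 0), (0, -1), (1, 0), (1, -1), (1, 2)]:
        print(args, es_placement(*args))
```
</pre>
<p class="">In this model (0, 0) and (0, -1) give the same placement, while (1, 0) and (1, -1) differ, which is the only case where the distinction between 0 and -1 matters.</p>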
<p class="">I've narrowed down a little further and found that the stalls occur when there are 2 dedicated RPC ESs but not when there is just 1 dedicated RPC ES.  That isolates the problem slightly further, in that it's not just some cost associated with relaying
 ULTs to another thread (that's happening every time in the 1 handler case too), but something that only happens when multiple ESes could potentially service an RPC.</p>
<p class="">thanks,</p>
<p class="">-Phil<br class="">
<span class=""></span></p>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class="">
<span class="">
<blockquote type="cite" class="">On Nov 5, 2021, at 10:29 AM, Phil Carns &lt;<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>&gt; wrote:<br class="">
<br class="">
Srinivasan Ramesh (U. Oregon) has done some work on fine-grained RPC component timing, but it's not in the mainline margo tree so we'll need to do a little more work to look at it.<br class="">
<br class="">
In the meantime, on a hunch, I found that I can make the latency consistent on Cooley by altering the margo_init() arguments to be (..., 0, -1) in server.c. With those arguments no additional execution streams are used at all; all Mercury progress and all RPC handlers are
 executed using user-level threads in the process's primary execution stream (OS thread).<br class="">
<br class="">
It's expected that there would be some more jitter jumping across OS threads for RPC handling, but it shouldn't be that extreme, regular, or system-specific.<br class="">
<br class="">
Thanks again for the test case and the Apex instrumentation; this is the sort of thing that's normally really hard to isolate.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">
That’s extremely interesting. <br class="">
<br class="">
Are there any internal timers in Margo that can tell what the delay was between the server’s progress thread queueing the RPC and the handler thread starting to handle it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
 it seems to me like that is the most likely place for non-deterministic delay to be introduced by Argobots in the client -> server direction.<br class="">
<br class="">
I just ran a quick test where I changed the number of handler threads to 5, and I saw no change in behavior (still 4 and 8, not 5 and 10).<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 4, 2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov">&lt;carns@mcs.anl.gov&gt;</a> wrote:<br class="">
<br class="">
I have some more interesting info to share from trying a few different configurations.<br class="">
<br class="">
sm (on my laptop) and ofi+gni (on Theta) do not exhibit this behavior; they have consistent performance across RPCs.<br class="">
<br class="">
ofi+verbs (on Cooley) shows the same thing you were seeing; the 4th and 8th RPCs are slow.<br class="">
<br class="">
Based on the above, it would sound like a problem with the libfabric/verbs path.  But on Jerome's suggestion I tried some other supported transports on Cooley as well.  In particular I ran the same benchmark (the same build, in fact; I just compiled in support
 for multiple transports and cycled through them in a single job script with runtime options) with these combinations:<br class="">
<br class="">
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>• ofi+verbs<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>• ofi+tcp<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>• ofi+sockets<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>• bmi+tcp<br class="">
</div>
All of them show the same thing!  4th and 8th RPCs are at least an order of magnitude slower than the other RPCs.  That was a surprising result.  The bmi+tcp one isn't even using libfabric at all, even though they are all using the same underlying hardware.<br class="">
<br class="">
I'm not sure what to make of that yet.  Possibly something with threading or signalling?<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">
I’m glad you were able to reproduce it on a different system, thanks for letting me know. I’m not sure what the overlaps between Frontera and Cooley are (that aren’t shared by Summit); a quick look shows they are both Intel, and both FDR, but there are probably
 more salient details.<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov">&lt;carns@mcs.anl.gov&gt;</a> wrote:<br class="">
<br class="">
Ok.  Interesting.  I didn't see anything unusual in the timing on my laptop with sm (other than it being a bit slow, but I wasn't tuning or worrying about core affinity or anything).  On Cooley, a somewhat older Linux cluster with InfiniBand, I see the 4th
 and 8th RPC delay you were talking about:<br class="">
<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
(assuming the first is high due to connection establishment)<br class="">
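<div class=""><br class="">
</div>
<div class="">To put a number on the delay, here is a minimal Python sketch that divides the 4th and 8th "dur" values by the median of the steady-state timesteps. The durations are copied from the Cooley trace above; the slowdown helper and the choice to exclude the first timestep from the median are assumptions made for illustration.</div>
<pre class="">
```python
from statistics import median

# "dur" values in microseconds, copied from the Cooley put_wait trace above.
DURS = [33077.620054, 458.322054, 205.328054, 2916.470054, 235.460054,
        208.722054, 155.962054, 3573.288054, 243.386054, 154.338054]


def slowdown(durs, timestep):
    """Ratio of one timestep (1-based) to the median of timesteps 2..N.

    Timestep 1 is excluded from the median because it is assumed to
    include connection establishment.
    """
    typical = median(durs[1:])
    return durs[timestep - 1] / typical


if __name__ == "__main__":
    for step in (4, 8):
        print(step, round(slowdown(DURS, step), 1))
```
</pre>
<div class="">With these numbers the 4th and 8th RPCs come out roughly 12x and 15x slower than the median RPC.</div>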
<br class="">
I'll check some other systems/transports, but I wanted to go ahead and share that I've been able to reproduce what you were seeing.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">
Glad that’s working now.<br class="">
<br class="">
It is the put_wait events, and “dur” is the right field. Those units are microseconds.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021, at 1:12 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov">&lt;carns@mcs.anl.gov&gt;</a> wrote:<br class="">
<br class="">
Thanks Philip, the "= {0};" initialization of that struct got me going.<br class="">
<br class="">
I can run the test case now and it is producing output in the client and server perf dirs.  Just to sanity check what to look for, I think the problem should be exhibited in the "put_wait" or maybe "do_put" trace events on the client?  For example on my laptop
 I see this:<br class="">
<br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
I should look at the "dur" field right?  What are the units on that?<br class="">
<br class="">
I'll see if I can run this on a "real" system shortly.<br class="">
<br class="">
thanks!<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 12:11 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
Sorry the data structures are like that; I wanted to preserve as much of the RPC size and ordering in case it ended up being important.<br class="">
<br class="">
I’m surprised in.odsc.size is troublesome, as I set in.odsc.size with the line `in.odsc.size = sizeof(odsc);`. I’m not sure what could be corrupting that value in the meantime.<br class="">
<br class="">
I don’t set in.odsc.gdim_size (which was an oversight, since that’s non-zero normally), so I’m less surprised that’s an issue. I thought I initialized `in` to zero, but I see I didn’t do that after all.<br class="">
<br class="">
Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`<br class="">
<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021, at 11:48 AM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov">&lt;carns@mcs.anl.gov&gt;</a> wrote:<br class="">
<br class="">
Awesome, thanks Philip.  It came through fine.<br class="">
<br class="">
I started by modifying the job script slightly to just run it on my laptop with sm (I wanted to make sure I understood the test case, and how to use APEX, before trying elsewhere).  Does in.size need to be set in client.c?  For me there is a random value in
 that field and it is causing the encoder on the forward to attempt a very large allocation.  The same might be true of gdim_size if it got past that step.  I started to alter them but then I wasn't sure
 what the implications were.<br class="">
<br class="">
(fwiw I needed to include stdlib.h in common.h, but I've hit that a couple of times recently on codes that didn't previously generate warnings; I think something in Ubuntu has gotten strict about that recently).<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
<br class="">
<br class="">
On 11/1/21 4:51 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
I’ve attached the reproducer. I see the 4th and 8th issue on Frontera, but not Summit. Hopefully it will build and run without too much modification. Let me know if there are any issues with running it (or if the anl listserv eats the tarball, which I kind of
 expect).<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 1, 2021, at 11:14 AM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov">&lt;carns@mcs.anl.gov&gt;</a> wrote:<br class="">
<br class="">
Hi Philip,<br class="">
<br class="">
(FYI I think the first image didn't come through in your email, but I think the others are sufficient to get across what you are seeing)<br class="">
<br class="">
I don't have any idea what would cause that.  The recently released libfabric 1.13.2 (available in spack from the mochi-spack-packages repo) includes some fixes to the rxm provider that could be relevant to Frontera and Summit, but nothing that aligns with what
 you are observing.<br class="">
<br class="">
If it were later in the sequence (much later) I would speculate that memory allocation/deallocation cycles were eventually causing a hiccup.  We've seen something like that in the past, and it's a theory that we could then test with alternative allocators like jemalloc.  That's
 not memory allocation jitter that early in the run though.<br class="">
<br class="">
Please do share your reproducer if you don't mind!  We can try a few systems here to at least isolate if it is something peculiar to the InfiniBand path or if there is a more general problem in Margo.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 10/29/21 3:20 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" class="">Hello,<br class="">
<br class="">
I apologize in advance for the winding nature of this email. I’m not sure how to ask my question without explaining the story of my results some.<br class="">
<br class="">
I’m doing some characterization of our server performance under load, and I have a quirk of performance that I wanted to run by people to see if it makes sense. My testing so far has been to iteratively send batches of RPCs using margo_iforward, and then measure
 the wait time until they are all complete. On the server side, handling the RPC includes a margo_bulk_transfer as a pull initiated on the server to pull (for now) 8 bytes. The payload of the RPC request is about 500 bytes, and the response payload is 4 bytes.<br class="">
<br class="">
I’ve isolated my results down to one server rank and one client rank, because it’s an easier starting point to reason from. Below is a graph of some of my initial results. These results are from Frontera. The median times are good (a single RPC takes on the order
 of 10s of microseconds, which seems fantastic). However, the outliers are fairly high (note the log scale of the y-axis). With only one RPC per timestep, for example, there is a 30x spread between the
 median and the max.<br class="">
<br class="">
<img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
<br class="">
I was hoping (expecting) the first timestep would be where the long replies resided, but that turned out not to be the case. Below are traces from the 1 RPC (blue) and 2 RPC (orange) per timestep cases, 5 trials of 10 timesteps for each case (normalized to fit
 the same x-axis):<br class="">
<br class="">
&lt;PastedGraphic-6.png&gt;<br class="">
<br class="">
What strikes me is how consistent these results are across trials. For the 1 RPC per timestep case, the 3rd and 7th timesteps are consistently slow (and the rest are fast). For the 2 RPC per timestep case, the 2nd and 4th timesteps are always slow and sometimes
 the 10th is. These results are repeatable with very rare variation.<br class="">
<br class="">
For the single RPC case, I recorded some timers on the server side, and attempted to overlay them with the client side (there is some unknown offset, but probably on the order of 10s of microseconds at worst, given the pattern):<br class="">
<br class="">
&lt;PastedGraphic-7.png&gt;<br class="">
<br class="">
I blew up the first few timesteps of one of the trials:<br class="">
&lt;PastedGraphic-8.png&gt;<br class="">
<br class="">
The different colors are different segments of the handler, but there doesn’t seem to be anything too interesting going on inside the handler. So it looks like the time is being introduced before the 3rd RPC handler starts, based on where the gap appears
 on the server side.<br class="">
<br class="">
To try and isolate any DataSpaces-specific behavior, I created a pure Margo test case that just sends a single RPC of the same size as DataSpaces iteratively, where the server side does an 8-byte bulk transfer initiated by the server, and sends a response. The results
 are similar, except that it is now the 4th and 8th timesteps that are slow (and the first timestep is VERY long, presumably because rxm communication state is being established; DataSpaces has an earlier RPC in its init that was absorbing this latency).<br class="">
<br class="">
I got margo profiling results for this test case:<br class="">
<br class="">
```<br class="">
3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
0xa2a1,term_rpc<br class="">
0x27b5,put_rpc<br class="">
0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
```<br class="">
<br class="">
So I guess my question at this point is: is there any sensible reason why the 4th and 8th RPCs sent would have a long response time? I think I’ve cleared my code on the client side and server side, so it appears to be latency being introduced by Margo, libfabric, Argobots,
 or the underlying OS. I do see long timesteps occasionally after this (perhaps every 20-30 timesteps) but these are not consistent.<br class="">
<br class="">
One last detail: this does not happen on Summit. On Summit, I see about 5-7x worse single-RPC performance (250-350 microseconds per RPC), but without the intermittent long timesteps.<br class="">
<br class="">
I can provide the minimal test case if it would be helpful. I am using APEX for timing results, and the following dependencies with Spack:<br class="">
<br class="">
argobots@1.1  json-c@0.15  libfabric@1.13.1  mercury@2.0.1  mochi-margo@0.9.5  rdma-core@20<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<div class="">_______________________________________________</div>
<div class="">mochi-devel mailing list</div>
<br class="Apple-interchange-newline">
<a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</span></blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</body>
</html>