<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>And one last follow-up here (thanks to a ton of help off-list
from Philip!). The core problem is a bad interaction between the
Argobots memory allocator, the stack size used by Margo, and
transparent huge page (THP) handling on some systems. It can
result in newly allocated memory taking several milliseconds to
page in, at very inconvenient times.<br>
</p>
<p>You can work around it on affected systems by using
<code>prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)</code> to disable automatic THP
activation.</p>
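<p>For reference, a minimal sketch of that workaround (assuming a Linux
kernel new enough to provide PR_SET_THP_DISABLE, i.e. 3.15 or later):</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;sys/prctl.h&gt;

/* Disable automatic THP promotion for this whole process.  Sketch only;
 * call it early, before Margo/Argobots start allocating ULT stacks. */
static void disable_thp(void)
{
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_THP_DISABLE)");
}
</pre>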
<p>We'll pursue a longer-term solution in the form of a custom stack
allocator for Margo; that effort will be tracked on GitHub at
<a class="moz-txt-link-freetext" href="https://github.com/mochi-hpc/mochi-margo/issues/166">https://github.com/mochi-hpc/mochi-margo/issues/166</a>.</p>
<p>thanks,</p>
<p>-Phil<br>
</p>
<div class="moz-cite-prefix">On 11/10/21 7:45 PM, Phil Carns wrote:<br>
</div>
<blockquote type="cite" cite="mid:0d9a618a-6775-7100-ca16-6ff22529956a@mcs.anl.gov">
<p>Just to follow up for folks on the list: we've made some
headway on isolating the source of the delays. It isn't
network-specific, but rather an issue in the Argobots ULT creation path
on some platforms. It's being tracked now at <a class="moz-txt-link-freetext" href="https://github.com/pmodels/argobots/issues/369" moz-do-not-send="true">https://github.com/pmodels/argobots/issues/369</a>
with a standalone Argobots reproducer. It should be a lot
easier to hunt down from this point.<br>
</p>
<p>Some details:</p>
<ul>
<li>This only happens on particular machines (Frontera@TACC and
Cooley@ALCF for example) but we don't know why just yet.<br>
</li>
<li>It only happens if you use dedicated threads (execution
streams) for handling RPCs, though this is a common
configuration for a lot of Mochi data services.</li>
<li>The test case also happens to highlight that we have some
memory allocation/caching tuning work to do even on platforms
that behave correctly. We have a lot of options on that
front, but we'll circle back to it later once the major
platform-specific performance bug is resolved.</li>
</ul>
<p>thanks,</p>
<p>-Phil<br>
</p>
<blockquote type="cite" cite="mid:7B252C38-02B7-40EE-A799-7856565F7C0E@rutgers.edu"> One
more thing: I tried the same test case with a non-trivial size
for the bulk pull (8MB) just to shift the timing some. I still
see the extra time appear on the 4th and 8th timestep. I had
wondered if the issue was that the timestep period was small
enough that some background task was not completing in time, but
that does not seem to be the case:
<div class=""><br class="">
<div class="">
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024255773.000000,
"dur": 76836.500000,"args":{"GUID":3,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024332702.500000,
"dur": 3420.500000,"args":{"GUID":5,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024336178.250000,
"dur": 3369.000000,"args":{"GUID":7,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024339577.500000,
"dur": 5052.000000,"args":{"GUID":9,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024344658.750000,
"dur": 3326.750000,"args":{"GUID":11,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024348016.500000,
"dur": 3697.000000,"args":{"GUID":13,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024351743.500000,
"dur": 3333.250000,"args":{"GUID":15,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024355106.000000,
"dur": 4500.000000,"args":{"GUID":17,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024359634.500000,
"dur": 3336.250000,"args":{"GUID":19,"Parent GUID":0}},</div>
<div class="">{"name":"put_wait","ph":"X","pid":0,"tid":0,"ts":1636138024362999.000000,
"dur": 3296.000000,"args":{"GUID":21,"Parent GUID":0}},</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Nov 5, 2021, at 12:40 PM, Phil Carns
<<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">
<p class=""><br class="">
</p>
<div class="moz-cite-prefix">On 11/5/21 12:23 PM,
Philip Davis wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> That’s a good find. I’m reading through
the code and documentation, and I’m having a
little trouble understanding what the difference
between 0 and -1 for that last argument is when
the second to last argument is 0. I see in
documentation:
<div class=""><br class="">
</div>
<div class=""><span style="caret-color: rgb(64,
64, 64); color: rgb(64, 64, 64); font-family:
Lato, proxima-nova, "Helvetica
Neue", Arial, sans-serif; font-size:
16px; background-color: rgb(252, 252, 252);" class="">The third argument indicates whether
an Argobots execution stream (ES) should be
created to run the Mercury progress loop. If
this argument is set to 0, the progress loop
is going to run in the context of the main ES
(this should be the standard scenario, unless
you have a good reason for not using the main
ES, such as the main ES using MPI primitives
that could block the progress loop). A value
of 1 will make Margo create an ES to run the
Mercury progress loop. The fourth argument is
the number of ES to create and use for
executing RPC handlers. A value of 0 will make
Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing: border-box; font-family:
SFMono-Regular, Menlo, Monaco, Consolas,
"Liberation Mono", "Courier
New", Courier, monospace; white-space:
nowrap; max-width: 100%; background-color:
rgb(255, 255, 255); border: 1px solid rgb(225,
228, 229); padding: 2px 5px; color: rgb(231,
76, 60); overflow-x: auto;"><span class="pre" style="box-sizing: border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color:
rgb(64, 64, 64); font-family: Lato,
proxima-nova, "Helvetica Neue",
Arial, sans-serif; font-size: 16px;
background-color: rgb(252, 252, 252);" class="">. A value of -1 will make Margo
execute the RPCs in the ES running the
progress loop. A positive value will make
Margo create new ESs to run the RPCs.</span></div>
<div class=""><font class="" size="3" face="Lato,
proxima-nova, Helvetica Neue, Arial,
sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64);
background-color: rgb(252, 252, 252);" class=""><br class="">
</span></font></div>
<span class="">What is the difference between the
‘main ES’ (last two arguments 0,-1) and the 'ES
that called margo_init’ (last two arguments
0,0) in the absence of me creating new execution
streams? Or maybe I’m not interpreting the
documentation correctly?<br class="">
</span></blockquote>
<p class=""><br class="">
</p>
<p class="">You are right, those are equivalent when
the next to last argument is 0 :) The main ES and
progress thread ES are one and the same in that
case, so the RPCs go to the same place either way.</p>
<p class="">I've narrowed down a little further and
found that the stalls occur when there are 2
dedicated RPC ESs but not when there is just 1
dedicated RPC ES. That isolates the problem
slightly further, in that it's not just some cost
associated with relaying ULTs to another thread
(that's happening every time in the 1 handler case
too), but something that only happens when
multiple ESes could potentially service an RPC.</p>
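<p class="">For anyone following along, here is a sketch of the
configurations being compared. The address string and server mode are
placeholders (not taken from the actual server.c), and whether the
progress loop gets its own ES in the multi-ES cases is shown as 1
purely for illustration; the last two margo_init() arguments are
use_progress_thread and rpc_thread_count.</p>
<pre class="">
#include &lt;margo.h&gt;

margo_instance_id init_server(void)
{
    /* Everything in the process's primary ES: the Mercury progress loop
     * and all RPC handlers run there as ULTs.  With a third argument of 0,
     * a final argument of 0 or -1 behaves the same. */
    /* return margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1); */

    /* One dedicated RPC ES: no stall observed. */
    /* return margo_init("ofi+verbs://", MARGO_SERVER_MODE, 1, 1); */

    /* Two dedicated RPC ESs: the 4th/8th RPC stall appears. */
    return margo_init("ofi+verbs://", MARGO_SERVER_MODE, 1, 2);
}
</pre>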
<p class="">thanks,</p>
<p class="">-Phil<br class="">
<span class=""></span></p>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu" class=""> <span class="">
<blockquote type="cite" class="">On Nov 5, 2021,
at 10:29 AM, Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
wrote:<br class="">
<br class="">
Srinivasan Ramesh (U. Oregon) has done some
work on fine-grained RPC component timing, but
it's not in the mainline margo tree so we'll
need to do a little more work to look at it.<br class="">
<br class="">
In the meantime, on a hunch, I found that I can make the latency consistent on Cooley by altering the margo_init() arguments to be (..., 0, -1) in server.c (meaning that no additional execution streams are used at all; all Mercury progress and all RPC handlers are executed using user-level threads in the process's primary execution stream (OS thread)).<br class="">
<br class="">
It's expected that there would be some more
jitter jumping across OS threads for RPC
handling, but it shouldn't be that
extreme, regular, or system-specific.<br class="">
<br class="">
Thanks again for the test case and the Apex
instrumentation; this is the sort of thing
that's normally really hard to isolate.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class=""> That’s extremely interesting. <br class="">
<br class="">
Are there any internal timers in Margo that
can tell what the delay was between the
server’s progress thread queueing the rpc
and the handler thread starting to handle
it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
it seems to me like that is the most likely
place for non-deterministic delay to be
introduced by argobots in the client ->
server direction.<br class="">
<br class="">
I just ran a quick test where I changed the
number of handler threads to 5, and I saw no
change in behavior (still 4 and 8, not 5
and 10).<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 4,
2021, at 9:04 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
<carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
I have some more interesting info to
share from trying a few different
configurations.<br class="">
<br class="">
sm (on my laptop) and ofi+gni (on
theta) do not exhibit this behavior; they
have consistent performance across RPCs.<br class="">
<br class="">
ofi+verbs (on cooley) shows the same
thing you were seeing; the 4th and 8th
RPCs are slow.<br class="">
<br class="">
Based on the above, it would sound like
a problem with the libfabric/verbs path.
But on Jerome's suggestion I tried some
other supported transports on cooley as
well. In particular I ran the same
benchmark (the same build in fact, I
just compiled in support for multiple
transports and cycling through them in a
single job script with runtime options)
with these combinations:<br class="">
<br class="">
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+verbs<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+tcp<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+sockets<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
bmi+tcp<br class="">
</div>
All of them show the same thing! 4th
and 8th RPCs are at least an order of
magnitude slower than the other RPCs.
That was a surprising
result. The bmi+tcp one isn't even using
libfabric at all, even though they are all
using the same underlying hardware.<br class="">
<br class="">
I'm not sure what to make of that
yet. Possibly something with threading or
signalling?<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class=""> I’m glad you were able to
reproduce it on a different system,
thanks for letting me know. I’m not sure
what the overlaps between Frontera
and Cooley are (that aren’t shared by
Summit); a quick look shows they are
both Intel, and both FDR, but there are probably more salient details.<br class="">
<br class="">
<blockquote type="cite" class="">On Nov
2, 2021, at 2:24 PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
<carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Ok. Interesting. I didn't see
anything unusual in the timing on
my laptop with sm (other than it being
a bit slow, but I wasn't tuning
or worrying about core affinity
or anything). On Cooley, a somewhat
older Linux cluster with InfiniBand, I
see the 4th and 8th RPC delay you were
talking about:<br class="">
<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
(assuming the first is high due to
connection establishment)<br class="">
<br class="">
I'll check some
other systems/transports, but I wanted
to go ahead and share that I've been
able to reproduce what you were
seeing.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 1:49 PM, Philip Davis
wrote:<br class="">
<blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class=""> Glad that’s working now.<br class="">
<br class="">
It is the put_wait events, and “dur”
is the right field. Those units
are microseconds.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On
Nov 2, 2021, at 1:12 PM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true">
<carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Thanks Philip, the "=
{0};" initialization of
that struct got me going.<br class="">
<br class="">
I can run the test case now and it
is producing output in the client
and server perf dirs. Just to
sanity check what to look for,
I think the problem should be
exhibited in the "put_wait" or
maybe "do_put" trace events on the
client? For example on my laptop
I see this:<br class="">
<br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
I should look at the "dur" field
right? What are the units
on that?<br class="">
<br class="">
I'll see if I can run this on a
"real" system shortly.<br class="">
<br class="">
thanks!<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 12:11 PM, Philip Davis
wrote:<br class="">
<blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class=""> Hi Phil,<br class="">
<br class="">
Sorry the data structures are like that; I wanted to preserve as much of the RPC size and ordering as possible, in case it ended up being important.<br class="">
<br class="">
I’m surprised in.odsc.size is
troublesome, as I set
in.odsc.size with the line
`in.odsc.size = sizeof(odsc);`.
I’m not sure what could
be corrupting that value in the
meantime.<br class="">
<br class="">
I don’t set
in.odsc.gdim_size (which was an
oversight, since that’s
non-zero normally), so I’m
less surprised that’s an issue.
I thought I initialized `in`
to zero, but I see I didn’t do
that after all.<br class="">
<br class="">
Maybe change the line
`bulk_gdim_t in;` to
`bulk_gdim_t in = {0};`<br class="">
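<br class="">
Roughly, the failure mode is something like this (with a purely
hypothetical layout for bulk_gdim_t, just to illustrate the
uninitialized field):<br class="">
<pre class="">
#include &lt;stddef.h&gt;

/* Hypothetical shape of the RPC input struct, for illustration only. */
typedef struct {
    struct {
        size_t size;       /* set explicitly before forwarding         */
        size_t gdim_size;  /* never set -> stack garbage without = {0} */
    } odsc;
} bulk_gdim_t;

void example(void)
{
    bulk_gdim_t in;         /* automatic storage: fields start as garbage */
    in.odsc.size = 512;     /* stands in for sizeof(odsc)                 */
    /* in.odsc.gdim_size is still garbage here, so the encoder may try to
     * allocate an enormous buffer based on it. */

    bulk_gdim_t in2 = {0};  /* zero-initializes every member              */
    (void)in; (void)in2;
}
</pre>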
<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On
Nov 2, 2021, at 11:48 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Awesome, thanks Philip.
It came through fine.<br class="">
<br class="">
I started by modifying the job
script slightly to just run it
on my laptop with sm (I
wanted to make sure
I understood the test case,
and how to use APEX,
before trying elsewhere). Does
in.size need to be set
in client.c? For me there is
a random value in that
field and it is causing the
encoder on the forward
to attempt a
very large allocation. The
same might be true
of gdim_size if it got
past that step. I started
to alter them but then I
wasn't sure what
the implications were.<br class="">
<br class="">
(fwiw I needed to include stdlib.h in common.h, but I've hit that a couple of times recently on codes that didn't previously generate warnings; I think something in Ubuntu has gotten stricter about that).<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
<br class="">
<br class="">
On 11/1/21 4:51 PM,
Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class=""> Hi Phil,<br class="">
<br class="">
I’ve attached
the reproducer. I see the
4th and 8th issue on
Frontera, but
not Summit. Hopefully
it will build and run
without too
much modification. Let me
know if there are any issues
with running it (or if the
anl listserv eats the
tarball, which I kind of
expect).<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 1,
2021, at 11:14 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov" moz-do-not-send="true"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Hi Philip,<br class="">
<br class="">
(FYI I think the first
image didn't come through
in your email, but I
think the others
are sufficient to get
across what you
are seeing)<br class="">
<br class="">
I don't have any idea
what would cause that.
The recently released libfabric 1.13.2 (available
in spack from
the mochi-spack-packages repo)
includes some fixes to the
rxm provider that could
be relevant to Frontera
and Summit, but nothing
that aligns with what you
are observing.<br class="">
<br class="">
If it were later in
the sequence (much later)
I would speculate
that memory allocation/deallocation cycles
were eventually causing
a hiccup. We've seen
something like that in the
past, and it's a
theory that we could then
test
with alternative allocators like jemalloc. That's
not memory allocation jitter that early in the run though.<br class="">
<br class="">
Please do share
your reproducer if you
don't mind! We can try a
few systems here to at
least isolate if it is
something peculiar
to the InfiniBand path or
if there is a more
general problem in Margo.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 10/29/21 3:20 PM,
Philip Davis wrote:<br class="">
<blockquote type="cite" class="">Hello,<br class="">
<br class="">
I apologize in advance
for the winding nature
of this email. I’m
not sure how to ask
my question without explaining
the story of my results
some.<br class="">
<br class="">
I’m doing
some characterization of
our server performance
under load, and I have
a quirk of performance that I wanted to run by people to see if it makes sense. My testing
so far has been
to iteratively send
batches of RPCs
using margo_iforward,
and then measure
the wait time until they
are all complete. On the
server side,
handling the
RPC includes
a margo_bulk_transfer as
a pull initiated
on the server to pull
(for now) 8 bytes.
The payload of the RPC
request is about
500 bytes, and
the response payload is
4 bytes.<br class="">
<br class="">
I’ve isolated my results
down to one server rank
and one client
rank, because it’s an
easier starting point to
reason from. Below
is a graph of some of
my initial results.
These results are from
Frontera. The
median times are good (a
single RPC takes on
the order of 10s
of microseconds, which
seems fantastic). However, the outliers are fairly high (note the
log scale of
the y-axis). With only
one
RPC per timestep, for
example, there is a
30x spread between the
median and the max.<br class="">
<br class="">
<img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
<br class="">
I was
hoping (expecting) the
first timestep would be
where the long
replies resided,
but that turned out not
to be the case. Below
are traces from the 1
RPC (blue) and 2 RPC
(orange) per
timestep cases, 5 trials
of 10 timesteps for each
case (normalized to fit the same x-axis):<br class="">
<br class="">
<PastedGraphic-6.png><br class="">
<br class="">
What strikes me is how
consistent these
results are
across trials. For the 1
RPC per timestep
case, the 3rd and 7th
timestep are consistently slow (and
the rest are fast).
For the 2 RPC
per timestep case, the
2nd and 4th timestep are
always slow
and sometimes the
10th is.
These results
are repeatable with very
rare variation.<br class="">
<br class="">
For the single RPC case,
I recorded some timers
on the server side, and
attempted to
overlay them with
the client side (there
is some unknown offset,
but probably on the
order of 10s
of microseconds at
worst, given
the pattern):<br class="">
<br class="">
<PastedGraphic-7.png><br class="">
<br class="">
I blew up the first
few timesteps of one of
the trials:<br class="">
<PastedGraphic-8.png><br class="">
<br class="">
The different colors
are different segments
of the handler, but
there doesn’t seem to be
anything too interesting going on inside the handler. So it looks like
the time is
being introduced before
the 3rd RPC
handler starts, based on where the gap appears on the server side.<br class="">
<br class="">
To try and isolate
any dataspaces-specific behavior,
I created a pure Margo
test case that
just sends a single RPC of the same size as DataSpaces iteratively, where the server side does an
8-byte bulk
transfer initiated
by the server, and sends
a response. The results
are similar, except that
it is now the 4th and
8th timestep that are
slow (and the
first timestep is VERY
long, presumably because
rxm communication state
is
being established. DataSpaces has an earlier RPC in its init that
was absorbing
this latency).<br class="">
<br class="">
I got
margo profiling results
for this test case:<br class="">
<br class="">
```<br class="">
3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
0xa2a1,term_rpc<br class="">
0x27b5,put_rpc<br class="">
0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
```<br class="">
<br class="">
So I guess my question
at this point is, is
there
any sensible reason why
the 4th and 8th RPC sent
would have a
long response
time? I think
I’ve cleared my code on
the client side and
server side, so
it appears to be latency
being introduced
by Margo, LibFabric, Argobots,
or the underlying OS. I
do see long
timesteps occasionally after
this (perhaps
every 20-30 timesteps)
but these are
not consistent.<br class="">
<br class="">
One last detail: this
does not happen
on Summit. On Summit, I
see about
5-7x worse single-RPC performance (250-350 microseconds per
RPC), but without
the intermittent long timesteps.<br class="">
<br class="">
I can provide the
minimal test case if it
would be helpful. I
am using APEX
for timing results,
and the
following dependencies with
Spack:<br class="">
<br class="">
<a class="moz-txt-link-abbreviated
moz-txt-link-freetext" href="mailto:argobots@1.1" moz-do-not-send="true">argobots@1.1</a> <a class="moz-txt-link-abbreviated
moz-txt-link-freetext" href="mailto:json-c@0.15" moz-do-not-send="true">json-c@0.15</a> <a class="moz-txt-link-abbreviated
moz-txt-link-freetext" href="mailto:libfabric@1.13.1" moz-do-not-send="true">libfabric@1.13.1</a> <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mercury@2.0.1" moz-do-not-send="true">mercury@2.0.1</a> <a class="moz-txt-link-abbreviated moz-txt-link-freetext" href="mailto:mochi-margo@0.9.5" moz-do-not-send="true">mochi-margo@0.9.5</a> rdma-core@20<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<div class="">_______________________________________________</div>
<div class="">mochi-devel
mailing list</div>
<br class="Apple-interchange-newline">
<a class="moz-txt-link-abbreviated
moz-txt-link-freetext" href="mailto:mochi-devel@lists.mcs.anl.gov" moz-do-not-send="true">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.mcs.anl.gov%2Fmailman%2Flistinfo%2Fmochi-devel&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EiX9%2B4KlSbKB7w2ov4OZtE7jCugYdchESAfX0D%2B%2FK7c%3D&reserved=0" originalsrc="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel" shash="hT2kkjAIf/zHJhvtfpngbaSAG6BZQ0v653UkY5DV3lvADy7nxpFwHs9680ZLFYC4A/ZkOaDhVaoVT0dhguoFth03P3pU/ankBuXMiNnP4KyfpClP97DwHHcU9nwM04kJb9eamI6B/jo5QLEuZAQ97t2SUt+s9UTzpXLcXc/6/7I=" moz-do-not-send="true">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.mcs.anl.gov%2Fresearch%2Fprojects%2Fmochi&data=04%7C01%7Cphilip.e.davis%40rutgers.edu%7C061bd8118521465c017008d9a07aee25%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C637717273411614260%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vyMI4AQ%2BbXY4WVJocNVN%2BYJp6H1tYsP7meHgbjriejE%3D&reserved=0" originalsrc="https://www.mcs.anl.gov/research/projects/mochi" shash="ncm2uVm0Ugenklbk5rqBtY39R8xrytsxPVYOdy2NqnuC3nOSGasq7//KT8rW9cbT43LvCOivxAxgLyWaANsZOv0dwLdcqrgqnldIxtfDSfJzCXFTgmgI++3ODNu7UIcNUeMtmBw8YYusppzZlNxvPHAunqN44IwZIk9lYE5wtW0=" moz-do-not-send="true">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</span></blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</blockquote>
</body>
</html>