<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 11/5/21 12:23 PM, Philip Davis
wrote:<br>
</div>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu">
That’s a good find. I’m reading through the code and
documentation, and I’m having a little trouble understanding what
the difference between 0 and -1 for that last argument is when the
second to last argument is 0. I see in the documentation:
<div class=""><br class="">
</div>
<div class=""><span style="caret-color: rgb(64, 64, 64); color:
rgb(64, 64, 64); font-family: Lato, proxima-nova,
"Helvetica Neue", Arial, sans-serif; font-size:
16px; background-color: rgb(252, 252, 252);" class="">The
third argument indicates whether an Argobots execution stream
(ES) should be created to run the Mercury progress loop. If
this argument is set to 0, the progress loop is going to run
in the context of the main ES (this should be the standard
scenario, unless you have a good reason for not using the main
ES, such as the main ES using MPI primitives that could block
the progress loop). A value of 1 will make Margo create an ES
to run the Mercury progress loop. The fourth argument is the
number of ES to create and use for executing RPC handlers. A
value of 0 will make Margo execute RPCs in the ES that called </span><code class="code literal docutils notranslate" style="box-sizing:
border-box; font-family: SFMono-Regular, Menlo, Monaco,
Consolas, "Liberation Mono", "Courier
New", Courier, monospace; white-space: nowrap; max-width:
100%; background-color: rgb(255, 255, 255); border: 1px solid
rgb(225, 228, 229); padding: 2px 5px; color: rgb(231, 76, 60);
overflow-x: auto;"><span class="pre" style="box-sizing:
border-box;">margo_init</span></code><span style="caret-color: rgb(64, 64, 64); color: rgb(64, 64, 64);
font-family: Lato, proxima-nova, "Helvetica Neue",
Arial, sans-serif; font-size: 16px; background-color: rgb(252,
252, 252);" class="">. A value of -1 will make Margo execute
the RPCs in the ES running the progress loop. A positive value
will make Margo create new ESs to run the RPCs.</span></div>
<div class=""><font class="" size="3" face="Lato, proxima-nova,
Helvetica Neue, Arial, sans-serif" color="#404040"><span style="caret-color: rgb(64, 64, 64); background-color:
rgb(252, 252, 252);" class=""><br class="">
</span></font></div>
<span class="">What is the difference between the ‘main ES’ (last
two arguments 0,-1) and the 'ES that called margo_init’ (last
two arguments 0,0) in the absence of me creating new execution
streams? Or maybe I’m not interpreting the documentation
correctly?<br class="">
</span></blockquote>
<p><br>
</p>
<p>You are right, those are equivalent when the next to last
argument is 0 :) The main ES and progress thread ES are one and
the same in that case, so the RPCs go to the same place either
way.</p>
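<p>For concreteness, here is a minimal sketch of the two calls in
question (the address string and mode are placeholders; only the last
two arguments matter here, and margo.h is assumed to be included):</p>
<pre>
/* (a) progress loop in the main ES (3rd arg 0), RPC handlers in the ES
 *     running the progress loop (4th arg -1) */
margo_instance_id mid_a = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, -1);

/* (b) progress loop in the main ES (3rd arg 0), RPC handlers in the ES
 *     that called margo_init() (4th arg 0), which is the same main ES,
 *     so (a) and (b) end up equivalent */
margo_instance_id mid_b = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 0);
</pre>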
<p>I've narrowed it down a little further and found that the stalls
occur when there are 2 dedicated RPC ESs but not when there is
just 1 dedicated RPC ES. That isolates the problem slightly
further: it's not just some cost associated with relaying ULTs to
another thread (that happens every time in the 1-handler case
too), but something that only happens when multiple ESs could
potentially service an RPC.</p>
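<p>For reference, a hypothetical sketch of the two server
configurations being compared here (the address string and the
progress-loop argument are placeholders; the intended difference is
only the number of dedicated RPC handler ESs):</p>
<pre>
/* one dedicated RPC handler ES: no stalls observed */
margo_instance_id mid_one = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 1);

/* two dedicated RPC handler ESs sharing the handler pool: the
 * configuration where the intermittent stalls show up */
margo_instance_id mid_two = margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, 2);
</pre>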
<p>thanks,</p>
<p>-Phil<br>
<span class=""></span></p>
<blockquote type="cite" cite="mid:FC5B09A2-9427-4737-8EDB-826494B49D22@rutgers.edu"><span class="">
<blockquote type="cite" class="">On Nov 5, 2021, at 10:29 AM,
Phil Carns <<a href="mailto:carns@mcs.anl.gov" class="moz-txt-link-freetext" moz-do-not-send="true">carns@mcs.anl.gov</a>>
wrote:<br class="">
<br class="">
Srinivasan Ramesh (U. Oregon) has done some work on
fine-grained RPC component timing, but it's not in the
mainline margo tree so we'll need to do a little more work to
look at it.<br class="">
<br class="">
In the meantime, on a hunch, I found that I can make the
latency consistent on Cooley by altering the margo_init()
arguments to be (..., 0, -1) in server.c (meaning that no
additional execution streams are used at all; all Mercury
progress and all RPC handlers are executed using user-level
threads in the process's primary execution stream (OS thread)).<br class="">
<br class="">
It's expected that there would be some more jitter jumping
across OS threads for RPC handling, but it shouldn't be that
extreme, regular, or system-specific.<br class="">
<br class="">
Thanks again for the test case and the Apex instrumentation;
this is the sort of thing that's normally really hard to
isolate.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/5/21 10:09 AM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:B0643541-F015-4EB6-AD30-F9F85B196465@rutgers.edu" class="">
That’s extremely interesting. <br class="">
<br class="">
Are there any internal timers in Margo that can tell what
the delay was between the server’s progress thread
queueing the RPC and the handler thread starting to handle
it? If I’m understanding <a href="https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html" class="moz-txt-link-freetext" moz-do-not-send="true">https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html</a> correctly,
it seems to me like that is the most likely place for
non-deterministic delay to be introduced by Argobots in the
client -> server direction.<br class="">
<br class="">
I just ran a quick test where I changed the number of
handler threads to 5, and I saw no change in behavior (still
4 and 8, not 5 and 10).<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 4, 2021, at 9:04 PM,
Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
I have some more interesting info to share from trying a
few different configurations.<br class="">
<br class="">
sm (on my laptop) and ofi+gni (on theta) do not exhibit
this behavior; they have consistent performance across
RPCs.<br class="">
<br class="">
ofi+verbs (on cooley) shows the same thing you were
seeing; the 4th and 8th RPCs are slow.<br class="">
<br class="">
Based on the above, it sounds like a problem with the
libfabric/verbs path. But on Jerome's suggestion I tried
some other supported transports on Cooley as well. In
particular, I ran the same benchmark (the same build in
fact; I just compiled in support for multiple transports
and cycled through them in a single job script
with runtime options) with these combinations:<br class="">
<br class="">
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+verbs<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+tcp<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
ofi+sockets<br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>•
bmi+tcp<br class="">
</div>
All of them show the same thing! 4th and 8th RPCs are at
least an order of magnitude slower than the other RPCs.
That was a surprising result. The bmi+tcp one isn't even
using libfabric at all, even though they are all using the
same underlying hardware.<br class="">
<br class="">
I'm not sure what to make of that yet. Possibly something
with threading or signalling?<br class="">
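<br class="">
For reference, a hypothetical sketch of how a single server build can
cycle through those transports, assuming the protocol string is passed
in from the job script (the file name server.c and the last two
margo_init() arguments are placeholders, not the actual benchmark
code):<br class="">
<pre>
#include <margo.h>

int main(int argc, char **argv)
{
    /* e.g. "ofi+verbs://", "ofi+tcp://", "ofi+sockets://", "bmi+tcp://" */
    const char *proto = (argc > 1) ? argv[1] : "ofi+tcp://";

    margo_instance_id mid = margo_init(proto, MARGO_SERVER_MODE, 0, -1);
    if (mid == MARGO_INSTANCE_NULL)
        return 1;

    /* ... register RPCs, print the listening address, etc. ... */

    margo_wait_for_finalize(mid);
    return 0;
}
</pre>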
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 2:37 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:82DF3EFD-9A7E-414C-86F7-1894000C30E1@rutgers.edu" class="">
I’m glad you were able to reproduce it on a different
system, thanks for letting me know. I’m not sure what
the overlaps between Frontera and Cooley are (that
aren’t shared by Summit); a quick look shows they are
both Intel, and both FDR, but there are probably more
salient details.<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021, at 2:24
PM, Phil Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Ok. Interesting. I didn't see anything unusual in
the timing on my laptop with sm (other than it being
a bit slow, but I wasn't tuning or worrying about core
affinity or anything). On Cooley, a somewhat
older Linux cluster with InfiniBand, I see the 4th and
8th RPC delay you were talking about:<br class="">
<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
(assuming the first is high due to connection
establishment)<br class="">
<br class="">
I'll check some other systems/transports, but I wanted
to go ahead and share that I've been able to reproduce
what you were seeing.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 1:49 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:35E29989-49C6-4AA0-8C41-54D97F1ACBF6@rutgers.edu" class="">
Glad that’s working now.<br class="">
<br class="">
It is the put_wait events, and “dur” is the right
field. Those units are microseconds.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021,
at 1:12 PM, Phil Carns
<a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Thanks Philip, the "= {0};" initialization of
that struct got me going.<br class="">
<br class="">
I can run the test case now and it is producing
output in the client and server perf dirs. Just
to sanity check what to look for, I think the
problem should be exhibited in the "put_wait" or
maybe "do_put" trace events on the client? For
example on my laptop I see this:<br class="">
<br class="">
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]>
grep do_put trace_events.0.json<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},<br class="">
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},<br class="">
carns-x1-7g ~/w/d/d/m/client.perf> grep
put_wait trace_events.0.json<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},<br class="">
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},<br class="">
<br class="">
I should look at the "dur" field right? What are
the units on that?<br class="">
<br class="">
I'll see if I can run this on a "real" system
shortly.<br class="">
<br class="">
thanks!<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 11/2/21 12:11 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:104B5045-328B-48EE-A32C-09D25817EED7@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
Sorry the data structures are like that; I
wanted to preserve as much of the RPC size and
ordering as possible, in case it ended up being important.<br class="">
<br class="">
I’m surprised in.odsc.size is troublesome, as
I set in.odsc.size with the line `in.odsc.size
= sizeof(odsc);`. I’m not sure what could
be corrupting that value in the meantime.<br class="">
<br class="">
I don’t set in.odsc.gdim_size (which was an
oversight, since that’s non-zero normally), so
I’m less surprised that’s an issue. I thought
I initialized `in` to zero, but I see I
didn’t do that after all.<br class="">
<br class="">
Maybe change the line `bulk_gdim_t in;` to
`bulk_gdim_t in = {0};`<br class="">
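<br class="">
In other words, a hypothetical fragment (the field names are the ones
mentioned above; everything else stays as in the reproducer):<br class="">
<pre>
bulk_gdim_t in = {0};         /* zero-initializes every field, including gdim_size */
in.odsc.size = sizeof(odsc);  /* set explicitly, as before; the rest stays zero */
</pre>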
<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 2, 2021,
at 11:48 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Awesome, thanks Philip. It came through fine.<br class="">
<br class="">
I started by modifying the job script slightly
to just run it on my laptop with sm (I
wanted to make sure I understood the test
case, and how to use apex,
before trying elsewhere). Does in.size need
to be set in client.c? For me there is
a random value in that field and it is causing
the
encoder on the forward
to attempt a very large allocation. The same
might be true of gdim_size if it got past that
step. I started to alter them but then I
wasn't sure what the implications were.<br class="">
<br class="">
(fwiw I needed to include stdlib.h
in common.h, but I've hit that a couple
of times recently on codes
that didn't previously generate warnings;
I think something in Ubuntu has gotten
strict about that recently).<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
<br class="">
<br class="">
On 11/1/21 4:51 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" cite="mid:67D76603-B4B9-48E5-9452-966D1E4866D2@rutgers.edu" class="">
Hi Phil,<br class="">
<br class="">
I’ve attached the reproducer. I see the
4th and 8th issue on Frontera, but
not Summit. Hopefully it will build and run
without too much modification. Let me know
if there are any issues with running it
(or if the anl listserv eats the
tarball, which I kind of expect).<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Nov 1,
2021, at 11:14 AM, Phil
Carns <a class="moz-txt-link-rfc2396E" href="mailto:carns@mcs.anl.gov"><carns@mcs.anl.gov></a> wrote:<br class="">
<br class="">
Hi Philip,<br class="">
<br class="">
(FYI I think the first image didn't
come through in your email, but I
think the others are sufficient to get
across what you are seeing)<br class="">
<br class="">
I don't have any idea what would
cause that.
The recently released libfabric 1.13.2 (available
in spack from
the mochi-spack-packages repo) includes
some fixes to the rxm provider that could
be relevant to Frontera and Summit,
but nothing that aligns with what you
are observing.<br class="">
<br class="">
If it were later in the sequence
(much later) I would speculate
that memory allocation/deallocation cycles
were eventually causing a hiccup.
We've seen something like that in the
past, and it's a theory that we could then
test
with alternative allocators like jemalloc. But
memory allocation jitter wouldn't show up that
early in the run.<br class="">
<br class="">
Please do share your reproducer if you
don't mind! We can try a few systems
here to at least isolate if it is
something peculiar to the InfiniBand path
or if there is a more general problem
in Margo.<br class="">
<br class="">
thanks,<br class="">
<br class="">
-Phil<br class="">
<br class="">
On 10/29/21 3:20 PM, Philip Davis wrote:<br class="">
<blockquote type="cite" class="">Hello,<br class="">
<br class="">
I apologize in advance for the
winding nature of this email. I’m
not sure how to ask
my question without explaining the story
of my results some.<br class="">
<br class="">
I’m doing some characterization of our
server performance under load, and I have a
performance quirk that I
wanted to run by people to see if it
makes sense. My testing so far has been
to iteratively send batches of RPCs
using margo_iforward, and then measure
the wait time until they are
all complete. On the server side,
handling the RPC includes
a margo_bulk_transfer, a
pull initiated on the server to fetch
(for now) 8 bytes. The payload of
the RPC request is about 500 bytes, and
the response payload is 4 bytes.<br class="">
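<br class="">
Roughly, the client side of each timestep looks like the following
sketch (hypothetical; names such as NUM_RPCS, put_id, server_addr, and
bulk_out_t are placeholders rather than the actual reproducer code,
and margo.h is assumed to be included):<br class="">
<pre>
hg_handle_t   handles[NUM_RPCS];
margo_request reqs[NUM_RPCS];
bulk_gdim_t   in = {0};     /* ~500-byte request payload */
bulk_out_t    out;          /* 4-byte response payload */

/* issue the whole batch without waiting */
for (int i = 0; i < NUM_RPCS; i++) {
    margo_create(mid, server_addr, put_id, &handles[i]);
    margo_iforward(handles[i], &in, &reqs[i]);
}

/* "put_wait": time until every response has arrived */
for (int i = 0; i < NUM_RPCS; i++) {
    margo_wait(reqs[i]);
    margo_get_output(handles[i], &out);
    margo_free_output(handles[i], &out);
    margo_destroy(handles[i]);
}
</pre>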
<br class="">
I’ve isolated my results down to one
server rank and one client rank, because
it’s an easier starting point to
reason from. Below is a graph of some of
my initial results. These results
are from Frontera. The median times are
good (a single RPC takes on the order of
10s
of microseconds, which
seems fantastic). However, the outliers
are fairly high (note the log scale of
the y-axis). With only one
RPC per timestep, for example, there is
a 30x spread between the median and the
max.<br class="">
<br class="">
<img id="x_126DD6C1-A420-4D70-8D1C-96320B7F54E7" src="cid:5EB5DF63-97AA-48FC-8BBE-E666E933D79F" class="" moz-do-not-send="true"><br class="">
<br class="">
I was hoping (expecting) the
first timestep would be where the long
replies resided, but that turned out not
to be the case. Below are traces
from the 1 RPC (blue) and 2 RPC
(orange) per timestep cases, 5 trials
of 10 timesteps for each
case (normalized to fit the same
x-axis):<br class="">
<br class="">
<PastedGraphic-6.png><br class="">
<br class="">
What strikes me is how consistent these
results are across trials. For the 1 RPC
per timestep case, the 3rd and 7th
timesteps are consistently slow (and the
rest are fast). For the 2 RPC
per timestep case, the 2nd and 4th
timesteps are always slow, and sometimes
the 10th is.
These results are repeatable, with very
rare variation.<br class="">
<br class="">
For the single RPC case, I recorded
some timers on the server side, and
attempted to overlay them with
the client side (there is
some unknown offset, but probably on the
order of 10s of microseconds at
worst, given the pattern):<br class="">
<br class="">
<PastedGraphic-7.png><br class="">
<br class="">
I blew up the first few timesteps of one
of the trials:<br class="">
<PastedGraphic-8.png><br class="">
<br class="">
The different colors
are different segments of the
handler, but there doesn’t seem to be
anything too interesting going on inside
the handler. So it looks like the time
is being introduced before the 3rd RPC
handler starts, based on where the
gap appears on the server side.<br class="">
<br class="">
To try to isolate
any DataSpaces-specific behavior,
I created a pure Margo test case that
just iteratively sends a single RPC of the
same size as DataSpaces, where
the server side does an 8-byte bulk
transfer initiated by the server and
sends a response. The results
are similar, except that it is now the
4th and 8th timesteps that are slow
(and the first timestep is VERY
long, presumably because
rxm communication state is
being established; DataSpaces has an
earlier RPC in its init that
was absorbing this latency).<br class="">
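<br class="">
For reference, the server-side handler in such a test case would look
roughly like this sketch (hypothetical; the input/output types and the
bulk-handle field name are placeholders, not the actual test case):<br class="">
<pre>
static void put_rpc_ult(hg_handle_t handle)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info *info = margo_get_info(handle);
    bulk_gdim_t in;                  /* placeholder input type */
    bulk_out_t out = {0};            /* placeholder 4-byte response */
    char buf[8];                     /* target of the 8-byte pull */
    void *ptr = buf;
    hg_size_t size = sizeof(buf);
    hg_bulk_t local_bulk;

    margo_get_input(handle, &in);
    margo_bulk_create(mid, 1, &ptr, &size, HG_BULK_WRITE_ONLY, &local_bulk);
    /* server-initiated pull from the client's exposed region
     * (in.handle is a placeholder name for the client's bulk handle) */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.handle, 0,
                        local_bulk, 0, size);
    margo_respond(handle, &out);

    margo_bulk_free(local_bulk);
    margo_free_input(handle, &in);
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
</pre>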
<br class="">
I got margo profiling results for this
test case:<br class="">
<br class="">
```<br class="">
3<br class="">
18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573<br class="">
0xa2a1,term_rpc<br class="">
0x27b5,put_rpc<br class="">
0xd320,__shutdown__<br class="">
0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0x27b5 ,0;0.041241646,200.000000000, 0;<br class="">
0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0<br class="">
0xa2a1 ,0;0.000009298,1.000000000, 0;<br class="">
```<br class="">
<br class="">
So I guess my question at this point
is: is there any sensible reason why
the 4th and 8th RPCs sent would have a
long response time? I think I’ve ruled out
my code on the client side and
server side, so it appears to be latency
being introduced
by Margo, libfabric, Argobots, or the
underlying OS. I do see long
timesteps occasionally after
this (perhaps every 20-30 timesteps),
but these are not consistent.<br class="">
<br class="">
One last detail: this does not happen
on Summit. On Summit, I see about
5-7x worse single-RPC performance (250-350 microseconds per
RPC), but without
the intermittent long timesteps.<br class="">
<br class="">
I can provide the minimal test case
if it would be helpful. I am using APEX
for timing results, and the
following dependencies with Spack:<br class="">
<br class="">
<a class="moz-txt-link-abbreviated" href="mailto:argobots@1.1">argobots@1.1</a> <a class="moz-txt-link-abbreviated" href="mailto:json-c@0.15">json-c@0.15</a> <a class="moz-txt-link-abbreviated" href="mailto:libfabric@1.13.1">libfabric@1.13.1</a> <a class="moz-txt-link-abbreviated" href="mailto:mercury@2.0.1">mercury@2.0.1</a> <a class="moz-txt-link-abbreviated" href="mailto:mochi-margo@0.9.5">mochi-margo@0.9.5</a> rdma-core@20<br class="">
<br class="">
Thanks,<br class="">
Philip<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<div class="">_______________________________________________</div>
<div class="">mochi-devel mailing list</div>
<br class="Apple-interchange-newline">
<a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
_______________________________________________<br class="">
mochi-devel mailing list<br class="">
<a class="moz-txt-link-abbreviated" href="mailto:mochi-devel@lists.mcs.anl.gov">mochi-devel@lists.mcs.anl.gov</a><br class="">
<a class="moz-txt-link-freetext" href="https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel">https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel</a><br class="">
<a class="moz-txt-link-freetext" href="https://www.mcs.anl.gov/research/projects/mochi">https://www.mcs.anl.gov/research/projects/mochi</a><br class="">
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</blockquote>
</blockquote>
<br class="">
</span>
</blockquote>
</body>
</html>