[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Fri Nov 5 11:40:00 CDT 2021
On 11/5/21 12:23 PM, Philip Davis wrote:
> That’s a good find. I’m reading through the code and documentation,
> and I’m having a little trouble understanding what the difference
> between 0 and -1 for that last argument is when the second to last
> argument is 0. I see in the documentation:
>
> The third argument indicates whether an Argobots execution stream (ES)
> should be created to run the Mercury progress loop. If this argument
> is set to 0, the progress loop is going to run in the context of the
> main ES (this should be the standard scenario, unless you have a good
> reason for not using the main ES, such as the main ES using MPI
> primitives that could block the progress loop). A value of 1 will make
> Margo create an ES to run the Mercury progress loop. The fourth
> argument is the number of ES to create and use for executing RPC
> handlers. A value of 0 will make Margo execute RPCs in the ES that
> called `margo_init`. A value of -1 will make Margo execute the RPCs in
> the ES running the progress loop. A positive value will make Margo
> create new ESs to run the RPCs.
>
> What is the difference between the ‘main ES’ (last two arguments 0,-1)
> and the 'ES that called margo_init’ (last two arguments 0,0) in the
> absence of me creating new execution streams? Or maybe I’m
> not interpreting the documentation correctly?
You are right, those are equivalent when the next to last argument is 0
:) The main ES and progress thread ES are one and the same in that
case, so the RPCs go to the same place either way.
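For what it's worth, here's a minimal sketch of that configuration as a standalone server init (the address string, mode, and error handling are just illustrative, not taken from the test case):

```
#include <margo.h>

int main(void)
{
    /* Progress loop stays in the main ES (3rd argument = 0). With no
     * dedicated progress ES, a 4th argument of 0 ("RPCs in the ES that
     * called margo_init") and -1 ("RPCs in the progress loop's ES")
     * land in the same place: the process's primary ES. */
    margo_instance_id mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE,
                                       0 /* no progress ES */,
                                       -1 /* handlers in the progress ES */);
    if (mid == MARGO_INSTANCE_NULL)
        return -1;

    margo_wait_for_finalize(mid);
    return 0;
}
```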
I've narrowed it down a little further and found that the stalls occur when
there are 2 dedicated RPC ESs but not when there is just 1 dedicated RPC
ES. That isolates the problem slightly further: it's not just
some cost associated with relaying ULTs to another thread (that's
happening every time in the 1-handler case too), but something that only
happens when multiple ESs could potentially service an RPC.
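Expressed as margo_init() calls, that comparison is roughly the following (again just a sketch; the address is illustrative and the progress-loop argument may not match what server.c actually uses):

```
#include <margo.h>

/* Sketch: the two configurations being compared. A positive 4th
 * argument makes Margo create that many new ESs for RPC handlers. */
margo_instance_id init_server(int nr_rpc_es)
{
    /* nr_rpc_es = 1: one dedicated RPC ES  -> no stalls observed
     * nr_rpc_es = 2: two dedicated RPC ESs -> 4th/8th RPC stalls */
    return margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, nr_rpc_es);
}
```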
thanks,
-Phil
>> On Nov 5, 2021, at 10:29 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Srinivasan Ramesh (U. Oregon) has done some work on fine-grained RPC
>> component timing, but it's not in the mainline margo tree so we'll
>> need to do a little more work to look at it.
>>
>> In the meantime, on a hunch, I found that I can make the
>> latency consistent on Cooley by altering the margo_init() arguments
>> to be (..., 0, -1) in server.c (meaning that no additional
>> execution streams are used at all; all Mercury progress and all RPC
>> handlers are executed using user-level threads in the process's
>> primary execution stream (OS thread)).
>>
>> It's expected that there would be some more jitter jumping across OS
>> threads for RPC handling, but it shouldn't be that extreme, regular,
>> or system-specific.
>>
>> Thanks again for the test case and the Apex instrumentation; this is
>> the sort of thing that's normally really hard to isolate.
>>
>> thanks,
>>
>> -Phil
>>
>> On 11/5/21 10:09 AM, Philip Davis wrote:
>>> That’s extremely interesting.
>>>
>>> Are there any internal timers in Margo that can tell what the delay
>>> was between the server’s progress thread queueing the rpc and the
>>> handler thread starting to handle it? If I’m understanding
>>> https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html correctly,
>>> it seems to me like that is the most likely place for
>>> non-deterministic delay to be introduced by argobots in the client
>>> -> server direction.
>>>
>>> I just ran a quick test where I changed the number of handler
>>> threads to 5, and I saw no change in behavior (still 4 and 8, not 5
>>> and 10).
>>>
>>>
>>>> On Nov 4, 2021, at 9:04 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> I have some more interesting info to share from trying a few
>>>> different configurations.
>>>>
>>>> sm (on my laptop) and ofi+gni (on Theta) do not exhibit this
>>>> behavior; they have consistent performance across RPCs.
>>>>
>>>> ofi+verbs (on Cooley) shows the same thing you were seeing; the 4th
>>>> and 8th RPCs are slow.
>>>>
>>>> Based on the above, it would seem like a problem with the
>>>> libfabric/verbs path. But on Jerome's suggestion I tried some
>>>> other supported transports on Cooley as well. In particular I
>>>> ran the same benchmark (the same build in fact; I just compiled in
>>>> support for multiple transports and cycled through them in a
>>>> single job script with runtime options) with these combinations:
>>>>
>>>> • ofi+verbs
>>>> • ofi+tcp
>>>> • ofi+sockets
>>>> • bmi+tcp
>>>> All of them show the same thing! 4th and 8th RPCs are at least an
>>>> order of magnitude slower than the other RPCs. That was a
>>>> surprising result. The bmi+tcp one isn't even using libfabric at
>>>> all, even though they are all using the same underlying hardware.
>>>>
>>>> I'm not sure what to make of that yet. Possibly something with
>>>> threading or signalling?
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>> On 11/2/21 2:37 PM, Philip Davis wrote:
>>>>> I’m glad you were able to reproduce it on a different system,
>>>>> thanks for letting me know. I’m not sure what the overlaps between
>>>>> Frontera and Cooley are (that aren’t shared by Summit); a
>>>>> quick look shows they are both Intel and both FDR, but there are
>>>>> probably more salient details.
>>>>>
>>>>>> On Nov 2, 2021, at 2:24 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Ok. Interesting. I didn't see anything unusual in the timing on
>>>>>> my laptop with sm (other than it being a bit slow, but I wasn't
>>>>>> tuning or worrying about core affinity or anything). On Cooley,
>>>>>> a somewhat older Linux cluster with InfiniBand, I see the 4th and
>>>>>> 8th RPC delay you were talking about:
>>>>>>
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},
>>>>>>
>>>>>> (assuming the first is high due to connection establishment)
>>>>>>
>>>>>> I'll check some other systems/transports, but I wanted to
>>>>>> go ahead and share that I've been able to reproduce what you were
>>>>>> seeing.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> On 11/2/21 1:49 PM, Philip Davis wrote:
>>>>>>> Glad that’s working now.
>>>>>>>
>>>>>>> It is the put_wait events, and “dur” is the right field. Those
>>>>>>> units are microseconds.
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 2, 2021, at 1:12 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>> Thanks Philip, the "= {0};" initialization of that struct got
>>>>>>>> me going.
>>>>>>>>
>>>>>>>> I can run the test case now and it is producing output in
>>>>>>>> the client and server perf dirs. Just to sanity check what to
>>>>>>>> look for, I think the problem should be exhibited in
>>>>>>>> the "put_wait" or maybe "do_put" trace events on the client?
>>>>>>>> For example on my laptop I see this:
>>>>>>>>
>>>>>>>> carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
>>>>>>>> do_put trace_events.0.json
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},
>>>>>>>> carns-x1-7g ~/w/d/d/m/client.perf> grep
>>>>>>>> put_wait trace_events.0.json
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},
>>>>>>>>
>>>>>>>> I should look at the "dur" field right? What are the units
>>>>>>>> on that?
>>>>>>>>
>>>>>>>> I'll see if I can run this on a "real" system shortly.
>>>>>>>>
>>>>>>>> thanks!
>>>>>>>>
>>>>>>>> -Phil
>>>>>>>>
>>>>>>>> On 11/2/21 12:11 PM, Philip Davis wrote:
>>>>>>>>> Hi Phil,
>>>>>>>>>
>>>>>>>>> Sorry the data structures are like that; I wanted to preserve
>>>>>>>>> as much of the RPC size and ordering as possible, in case it
>>>>>>>>> ended up being important.
>>>>>>>>>
>>>>>>>>> I’m surprised in.odsc.size is troublesome, as I set
>>>>>>>>> in.odsc.size with the line `in.odsc.size = sizeof(odsc);`. I’m
>>>>>>>>> not sure what could be corrupting that value in the meantime.
>>>>>>>>>
>>>>>>>>> I don’t set in.odsc.gdim_size (which was an oversight, since
>>>>>>>>> that’s non-zero normally), so I’m less surprised that’s
>>>>>>>>> an issue. I thought I initialized `in` to zero, but I see I
>>>>>>>>> didn’t do that after all.
>>>>>>>>>
>>>>>>>>> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Nov 2, 2021, at 11:48 AM, Phil Carns
>>>>>>>>>> <carns at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>> Awesome, thanks Philip. It came through fine.
>>>>>>>>>>
>>>>>>>>>> I started by modifying the job script slightly to just run it
>>>>>>>>>> on my laptop with sm (I wanted to make sure I understood
>>>>>>>>>> the test case, and how to use apex,
>>>>>>>>>> before trying elsewhere). Does in.size need to be set
>>>>>>>>>> in client.c? For me there is a random value in that
>>>>>>>>>> field and it is causing the encoder on
>>>>>>>>>> the forward to attempt a very large allocation. The same
>>>>>>>>>> might be true of gdim_size if it got past that step.
>>>>>>>>>> I started to alter them but then I wasn't sure what
>>>>>>>>>> the implications were.
>>>>>>>>>>
>>>>>>>>>> (fwiw I needed to include stdlib.h in common.h, but I've hit
>>>>>>>>>> that a couple of times recently on codes
>>>>>>>>>> that didn't previously generate warnings; I think something
>>>>>>>>>> in Ubuntu has gotten strict about that recently).
>>>>>>>>>>
>>>>>>>>>> thanks,
>>>>>>>>>>
>>>>>>>>>> -Phil
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>>>>>>>>>> Hi Phil,
>>>>>>>>>>>
>>>>>>>>>>> I’ve attached the reproducer. I see the 4th and 8th issue on
>>>>>>>>>>> Frontera, but not Summit. Hopefully it will build and run
>>>>>>>>>>> without too much modification. Let me know if there are
>>>>>>>>>>> any issues with running it (or if the anl listserv eats the
>>>>>>>>>>> tarball, which I kind of expect).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Philip
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns
>>>>>>>>>>>> <carns at mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Philip,
>>>>>>>>>>>>
>>>>>>>>>>>> (FYI I think the first image didn't come through in your
>>>>>>>>>>>> email, but I think the others are sufficient to get
>>>>>>>>>>>> across what you are seeing)
>>>>>>>>>>>>
>>>>>>>>>>>> I don't have any idea what would cause that.
>>>>>>>>>>>> The recently released libfabric 1.13.2 (available in spack
>>>>>>>>>>>> from the mochi-spack-packages repo) includes some fixes to
>>>>>>>>>>>> the rxm provider that could be relevant to Frontera
>>>>>>>>>>>> and Summit, but nothing that aligns with what you
>>>>>>>>>>>> are observing.
>>>>>>>>>>>>
>>>>>>>>>>>> If it were later in the sequence (much later) I
>>>>>>>>>>>> would speculate that memory allocation/deallocation cycles
>>>>>>>>>>>> were eventually causing a hiccup. We've seen
>>>>>>>>>>>> something like that in the past, and it's a theory that we
>>>>>>>>>>>> could then test
>>>>>>>>>>>> with alternative allocators like jemalloc. That's
>>>>>>>>>>>> not memory allocation jitter that early in the run though.
>>>>>>>>>>>>
>>>>>>>>>>>> Please do share your reproducer if you don't mind! We
>>>>>>>>>>>> can try a few systems here to at least isolate if it is
>>>>>>>>>>>> something peculiar to the InfiniBand path or if there is
>>>>>>>>>>>> a more general problem in Margo.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> -Phil
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I apologize in advance for the winding nature of
>>>>>>>>>>>>> this email. I’m not sure how to ask
>>>>>>>>>>>>> my question without explaining the story of my results some.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’m doing some characterization of our server performance
>>>>>>>>>>>>> under load, and I have a quirk of performance that I
>>>>>>>>>>>>> wanted to run by people to see if they make sense.
>>>>>>>>>>>>> My testing so far has been to iteratively send batches of
>>>>>>>>>>>>> RPCs using margo_iforward, and then measure the wait time
>>>>>>>>>>>>> until they are all complete. On the server side,
>>>>>>>>>>>>> handling the RPC includes a margo_bulk_transfer as a
>>>>>>>>>>>>> pull initiated on the server to pull (for now) 8 bytes.
>>>>>>>>>>>>> The payload of the RPC request is about 500 bytes, and
>>>>>>>>>>>>> the response payload is 4 bytes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’ve isolated my results down to one server rank and
>>>>>>>>>>>>> one client rank, because it’s an easier starting point to
>>>>>>>>>>>>> reason from. Below is a graph of some of
>>>>>>>>>>>>> my initial results. These results are from Frontera. The
>>>>>>>>>>>>> median times are good (a single RPC takes on the order of
>>>>>>>>>>>>> 10s of microseconds, which seems fantastic). However,
>>>>>>>>>>>>> the outliers are fairly high (note the log scale of
>>>>>>>>>>>>> the y-axis). With only one RPC per timestep, for
>>>>>>>>>>>>> example, there is a 30x spread between the median and the max.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was hoping (expecting) the first timestep would be where
>>>>>>>>>>>>> the long replies resided, but that turned out not to
>>>>>>>>>>>>> be the case. Below are traces from the 1 RPC (blue) and
>>>>>>>>>>>>> 2 RPC (orange) per timestep cases, 5 trials of
>>>>>>>>>>>>> 10 timesteps for each case (normalized to fit the same
>>>>>>>>>>>>> x-axis):
>>>>>>>>>>>>>
>>>>>>>>>>>>> <PastedGraphic-6.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> What strikes me is how consistent these results are
>>>>>>>>>>>>> across trials. For the 1 RPC per timestep case, the 3rd
>>>>>>>>>>>>> and 7th timestep are consistently slow (and the rest are
>>>>>>>>>>>>> fast). For the 2 RPC per timestep case, the 2nd and 4th
>>>>>>>>>>>>> timestep are always slow and sometimes the
>>>>>>>>>>>>> 10th is. These results are repeatable with very
>>>>>>>>>>>>> rare variation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the single RPC case, I recorded some timers on
>>>>>>>>>>>>> the server side, and attempted to overlay them with
>>>>>>>>>>>>> the client side (there is some unknown offset,
>>>>>>>>>>>>> but probably on the order of 10s of microseconds at
>>>>>>>>>>>>> worst, given the pattern):
>>>>>>>>>>>>>
>>>>>>>>>>>>> <PastedGraphic-7.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I blew up the first few timesteps of one of the trials:
>>>>>>>>>>>>> <PastedGraphic-8.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The different colors are different segments of the
>>>>>>>>>>>>> handler, but there doesn’t seem to be
>>>>>>>>>>>>> anything too interesting going on inside the handler. So
>>>>>>>>>>>>> it looks like the time is being introduced before the
>>>>>>>>>>>>> 3rd RPC handler starts, based on where the gap appears
>>>>>>>>>>>>> on the server side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To try and isolate any dataspaces-specific behavior,
>>>>>>>>>>>>> I created a pure Margo test case that just sends a
>>>>>>>>>>>>> single RPC of the same size
>>>>>>>>>>>>> as dataspaces iteratively, where the server side does an
>>>>>>>>>>>>> 8-byte bulk transfer initiated by the server, and sends
>>>>>>>>>>>>> a response. The results are similar, except that it is now
>>>>>>>>>>>>> the 4th and 8th timestep that are slow (and the
>>>>>>>>>>>>> first timestep is VERY long, presumably because
>>>>>>>>>>>>> rxm communication state is being established. DataSpaces
>>>>>>>>>>>>> has an earlier RPC in its init that was absorbing
>>>>>>>>>>>>> this latency).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I got margo profiling results for this test case:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> 3
>>>>>>>>>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>>>>>>>>>> 0xa2a1,term_rpc
>>>>>>>>>>>>> 0x27b5,put_rpc
>>>>>>>>>>>>> 0xd320,__shutdown__
>>>>>>>>>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>>>>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>>>>>>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>>>>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I guess my question at this point is, is there
>>>>>>>>>>>>> any sensible reason why the 4th and 8th RPC sent
>>>>>>>>>>>>> would have a long response time? I think I’ve cleared
>>>>>>>>>>>>> my code on the client side and server side, so it appears
>>>>>>>>>>>>> to be latency being introduced
>>>>>>>>>>>>> by Margo, LibFabric, Argobots, or the underlying OS. I do
>>>>>>>>>>>>> see long timesteps occasionally after this (perhaps
>>>>>>>>>>>>> every 20-30 timesteps) but these are not consistent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One last detail: this does not happen on Summit.
>>>>>>>>>>>>> On Summit, I see about
>>>>>>>>>>>>> 5-7x worse single-RPC performance (250-350 microseconds per
>>>>>>>>>>>>> RPC), but without the intermittent long timesteps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can provide the minimal test case if it would
>>>>>>>>>>>>> be helpful. I am using APEX for timing results, and the
>>>>>>>>>>>>> following dependencies with Spack:
>>>>>>>>>>>>>
>>>>>>>>>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1
>>>>>>>>>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Philip
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>