[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Fri Nov 5 11:40:00 CDT 2021
On 11/5/21 12:23 PM, Philip Davis wrote:
> That’s a good find. I’m reading through the code and documentation,
> and I’m having a little trouble understanding what the difference
> between 0 and -1 for that last argument is when the second to last
> argument is 0. I see in the documentation:
>
> The third argument indicates whether an Argobots execution stream (ES)
> should be created to run the Mercury progress loop. If this argument
> is set to 0, the progress loop is going to run in the context of the
> main ES (this should be the standard scenario, unless you have a good
> reason for not using the main ES, such as the main ES using MPI
> primitives that could block the progress loop). A value of 1 will make
> Margo create an ES to run the Mercury progress loop. The fourth
> argument is the number of ES to create and use for executing RPC
> handlers. A value of 0 will make Margo execute RPCs in the ES that
> called `margo_init`. A value of -1 will make Margo execute the RPCs in
> the ES running the progress loop. A positive value will make Margo
> create new ESs to run the RPCs.
>
> What is the difference between the ‘main ES’ (last two arguments 0,-1)
> and the 'ES that called margo_init’ (last two arguments 0,0) in the
> absence of me creating new execution streams? Or maybe I’m
> not interpreting the documentation correctly?
You are right, those are equivalent when the next to last argument is 0
:) The main ES and progress thread ES are one and the same in that
case, so the RPCs go to the same place either way.
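For what it's worth, here's a minimal sketch of that configuration as a standalone server init (the address string, mode, and error handling are just illustrative, not taken from the test case):

```
#include <margo.h>

int main(void)
{
    /* Progress loop stays in the main ES (3rd argument = 0). With no
     * dedicated progress ES, a 4th argument of 0 ("RPCs in the ES that
     * called margo_init") and -1 ("RPCs in the progress loop's ES")
     * land in the same place: the process's primary ES. */
    margo_instance_id mid = margo_init("ofi+verbs://", MARGO_SERVER_MODE,
                                       0 /* no progress ES */,
                                       -1 /* handlers in the progress ES */);
    if (mid == MARGO_INSTANCE_NULL)
        return -1;

    margo_wait_for_finalize(mid);
    return 0;
}
```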
I've narrowed it down a little further and found that the stalls occur when
there are 2 dedicated RPC ESs but not when there is just 1 dedicated RPC
ES. That isolates the problem slightly further: it's not just
some cost associated with relaying ULTs to another thread (that's
happening every time in the 1-handler case too), but something that only
happens when multiple ESs could potentially service an RPC.
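Expressed as margo_init() calls, that comparison is roughly the following (again just a sketch; the address is illustrative and the progress-loop argument may not match what server.c actually uses):

```
#include <margo.h>

/* Sketch: the two configurations being compared. A positive 4th
 * argument makes Margo create that many new ESs for RPC handlers. */
margo_instance_id init_server(int nr_rpc_es)
{
    /* nr_rpc_es = 1: one dedicated RPC ES  -> no stalls observed
     * nr_rpc_es = 2: two dedicated RPC ESs -> 4th/8th RPC stalls */
    return margo_init("ofi+verbs://", MARGO_SERVER_MODE, 0, nr_rpc_es);
}
```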
thanks,
-Phil
>> On Nov 5, 2021, at 10:29 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Srinivasan Ramesh (U. Oregon) has done some work on fine-grained RPC
>> component timing, but it's not in the mainline margo tree so we'll
>> need to do a little more work to look at it.
>>
>> In the meantime, on a hunch, I found that I can make the
>> latency consistent on Cooley by altering the margo_init() arguments
>> to be (..., 0, -1) in server.c (meaning that no additional
>> execution streams are used at all; all Mercury progress and all RPC
>> handlers are executed using user-level threads in the process's
>> primary execution stream (OS thread)).
>>
>> It's expected that there would be some more jitter jumping across OS
>> threads for RPC handling, but it shouldn't be that extreme, regular,
>> or system-specific.
>>
>> Thanks again for the test case and the Apex instrumentation; this is
>> the sort of thing that's normally really hard to isolate.
>>
>> thanks,
>>
>> -Phil
>>
>> On 11/5/21 10:09 AM, Philip Davis wrote:
>>> That’s extremely interesting.
>>>
>>> Are there any internal timers in Margo that can tell what the delay
>>> was between the server’s progress thread queueing the rpc and the
>>> handler thread starting to handle it? If I’m understanding
>>> https://mochi.readthedocs.io/en/latest/general/03_rpc_model.html correctly,
>>> it seems to me like that is the most likely place for
>>> non-deterministic delay to be introduced by argobots in the client
>>> -> server direction.
>>>
>>> I just ran a quick test where I changed the number of handler
>>> threads to 5, and I saw no change in behavior (still 4 and 8, not 5
>>> and 10).
>>>
>>>
>>>> On Nov 4, 2021, at 9:04 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> I have some more interesting info to share from trying a few
>>>> different configurations.
>>>>
>>>> sm (on my laptop) and ofi+gni (on Theta) do not exhibit this
>>>> behavior; they have consistent performance across RPCs.
>>>>
>>>> ofi+verbs (on Cooley) shows the same thing you were seeing; the 4th
>>>> and 8th RPCs are slow.
>>>>
>>>> Based on the above, it would seem like a problem with the
>>>> libfabric/verbs path. But on Jerome's suggestion I tried some
>>>> other supported transports on Cooley as well. In particular I
>>>> ran the same benchmark (the same build in fact; I just compiled in
>>>> support for multiple transports and cycled through them in a
>>>> single job script with runtime options) with these combinations:
>>>>
>>>> • ofi+verbs
>>>> • ofi+tcp
>>>> • ofi+sockets
>>>> • bmi+tcp
>>>> All of them show the same thing! 4th and 8th RPCs are at least an
>>>> order of magnitude slower than the other RPCs. That was a
>>>> surprising result. The bmi+tcp one isn't even using libfabric at
>>>> all, even though they are all using the same underlying hardware.
>>>>
>>>> I'm not sure what to make of that yet. Possibly something with
>>>> threading or signalling?
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>> On 11/2/21 2:37 PM, Philip Davis wrote:
>>>>> I’m glad you were able to reproduce it on a different system,
>>>>> thanks for letting me know. I’m not sure what the overlaps between
>>>>> Frontera and Cooley are (that aren’t shared by Summit); a
>>>>> quick look shows they are both Intel and both FDR, but there are
>>>>> probably more salient details.
>>>>>
>>>>>> On Nov 2, 2021, at 2:24 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Ok. Interesting. I didn't see anything unusual in the timing on
>>>>>> my laptop with sm (other than it being a bit slow, but I wasn't
>>>>>> tuning or worrying about core affinity or anything). On Cooley,
>>>>>> a somewhat older Linux cluster with InfiniBand, I see the 4th and
>>>>>> 8th RPC delay you were talking about:
>>>>>>
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent GUID":0}},
>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent GUID":0}},
>>>>>>
>>>>>> (assuming the first is high due to connection establishment)
>>>>>>
>>>>>> I'll check some other systems/transports, but I wanted to
>>>>>> go ahead and share that I've been able to reproduce what you were
>>>>>> seeing.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> On 11/2/21 1:49 PM, Philip Davis wrote:
>>>>>>> Glad that’s working now.
>>>>>>>
>>>>>>> It is the put_wait events, and “dur” is the right field. Those
>>>>>>> units are microseconds.
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 2, 2021, at 1:12 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>> Thanks Philip, the "= {0};" initialization of that struct got
>>>>>>>> me going.
>>>>>>>>
>>>>>>>> I can run the test case now and it is producing output in
>>>>>>>> the client and server perf dirs. Just to sanity check what to
>>>>>>>> look for, I think the problem should be exhibited in
>>>>>>>> the "put_wait" or maybe "do_put" trace events on the client?
>>>>>>>> For example on my laptop I see this:
>>>>>>>>
>>>>>>>> carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep
>>>>>>>> do_put trace_events.0.json
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent GUID":0}},
>>>>>>>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent GUID":0}},
>>>>>>>> carns-x1-7g ~/w/d/d/m/client.perf> grep
>>>>>>>> put_wait trace_events.0.json
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent GUID":0}},
>>>>>>>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent GUID":0}},
>>>>>>>>
>>>>>>>> I should look at the "dur" field right? What are the units
>>>>>>>> on that?
>>>>>>>>
>>>>>>>> I'll see if I can run this on a "real" system shortly.
>>>>>>>>
>>>>>>>> thanks!
>>>>>>>>
>>>>>>>> -Phil
>>>>>>>>
>>>>>>>> On 11/2/21 12:11 PM, Philip Davis wrote:
>>>>>>>>> Hi Phil,
>>>>>>>>>
>>>>>>>>> Sorry the data structures are like that; I wanted to preserve
>>>>>>>>> as much of the RPC size and ordering as possible, in case it
>>>>>>>>> ended up being important.
>>>>>>>>>
>>>>>>>>> I’m surprised in.odsc.size is troublesome, as I set
>>>>>>>>> in.odsc.size with the line `in.odsc.size = sizeof(odsc);`. I’m
>>>>>>>>> not sure what could be corrupting that value in the meantime.
>>>>>>>>>
>>>>>>>>> I don’t set in.odsc.gdim_size (which was an oversight, since
>>>>>>>>> that’s non-zero normally), so I’m less surprised that’s
>>>>>>>>> an issue. I thought I initialized `in` to zero, but I see I
>>>>>>>>> didn’t do that after all.
>>>>>>>>>
>>>>>>>>> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Nov 2, 2021, at 11:48 AM, Phil Carns
>>>>>>>>>> <carns at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>> Awesome, thanks Philip. It came through fine.
>>>>>>>>>>
>>>>>>>>>> I started by modifying the job script slightly to just run it
>>>>>>>>>> on my laptop with sm (I wanted to make sure I understood
>>>>>>>>>> the test case, and how to use apex,
>>>>>>>>>> before trying elsewhere). Does in.size need to be set
>>>>>>>>>> in client.c? For me there is a random value in that
>>>>>>>>>> field and it is causing the encoder on
>>>>>>>>>> the forward to attempt a very large allocation. The same
>>>>>>>>>> might be true of gdim_size if it got past that step.
>>>>>>>>>> I started to alter them but then I wasn't sure what
>>>>>>>>>> the implications were.
>>>>>>>>>>
>>>>>>>>>> (fwiw I needed to include stdlib.h in common.h, but I've hit
>>>>>>>>>> that a couple of times recently on codes
>>>>>>>>>> that didn't previously generate warnings; I think something
>>>>>>>>>> in Ubuntu has gotten strict about that recently).
>>>>>>>>>>
>>>>>>>>>> thanks,
>>>>>>>>>>
>>>>>>>>>> -Phil
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>>>>>>>>>> Hi Phil,
>>>>>>>>>>>
>>>>>>>>>>> I’ve attached the reproducer. I see the 4th and 8th issue on
>>>>>>>>>>> Frontera, but not Summit. Hopefully it will build and run
>>>>>>>>>>> without too much modification. Let me know if there are
>>>>>>>>>>> any issues with running it (or if the anl listserv eats the
>>>>>>>>>>> tarball, which I kind of expect).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Philip
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns
>>>>>>>>>>>> <carns at mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Philip,
>>>>>>>>>>>>
>>>>>>>>>>>> (FYI I think the first image didn't come through in your
>>>>>>>>>>>> email, but I think the others are sufficient to get
>>>>>>>>>>>> across what you are seeing)
>>>>>>>>>>>>
>>>>>>>>>>>> I don't have any idea what would cause that.
>>>>>>>>>>>> The recently released libfabric 1.13.2 (available in spack
>>>>>>>>>>>> from the mochi-spack-packages repo) includes some fixes to
>>>>>>>>>>>> the rxm provider that could be relevant to Frontera
>>>>>>>>>>>> and Summit, but nothing that aligns with what you
>>>>>>>>>>>> are observing.
>>>>>>>>>>>>
>>>>>>>>>>>> If it were later in the sequence (much later) I
>>>>>>>>>>>> would speculate that memory allocation/deallocation cycles
>>>>>>>>>>>> were eventually causing a hiccup. We've seen
>>>>>>>>>>>> something like that in the past, and it's a theory that we
>>>>>>>>>>>> could then test
>>>>>>>>>>>> with alternative allocators like jemalloc. That's
>>>>>>>>>>>> not memory allocation jitter that early in the run though.
>>>>>>>>>>>>
>>>>>>>>>>>> Please do share your reproducer if you don't mind! We
>>>>>>>>>>>> can try a few systems here to at least isolate if it is
>>>>>>>>>>>> something peculiar to the InfiniBand path or if there is
>>>>>>>>>>>> a more general problem in Margo.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> -Phil
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I apologize in advance for the winding nature of
>>>>>>>>>>>>> this email. I’m not sure how to ask
>>>>>>>>>>>>> my question without explaining the story of my results some.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’m doing some characterization of our server performance
>>>>>>>>>>>>> under load, and I have a quirk of performance that I
>>>>>>>>>>>>> wanted to run by people to see if they make sense.
>>>>>>>>>>>>> My testing so far has been to iteratively send batches of
>>>>>>>>>>>>> RPCs using margo_iforward, and then measure the wait time
>>>>>>>>>>>>> until they are all complete. On the server side,
>>>>>>>>>>>>> handling the RPC includes a margo_bulk_transfer as a
>>>>>>>>>>>>> pull initiated on the server to pull (for now) 8 bytes.
>>>>>>>>>>>>> The payload of the RPC request is about 500 bytes, and
>>>>>>>>>>>>> the response payload is 4 bytes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’ve isolated my results down to one server rank and
>>>>>>>>>>>>> one client rank, because it’s an easier starting point to
>>>>>>>>>>>>> reason from. Below is a graph of some of
>>>>>>>>>>>>> my initial results. These results are from Frontera. The
>>>>>>>>>>>>> median times are good (a single RPC takes on the order of
>>>>>>>>>>>>> 10s of microseconds, which seems fantastic). However,
>>>>>>>>>>>>> the outliers are fairly high (note the log scale of
>>>>>>>>>>>>> the y-axis). With only one RPC per timestep, for
>>>>>>>>>>>>> example, there is a 30x spread between the median and the max.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was hoping (expecting) the first timestep would be where
>>>>>>>>>>>>> the long replies resided, but that turned out not to
>>>>>>>>>>>>> be the case. Below are traces from the 1 RPC (blue) and
>>>>>>>>>>>>> 2 RPC (orange) per timestep cases, 5 trials of
>>>>>>>>>>>>> 10 timesteps for each case (normalized to fit the same
>>>>>>>>>>>>> x-axis):
>>>>>>>>>>>>>
>>>>>>>>>>>>> <PastedGraphic-6.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> What strikes me is how consistent these results are
>>>>>>>>>>>>> across trials. For the 1 RPC per timestep case, the 3rd
>>>>>>>>>>>>> and 7th timestep are consistently slow (and the rest are
>>>>>>>>>>>>> fast). For the 2 RPC per timestep case, the 2nd and 4th
>>>>>>>>>>>>> timestep are always slow and sometimes the
>>>>>>>>>>>>> 10th is. These results are repeatable with very
>>>>>>>>>>>>> rare variation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the single RPC case, I recorded some timers on
>>>>>>>>>>>>> the server side, and attempted to overlay them with
>>>>>>>>>>>>> the client side (there is some unknown offset,
>>>>>>>>>>>>> but probably on the order of 10s of microseconds at
>>>>>>>>>>>>> worst, given the pattern):
>>>>>>>>>>>>>
>>>>>>>>>>>>> <PastedGraphic-7.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I blew up the first few timesteps of one of the trials:
>>>>>>>>>>>>> <PastedGraphic-8.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The different colors are different segments of the
>>>>>>>>>>>>> handler, but there doesn’t seem to be
>>>>>>>>>>>>> anything too interesting going on inside the handler. So
>>>>>>>>>>>>> it looks like the time is being introduced before the
>>>>>>>>>>>>> 3rd RPC handler starts, based on where the gap appears
>>>>>>>>>>>>> on the server side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To try and isolate any dataspaces-specific behavior,
>>>>>>>>>>>>> I created a pure Margo test case that just sends a
>>>>>>>>>>>>> single RPC of the same size
>>>>>>>>>>>>> as dataspaces iteratively, where the server side does an
>>>>>>>>>>>>> 8-byte bulk transfer initiated by the server, and sends
>>>>>>>>>>>>> a response. The results are similar, except that it is now
>>>>>>>>>>>>> the 4th and 8th timestep that are slow (and the
>>>>>>>>>>>>> first timestep is VERY long, presumably because
>>>>>>>>>>>>> rxm communication state is being established. DataSpaces
>>>>>>>>>>>>> has an earlier RPC in its init that was absorbing
>>>>>>>>>>>>> this latency).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I got margo profiling results for this test case:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> 3
>>>>>>>>>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>>>>>>>>>> 0xa2a1,term_rpc
>>>>>>>>>>>>> 0x27b5,put_rpc
>>>>>>>>>>>>> 0xd320,__shutdown__
>>>>>>>>>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>>>>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>>>>>>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>>>>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I guess my question at this point is, is there
>>>>>>>>>>>>> any sensible reason why the 4th and 8th RPC sent
>>>>>>>>>>>>> would have a long response time? I think I’ve cleared
>>>>>>>>>>>>> my code on the client side and server side, so it appears
>>>>>>>>>>>>> to be latency being introduced
>>>>>>>>>>>>> by Margo, LibFabric, Argobots, or the underlying OS. I do
>>>>>>>>>>>>> see long timesteps occasionally after this (perhaps
>>>>>>>>>>>>> every 20-30 timesteps) but these are not consistent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One last detail: this does not happen on Summit.
>>>>>>>>>>>>> On Summit, I see about
>>>>>>>>>>>>> 5-7x worse single-RPC performance (250-350 microseconds per
>>>>>>>>>>>>> RPC), but without the intermittent long timesteps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can provide the minimal test case if it would
>>>>>>>>>>>>> be helpful. I am using APEX for timing results, and the
>>>>>>>>>>>>> following dependencies with Spack:
>>>>>>>>>>>>>
>>>>>>>>>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1
>>>>>>>>>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Philip
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>