[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Tue Nov 2 13:24:30 CDT 2021
Ok. Interesting. I didn't see anything unusual in the timing on my
laptop with sm (other than it being a bit slow, but I wasn't tuning or
worrying about core affinity or anything). On Cooley, a somewhat older
Linux cluster with InfiniBand, I see the 4th and 8th RPC delay you were
talking about:
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent
GUID":0}},
(assuming the first is high due to connection establishment)
I'll check some other systems/transports, but I wanted to go ahead and
share that I've been able to reproduce what you were seeing.
thanks,
-Phil
On 11/2/21 1:49 PM, Philip Davis wrote:
> Glad that’s working now.
>
> It is the put_wait events, and “dur” is the right field. Those units
> are microseconds.
>
>
>> On Nov 2, 2021, at 1:12 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Thanks Philip, the "= {0};" initialization of that struct got me going.
>>
>> I can run the test case now and it is producing output in the client
>> and server perf dirs. Just to sanity check what to look for, I think
>> the problem should be exhibited in the "put_wait" or maybe "do_put"
>> trace events on the client? For example on my laptop I see this:
>>
>> carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent
>> GUID":0}},
>> carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent
>> GUID":0}},
>>
>> I should look at the "dur" field right? What are the units on that?
>>
>> I'll see if I can run this on a "real" system shortly.
>>
>> thanks!
>>
>> -Phil
>>
>> On 11/2/21 12:11 PM, Philip Davis wrote:
>>> Hi Phil,
>>>
>>> Sorry the data structures are like that; I wanted to preserve as
>>> much of the RPC size and ordering as possible in case it ended up
>>> being important.
>>>
>>> I’m surprised in.odsc.size is troublesome, as I set in.odsc.size
>>> with the line `in.odsc.size = sizeof(odsc);`. I’m not sure what
>>> could be corrupting that value in the meantime.
>>>
>>> I don’t set in.odsc.gdim_size (which was an oversight, since that’s
>>> non-zero normally), so I’m less surprised that’s an issue. I thought
>>> I initialized `in` to zero, but I see I didn’t do that after all.
>>>
>>> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
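>>>
>>> For illustration, a minimal sketch of the intended pattern (field
>>> names are taken from the snippets in this thread, so treat this as a
>>> sketch rather than the actual client.c):
>>>
>>> ```
>>> bulk_gdim_t in = {0};        /* zero-initialize every member first */
>>> in.odsc.size = sizeof(odsc); /* then set only the fields we need;  */
>>>                              /* untouched members (gdim_size       */
>>>                              /* included) now encode as zero       */
>>>                              /* instead of stack garbage           */
>>> ```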
>>>
>>>
>>>
>>>> On Nov 2, 2021, at 11:48 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> Awesome, thanks Philip. It came through fine.
>>>>
>>>> I started by modifying the job script slightly to just run it on my
>>>> laptop with sm (I wanted to make sure I understood the test case,
>>>> and how to use apex, before trying elsewhere). Does in.size need
>>>> to be set in client.c? For me there is a random value in that field
>>>> and it is causing the encoder on the forward to attempt a very
>>>> large allocation. The same might be true of gdim_size if it got
>>>> past that step. I started to alter them but then I wasn't sure
>>>> what the implications were.
>>>>
>>>> (fwiw I needed to include stdlib.h in common.h, but I've hit that a
>>>> couple of times recently on codes that didn't previously generate
>>>> warnings; I think something in Ubuntu has gotten strict about that
>>>> recently).
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>>
>>>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>>>> Hi Phil,
>>>>>
>>>>> I’ve attached the reproducer. I see the 4th and 8th issue on
>>>>> Frontera, but not Summit. Hopefully it will build and run without
>>>>> too much modification. Let me know if there are any issues with
>>>>> running it (or if the anl listserv eats the tarball, which I kind
>>>>> of expect).
>>>>>
>>>>> Thanks,
>>>>> Philip
>>>>>
>>>>>
>>>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>> (FYI I think the first image didn't come through in your email,
>>>>>> but I think the others are sufficient to get across what you are
>>>>>> seeing)
>>>>>>
>>>>>> I don't have any idea what would cause that. The recently
>>>>>> released libfabric 1.13.2 (available in spack from the
>>>>>> mochi-spack-packages repo) includes some fixes to the rxm
>>>>>> provider that could be relevant to Frontera and Summit, but
>>>>>> nothing that aligns with what you are observing.
>>>>>>
>>>>>> If it were later in the sequence (much later) I would speculate
>>>>>> that memory allocation/deallocation cycles were eventually
>>>>>> causing a hiccup. We've seen something like that in the past,
>>>>>> and it's a theory that we could then test with alternative
>>>>>> allocators like jemalloc. That's not memory allocation jitter
>>>>>> that early in the run though.
>>>>>>
>>>>>> Please do share your reproducer if you don't mind! We can try a
>>>>>> few systems here to at least isolate if it is something peculiar
>>>>>> to the InfiniBand path or if there is a more general problem in
>>>>>> Margo.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I apologize in advance for the winding nature of this email. I’m
>>>>>>> not sure how to ask my question without explaining the story of
>>>>>>> my results some.
>>>>>>>
>>>>>>> I’m doing some characterization of our server performance under
>>>>>>> load, and I have a quirk of performance that I wanted to run by
>>>>>>> people to see if they make sense. My testing so far has been to
>>>>>>> iteratively send batches of RPCs using margo_iforward, and then
>>>>>>> measure the wait time until they are all complete. On the server
>>>>>>> side, handling the RPC includes a margo_bulk_transfer as a pull
>>>>>>> initiated on the server to pull (for now) 8 bytes. The payload
>>>>>>> of the RPC request is about 500 bytes, and the response payload
>>>>>>> is 4 bytes.
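>>>>>>>
>>>>>>> To make the measurement concrete, here is a rough sketch of that
>>>>>>> client-side pattern (names like put_in_t and the fixed-size arrays
>>>>>>> are placeholders, not the actual benchmark code):
>>>>>>>
>>>>>>> ```
>>>>>>> #include <margo.h>
>>>>>>>
>>>>>>> /* one timestep: issue nrpcs puts with margo_iforward(), then wait
>>>>>>>  * for every response; the elapsed time is the measured wait time */
>>>>>>> static double do_timestep(margo_instance_id mid, hg_addr_t svr_addr,
>>>>>>>                           hg_id_t put_id, int nrpcs)
>>>>>>> {
>>>>>>>     hg_handle_t handles[64];     /* assumes nrpcs <= 64 */
>>>>>>>     margo_request reqs[64];
>>>>>>>     put_in_t in = {0};           /* ~500-byte request payload */
>>>>>>>     double start = ABT_get_wtime();
>>>>>>>
>>>>>>>     for (int i = 0; i < nrpcs; i++) {
>>>>>>>         margo_create(mid, svr_addr, put_id, &handles[i]);
>>>>>>>         margo_iforward(handles[i], &in, &reqs[i]);  /* non-blocking */
>>>>>>>     }
>>>>>>>     for (int i = 0; i < nrpcs; i++) {
>>>>>>>         margo_wait(reqs[i]);     /* block until RPC i completes */
>>>>>>>         margo_destroy(handles[i]);
>>>>>>>     }
>>>>>>>     return ABT_get_wtime() - start;
>>>>>>> }
>>>>>>> ```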
>>>>>>>
>>>>>>> I’ve isolated my results down to one server rank and one client
>>>>>>> rank, because it’s an easier starting point to reason from.
>>>>>>> Below is a graph of some of my initial results. These results
>>>>>>> are from Frontera. The median times are good (a single RPC takes
>>>>>>> on the order of 10s of microseconds, which seems fantastic).
>>>>>>> However, the outliers are fairly high (note the log scale of the
>>>>>>> y-axis). With only one RPC per timestep, for example, there is a
>>>>>>> 30x spread between the median and the max.
>>>>>>>
>>>>>>>
>>>>>>> I was hoping (expecting) the first timestep would be where the
>>>>>>> long replies resided, but that turned out not to be the case.
>>>>>>> Below are traces from the 1 RPC (blue) and 2 RPC (orange) per
>>>>>>> timestep cases, 5 trials of 10 timesteps for each case
>>>>>>> (normalized to fit the same x-axis):
>>>>>>>
>>>>>>> <PastedGraphic-6.png>
>>>>>>>
>>>>>>> What strikes me is how consistent these results are across
>>>>>>> trials. For the 1 RPC per timestep case, the 3rd and 7th
>>>>>>> timesteps are consistently slow (and the rest are fast). For the
>>>>>>> 2 RPC per timestep case, the 2nd and 4th timesteps are always
>>>>>>> slow and sometimes the 10th is. These results are repeatable
>>>>>>> with very rare variation.
>>>>>>>
>>>>>>> For the single RPC case, I recorded some timers on the server
>>>>>>> side, and attempted to overlay them with the client side (there
>>>>>>> is some unknown offset, but probably on the order of 10s of
>>>>>>> microseconds at worst, given the pattern):
>>>>>>>
>>>>>>> <PastedGraphic-7.png>
>>>>>>>
>>>>>>> I blew up the first few timesteps of one of the trials:
>>>>>>> <PastedGraphic-8.png>
>>>>>>>
>>>>>>> The different colors are different segments of the handler, but
>>>>>>> there doesn’t seem to be anything too interesting going on
>>>>>>> inside the handler. So it looks like the time is being
>>>>>>> introduced before the 3rd RPC handler starts, based on where
>>>>>>> the gap appears on the server side.
>>>>>>>
>>>>>>> To try and isolate any dataspaces-specific behavior, I created a
>>>>>>> pure Margo test case that just sends a single RPC of the same
>>>>>>> size as DataSpaces iteratively, where the server side does an
>>>>>>> 8-byte bulk transfer initiated by the server, and sends a
>>>>>>> response. The results are similar, except that it is now the 4th
>>>>>>> and 8th timesteps that are slow (and the first timestep is VERY
>>>>>>> long, presumably because rxm communication state is being
>>>>>>> established. DataSpaces has an earlier RPC in its init that was
>>>>>>> absorbing this latency).
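>>>>>>>
>>>>>>> For reference, a minimal sketch of what such a handler looks like
>>>>>>> with the Margo API (struct and field names like put_in_t and
>>>>>>> in.bulk are placeholders, not the real test case; error checking
>>>>>>> omitted):
>>>>>>>
>>>>>>> ```
>>>>>>> #include <margo.h>
>>>>>>>
>>>>>>> /* pull 8 bytes from the client, then send a small response */
>>>>>>> static void put_rpc_ult(hg_handle_t handle)
>>>>>>> {
>>>>>>>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>>>>>>>     const struct hg_info *info = margo_get_info(handle);
>>>>>>>     put_in_t in;                   /* placeholder input struct */
>>>>>>>     put_out_t out = {0};           /* placeholder 4-byte response */
>>>>>>>     char data[8] = {0};
>>>>>>>     void *buf = data;
>>>>>>>     hg_size_t size = sizeof(data); /* 8-byte pull */
>>>>>>>     hg_bulk_t local_bulk;
>>>>>>>
>>>>>>>     margo_get_input(handle, &in);
>>>>>>>
>>>>>>>     /* expose a local buffer and pull the client's data into it */
>>>>>>>     margo_bulk_create(mid, 1, &buf, &size, HG_BULK_WRITE_ONLY,
>>>>>>>                       &local_bulk);
>>>>>>>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk, 0,
>>>>>>>                         local_bulk, 0, size);
>>>>>>>
>>>>>>>     margo_respond(handle, &out);
>>>>>>>
>>>>>>>     margo_bulk_free(local_bulk);
>>>>>>>     margo_free_input(handle, &in);
>>>>>>>     margo_destroy(handle);
>>>>>>> }
>>>>>>> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>>>>>>> ```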
>>>>>>>
>>>>>>> I got margo profiling results for this test case:
>>>>>>>
>>>>>>> ```
>>>>>>> 3
>>>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>>>> 0xa2a1,term_rpc
>>>>>>> 0x27b5,put_rpc
>>>>>>> 0xd320,__shutdown__
>>>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>>>> ```
>>>>>>>
>>>>>>> So I guess my question at this point is, is there any sensible
>>>>>>> reason why the 4th and 8th RPC sent would have a long response
>>>>>>> time? I think I’ve cleared my code on the client side and server
>>>>>>> side, so it appears to be latency being introduced by Margo,
>>>>>>> LibFabric, Argobots, or the underlying OS. I do see long
>>>>>>> timesteps occasionally after this (perhaps every 20-30
>>>>>>> timesteps) but these are not consistent.
>>>>>>>
>>>>>>> One last detail: this does not happen on Summit. On Summit, I
>>>>>>> see about 5-7x worse single-RPC performance (250-350
>>>>>>> microseconds per RPC), but without the intermittent long timesteps.
>>>>>>>
>>>>>>> I can provide the minimal test case if it would be helpful. I am
>>>>>>> using APEX for timing results, and the following dependencies
>>>>>>> with Spack:
>>>>>>>
>>>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1
>>>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Philip
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mochi-devel mailing list
>>>>>>> mochi-devel at lists.mcs.anl.gov
>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>>>> _______________________________________________
>>>>>> mochi-devel mailing list
>>>>>> mochi-devel at lists.mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>>>
>>>
>