[Mochi-devel] RPC latency question

Phil Carns carns at mcs.anl.gov
Tue Nov 2 12:12:10 CDT 2021


Thanks Philip, the "= {0};" initialization of that struct got me going.
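
For the archive, the change amounts to something like this (just a sketch pieced together from your description below; `handle`, `req`, and `odsc` here are stand-ins for whatever the reproducer actually uses):

```
/* zero-initialize the RPC input so that odsc.gdim_size (and anything
 * else not set explicitly) doesn't go out as stack garbage */
bulk_gdim_t in = {0};
in.odsc.size = sizeof(odsc);      /* set explicitly, as before */

margo_request req;
hg_return_t hret = margo_iforward(handle, &in, &req);
```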

I can run the test case now, and it is producing output in the client and 
server perf dirs.  Just to sanity-check what to look for: the problem 
should show up in the "put_wait" (or maybe "do_put") trace events on the 
client, right?  For example, on my laptop I see this:

carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent 
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent 
GUID":0}},
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent 
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent 
GUID":0}},

I should look at the "dur" field, right?  What are the units on that?
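
In the meantime, here is a rough, untested sketch of pulling those "dur" values out with json-c instead of grep (json-c@0.15 is already in the spack environment).  It assumes the file parses as either a bare JSON array of events or a Chrome-trace-style {"traceEvents": [...]} wrapper, which may not match exactly what APEX writes:

```
/* dump_dur.c -- rough, untested sketch: print the "dur" field of every
 * do_put/put_wait event in a trace_events file using json-c.
 * Assumes the file parses as either a bare array of events or an
 * object with a "traceEvents" array; adjust to taste. */
#include <stdio.h>
#include <string.h>
#include <json-c/json.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "trace_events.0.json";
    struct json_object *root = json_object_from_file(path);
    if (!root) {
        fprintf(stderr, "could not parse %s\n", path);
        return 1;
    }

    struct json_object *events = root;
    if (!json_object_is_type(root, json_type_array) &&
        !json_object_object_get_ex(root, "traceEvents", &events)) {
        fprintf(stderr, "no event array found in %s\n", path);
        json_object_put(root);
        return 1;
    }

    size_t n = json_object_array_length(events);
    for (size_t i = 0; i < n; i++) {
        struct json_object *ev = json_object_array_get_idx(events, i);
        struct json_object *name, *dur;
        if (!json_object_object_get_ex(ev, "name", &name) ||
            !json_object_object_get_ex(ev, "dur", &dur))
            continue;
        const char *s = json_object_get_string(name);
        if (!strcmp(s, "do_put") || !strcmp(s, "put_wait"))
            printf("%-10s dur=%f\n", s, json_object_get_double(dur));
    }

    json_object_put(root);
    return 0;
}
```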

I'll see if I can run this on a "real" system shortly.

thanks!

-Phil

On 11/2/21 12:11 PM, Philip Davis wrote:
> Hi Phil,
>
> Sorry the data structures are like that; I wanted to preserve as much 
> of the RPC size and ordering in case it ended up being important.
>
> I’m surprised in.odsc.size is troublesome, as I set in.odsc.size with 
> the line `in.odsc.size = sizeof(odsc);`. I’m not sure what could be 
> corrupting that value in the meantime.
>
> I don’t set in.odsc.gdim_size (which was an oversight, since that’s 
> non-zero normally), so I’m less surprised that’s an issue. I thought I 
> initialized `in` to zero, but I see I didn’t do that after all.
>
> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
>
>
>
>> On Nov 2, 2021, at 11:48 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Awesome, thanks Philip.  It came through fine.
>>
>> I started by modifying the job script slightly to just run it on my 
>> laptop with sm (I wanted to make sure I understood the test case, and 
>> how to use apex, before trying elsewhere).  Does in.size need to be 
>> set in client.c?  For me there is a random value in that field and it 
>> is causing the encoder on the forward to attempt a very large 
>> allocation.  The same might be true of gdim_size if it got past that 
>> step.  I started to alter them but then I wasn't sure what the 
>> implications were.
>>
>> (fwiw I needed to include stdlib.h in common.h, but I've hit that a 
>> couple of times recently on codes that didn't previously generate 
>> warnings; I think something in Ubuntu has gotten strict about that 
>> recently).
>>
>> thanks,
>>
>> -Phil
>>
>>
>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>> Hi Phil,
>>>
>>> I’ve attached the reproducer. I see the 4th and 8th issue on 
>>> Frontera, but not Summit. Hopefully it will build and run without 
>>> too much modification. Let me know if there are any issues with 
>>> running it (or if the anl listserv eats the tarball, which I kind of 
>>> expect).
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> Hi Philip,
>>>>
>>>> (FYI I think the first image didn't come through in your email, but 
>>>> I think the others are sufficient to get across what you are seeing)
>>>>
>>>> I don't have any idea what would cause that.  The recently released 
>>>> libfabric 1.13.2 (available in spack from the mochi-spack-packages 
>>>> repo) includes some fixes to the rxm provider that could be 
>>>> relevant to Frontera and Summit, but nothing that aligns with what 
>>>> you are observing.
>>>>
>>>> If it were later in the sequence (much later) I would speculate 
>>>> that memory allocation/deallocation cycles were eventually causing 
>>>> a hiccup.  We've seen something like that in the past, and it's a 
>>>> theory that we could then test with alternative allocators like 
>>>> jemalloc.  That's not memory allocation jitter that early in the 
>>>> run though.
>>>>
>>>> Please do share your reproducer if you don't mind!  We can try a 
>>>> few systems here to at least isolate if it is something peculiar to 
>>>> the InfiniBand path or if there is a more general problem in Margo.
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>> Hello,
>>>>>
>>>>> I apologize in advance for the winding nature of this email. I’m 
>>>>> not sure how to ask my question without explaining the story of my 
>>>>> results some.
>>>>>
>>>>> I’m doing some characterization of our server performance under 
>>>>> load, and I have a quirk of performance that I wanted to run by 
>>>>> people to see if they make sense. My testing so far has been to 
>>>>> iteratively send batches of RPCs using margo_iforward, and then 
>>>>> measure the wait time until they are all complete. On the server 
>>>>> side, handling the RPC includes a margo_bulk_transfer as a pull 
>>>>> initiated on the server to pull (for now) 8 bytes. The payload of 
>>>>> the RPC request is about 500 bytes, and the response payload is 4 
>>>>> bytes.
>>>>>
>>>>> I’ve isolated my results down to one server rank and one client 
>>>>> rank, because it’s an easier starting point to reason from. Below 
>>>>> is a graph of some of my initial results. These results are from 
>>>>> Frontera. The median times are good (a single RPC takes on the 
>>>>> order of 10s of microseconds, which seems fantastic). However, the 
>>>>> outliers are fairly high (note the log scale of the y-axis). With 
>>>>> only one RPC per timestep, for example, there is a 30x spread 
>>>>> between the median and the max.
>>>>>
>>>>>
>>>>> I was hoping (expecting) the first timestep would be where the 
>>>>> long replies resided, but that turned out not to be the case. 
>>>>> Below are traces from the 1 RPC (blue) and 2 RPC (orange) per 
>>>>> timestep cases, 5 trials of 10 timesteps for each case (normalized 
>>>>> to fit the same x-axis):
>>>>>
>>>>> <PastedGraphic-6.png>
>>>>>
>>>>> What strikes me is how consistent these results are across trials. 
>>>>> For the 1 RPC per timestep case, the 3rd and 7th timestep are 
>>>>> consistently slow (and the rest are fast). For the 2 RPC per 
>>>>> timestep case, the 2nd and 4th timestep are always slow and 
>>>>> sometimes the 10th is. These results are repeatable with very rare 
>>>>> variation.
>>>>>
>>>>> For the single RPC case, I recorded some timers on the server 
>>>>> side, and attempted to overlay them with the client side (there is 
>>>>> some unknown offset, but probably on the order of 10s of 
>>>>> microseconds at worst, given the pattern):
>>>>>
>>>>> <PastedGraphic-7.png>
>>>>>
>>>>> I blew up the first few timesteps of one of the trials:
>>>>> <PastedGraphic-8.png>
>>>>>
>>>>> The different colors are different segments of the handler, but 
>>>>> there doesn’t seem to be anything too interesting going on inside 
>>>>> the handler. So it looks like the time is being introduced before 
>>>>> the 3rd RPC handler starts, based on where the gap appears on 
>>>>> the server side.
>>>>>
>>>>> To try and isolate any dataspaces-specific behavior, I created a 
>>>>> pure Margo test case that just sends a single rpc of the same size 
>>>>> as dataspaces iteratively, where the server side does an 8-byte 
>>>>> bulk transfer initiated by the server, and sends a response. The 
>>>>> results are similar, except that it is now the 4th and 8th 
>>>>> timestep that are slow (and the first timestep is VERY long, 
>>>>> presumably because rxm communication state is being established. 
>>>>> DataSpaces has an earlier RPC in its init that was absorbing this 
>>>>> latency).
>>>>>
>>>>> I got margo profiling results for this test case:
>>>>>
>>>>> ```
>>>>> 3
>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>> 0xa2a1,term_rpc
>>>>> 0x27b5,put_rpc
>>>>> 0xd320,__shutdown__
>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>> ```
>>>>>
>>>>> So I guess my question at this point is, is there any sensible 
>>>>> reason why the 4th and 8th RPC sent would have a long response 
>>>>> time? I think I’ve cleared my code on the client side and server 
>>>>> side, so it appears to be latency being introduced by Margo, 
>>>>> LibFabric, Argobots, or the underlying OS. I do see long timesteps 
>>>>> occasionally after this (perhaps every 20-30 timesteps) but these 
>>>>> are not consistent.
>>>>>
>>>>> One last detail: this does not happen on Summit. On Summit, I see 
>>>>> about 5-7x worse single-RPC performance (250-350 microseconds per 
>>>>> RPC), but without the intermittent long timesteps.
>>>>>
>>>>> I can provide the minimal test case if it would be helpful. I am 
>>>>> using APEX for timing results, and the following dependencies with 
>>>>> Spack:
>>>>>
>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1 
>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>
>>>>> Thanks,
>>>>> Philip
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mochi-devel mailing list
>>>>> mochi-devel at lists.mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>> _______________________________________________
>>>> mochi-devel mailing list
>>>> mochi-devel at lists.mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>
>