[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Tue Nov 2 12:12:10 CDT 2021
Thanks Philip, the "= {0};" initialization of that struct got me going.
I can run the test case now and it is producing output in the client and
server perf dirs. Just to sanity check what to look for, I think the
problem should be exhibited in the "put_wait" or maybe "do_put" trace
events on the client? For example on my laptop I see this:
carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent
GUID":0}},
{"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent
GUID":0}},
carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent
GUID":0}},
I should look at the "dur" field right? What are the units on that?
I'll see if I can run this on a "real" system shortly.
thanks!
-Phil
On 11/2/21 12:11 PM, Philip Davis wrote:
> Hi Phil,
>
> Sorry the data structures are like that; I wanted to preserve as much
> of the RPC size and ordering as possible, in case it ended up being
> important.
>
> I’m surprised in.odsc.size is troublesome, as I set in.odsc.size with
> the line `in.odsc.size = sizeof(odsc);`. I’m not sure what could be
> corrupting that value in the meantime.
>
> I don’t set in.odsc.gdim_size (which was an oversight, since that’s
> non-zero normally), so I’m less surprised that’s an issue. I thought I
> initialized `in` to zero, but I see I didn’t do that after all.
>
> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
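>
> A minimal sketch of that change, for reference (bulk_gdim_t and odsc
> come from the reproducer; only the fields mentioned above are shown):
>
> ```
> /* illustrative only; bulk_gdim_t and odsc are defined in the reproducer */
> bulk_gdim_t in = {0};           /* zero every field, including odsc.gdim_size */
> in.odsc.size = sizeof(odsc);    /* as already done in client.c */
> /* in.odsc.gdim_size stays 0, since this test never sets it */
> ```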
>
>
>
>> On Nov 2, 2021, at 11:48 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Awesome, thanks Philip. It came through fine.
>>
>> I started by modifying the job script slightly to just run it on my
>> laptop with sm (I wanted to make sure I understood the test case, and
>> how to use apex, before trying elsewhere). Does in.size need to be
>> set in client.c? For me there is a random value in that field and it
>> is causing the encoder on the forward to attempt a very large
>> allocation. The same might be true of gdim_size if it got past that
>> step. I started to alter them but then I wasn't sure what the
>> implications were.
>>
>> (FWIW I needed to include stdlib.h in common.h, but I've hit that a
>> couple of times recently in codes that didn't previously generate
>> warnings; I think something in Ubuntu has gotten stricter about that.)
>>
>> thanks,
>>
>> -Phil
>>
>>
>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>> Hi Phil,
>>>
>>> I’ve attached the reproducer. I see the 4th- and 8th-timestep issue on
>>> Frontera, but not on Summit. Hopefully it will build and run without
>>> too much modification. Let me know if there are any issues with
>>> running it (or if the anl listserv eats the tarball, which I kind of
>>> expect).
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> Hi Philip,
>>>>
>>>> (FYI I think the first image didn't come through in your email, but
>>>> I think the others are sufficient to get across what you are seeing)
>>>>
>>>> I don't have any idea what would cause that. The recently released
>>>> libfabric 1.13.2 (available in spack from the mochi-spack-packages
>>>> repo) includes some fixes to the rxm provider that could be
>>>> relevant to Frontera and Summit, but nothing that aligns with what
>>>> you are observing.
>>>>
>>>> If it were later in the sequence (much later) I would speculate
>>>> that memory allocation/deallocation cycles were eventually causing
>>>> a hiccup. We've seen something like that in the past, and it's a
>>>> theory that we could then test with alternative allocators like
>>>> jemalloc. That's not memory allocation jitter that early in the
>>>> run though.
>>>>
>>>> Please do share your reproducer if you don't mind! We can try a
>>>> few systems here to at least isolate if it is something peculiar to
>>>> the InfiniBand path or if there is a more general problem in Margo.
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>> Hello,
>>>>>
>>>>> I apologize in advance for the winding nature of this email. I’m
>>>>> not sure how to ask my question without telling some of the story
>>>>> behind my results.
>>>>>
>>>>> I’m doing some characterization of our server performance under
>>>>> load, and I have a quirk of performance that I wanted to run by
>>>>> people to see if it makes sense. My testing so far has been to
>>>>> iteratively send batches of RPCs using margo_iforward, and then
>>>>> measure the wait time until they are all complete. On the server
>>>>> side, handling the RPC includes a margo_bulk_transfer as a pull
>>>>> initiated on the server to pull (for now) 8 bytes. The payload of
>>>>> the RPC request is about 500 bytes, and the response payload is 4
>>>>> bytes.
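>>>>>
>>>>> As a rough sketch of that pattern (not the actual reproducer code;
>>>>> the RPC id, input struct, and batch size below are placeholder
>>>>> names):
>>>>>
>>>>> ```
>>>>> /* Hypothetical sketch of the client-side batching described above;
>>>>>  * error checking omitted. */
>>>>> #include <margo.h>
>>>>> #include <abt.h>
>>>>>
>>>>> #define BATCH 2   /* RPCs issued per timestep */
>>>>>
>>>>> static double send_batch(margo_instance_id mid, hg_addr_t server_addr,
>>>>>                          hg_id_t put_rpc_id, void *in)
>>>>> {
>>>>>     hg_handle_t   handles[BATCH];
>>>>>     margo_request reqs[BATCH];
>>>>>
>>>>>     /* issue the whole batch with non-blocking forwards */
>>>>>     for (int i = 0; i < BATCH; i++) {
>>>>>         margo_create(mid, server_addr, put_rpc_id, &handles[i]);
>>>>>         margo_iforward(handles[i], in, &reqs[i]);  /* ~500-byte request */
>>>>>     }
>>>>>
>>>>>     /* measure the wait until every RPC in the batch has completed */
>>>>>     double start = ABT_get_wtime();
>>>>>     for (int i = 0; i < BATCH; i++) {
>>>>>         margo_wait(reqs[i]);                       /* ~4-byte response */
>>>>>         margo_destroy(handles[i]);
>>>>>     }
>>>>>     return ABT_get_wtime() - start;
>>>>> }
>>>>> ```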
>>>>>
>>>>> I’ve isolated my results down to one server rank and one client
>>>>> rank, because it’s an easier starting point to reason from. Below
>>>>> is a graph of some of my initial results. These results are from
>>>>> Frontera. The median times are good (a single RPC takes on the
>>>>> order of 10s of microseconds, which seems fantastic). However, the
>>>>> outliers are fairly high (note the log scale of the y-axis). With
>>>>> only one RPC per timestep, for example, there is a 30x spread
>>>>> between the median and the max.
>>>>>
>>>>>
>>>>> I was hoping (expecting) the first timestep would be where the
>>>>> long replies resided, but that turned out not to be the case.
>>>>> Below are traces from the 1 RPC (blue) and 2 RPC (orange) per
>>>>> timestep cases, 5 trials of 10 timesteps for each case (normalized
>>>>> to fit the same x-axis):
>>>>>
>>>>> <PastedGraphic-6.png>
>>>>>
>>>>> What strikes me is how consistent these results are across trials.
>>>>> For the 1 RPC per timestep case, the 3rd and 7th timesteps are
>>>>> consistently slow (and the rest are fast). For the 2 RPC per
>>>>> timestep case, the 2nd and 4th timesteps are always slow and
>>>>> sometimes the 10th is. These results are repeatable with very rare
>>>>> variation.
>>>>>
>>>>> For the single RPC case, I recorded some timers on the server
>>>>> side, and attempted to overlay them with the client side (there is
>>>>> some unknown offset, but probably on the order of 10s of
>>>>> microseconds at worst, given the pattern):
>>>>>
>>>>> <PastedGraphic-7.png>
>>>>>
>>>>> I blew up the first few timesteps of one of the trials:
>>>>> <PastedGraphic-8.png>
>>>>>
>>>>> The different colors are different segments of the handler, but
>>>>> there doesn’t seem to be anything too interesting going on inside
>>>>> the handler. So it looks like the time is being introduced before
>>>>> the 3rd RPC handler starts, based on where the gap appears on
>>>>> the server side.
>>>>>
>>>>> To try to isolate any dataspaces-specific behavior, I created a
>>>>> pure Margo test case that just sends a single RPC of the same size
>>>>> as dataspaces iteratively, where the server side does an 8-byte
>>>>> bulk transfer initiated by the server, and sends a response. The
>>>>> results are similar, except that it is now the 4th and 8th
>>>>> timesteps that are slow (and the first timestep is VERY long,
>>>>> presumably because rxm communication state is being established;
>>>>> DataSpaces has an earlier RPC in its init that was absorbing this
>>>>> latency).
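>>>>>
>>>>> For reference, the handler pattern being timed on the server side is
>>>>> roughly the following (a sketch only; put_in_t, put_out_t, and the
>>>>> bulk_handle field are placeholder names, not the test case’s actual
>>>>> ones):
>>>>>
>>>>> ```
>>>>> /* Hypothetical sketch: pull 8 bytes from the client via a
>>>>>  * server-initiated RDMA transfer, then send a small response.
>>>>>  * Error checking omitted. */
>>>>> #include <stdint.h>
>>>>> #include <margo.h>
>>>>>
>>>>> static void put_rpc_ult(hg_handle_t handle)
>>>>> {
>>>>>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>>>>>     const struct hg_info *info = margo_get_info(handle);
>>>>>
>>>>>     put_in_t in;
>>>>>     margo_get_input(handle, &in);
>>>>>
>>>>>     uint64_t  buf = 0;
>>>>>     void     *ptr = &buf;
>>>>>     hg_size_t sz  = sizeof(buf);               /* 8 bytes */
>>>>>     hg_bulk_t local_bulk;
>>>>>     margo_bulk_create(mid, 1, &ptr, &sz, HG_BULK_WRITE_ONLY, &local_bulk);
>>>>>
>>>>>     /* server-initiated pull of 8 bytes from the client */
>>>>>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk_handle, 0,
>>>>>                         local_bulk, 0, sz);
>>>>>
>>>>>     put_out_t out = {0};                       /* small (~4-byte) response */
>>>>>     margo_respond(handle, &out);
>>>>>
>>>>>     margo_bulk_free(local_bulk);
>>>>>     margo_free_input(handle, &in);
>>>>>     margo_destroy(handle);
>>>>> }
>>>>> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>>>>> ```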
>>>>>
>>>>> I got margo profiling results for this test case:
>>>>>
>>>>> ```
>>>>> 3
>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>> 0xa2a1,term_rpc
>>>>> 0x27b5,put_rpc
>>>>> 0xd320,__shutdown__
>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>> ```
>>>>>
>>>>> So I guess my question at this point is, is there any sensible
>>>>> reason why the 4th and 8th RPC sent would have a long response
>>>>> time? I think I’ve ruled out my code on the client side and server
>>>>> side, so it appears to be latency being introduced by Margo,
>>>>> LibFabric, Argobots, or the underlying OS. I do see long timesteps
>>>>> occasionally after this (perhaps every 20-30 timesteps) but these
>>>>> are not consistent.
>>>>>
>>>>> One last detail: this does not happen on Summit. On Summit, I see
>>>>> about 5-7x worse single-RPC performance (250-350 microseconds per
>>>>> RPC), but without the intermittent long timesteps.
>>>>>
>>>>> I can provide the minimal test case if it would be helpful. I am
>>>>> using APEX for timing results, and the following dependencies with
>>>>> Spack:
>>>>>
>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1
>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>
>>>>> Thanks,
>>>>> Philip
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mochi-devel mailing list
>>>>> mochi-devel at lists.mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>> _______________________________________________
>>>> mochi-devel mailing list
>>>> mochi-devel at lists.mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>
>