[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Tue Nov 2 13:24:30 CDT 2021
Ok. Interesting. I didn't see anything unusual in the timing on my
laptop with sm (other than it being a bit slow, but I wasn't tuning or
worrying about core affinity or anything). On Cooley, a somewhat older
Linux cluster with InfiniBand, I see the 4th and 8th RPC delay you were
talking about:
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047385464.750000,"dur":33077.620054,"args":{"GUID":3,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047418850.000000,"dur":458.322054,"args":{"GUID":5,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419519.250000,"dur":205.328054,"args":{"GUID":7,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047419939.500000,"dur":2916.470054,"args":{"GUID":9,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423046.750000,"dur":235.460054,"args":{"GUID":11,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423426.000000,"dur":208.722054,"args":{"GUID":13,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047423809.000000,"dur":155.962054,"args":{"GUID":15,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047424096.250000,"dur":3573.288054,"args":{"GUID":17,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047427857.000000,"dur":243.386054,"args":{"GUID":19,"Parent
GUID":0}},
{"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635877047428328.000000,"dur":154.338054,"args":{"GUID":21,"Parent
GUID":0}},
(assuming the first is high due to connection establishment)
I'll check some other systems/transports, but I wanted to go ahead and
share that I've been able to reproduce what you were seeing.
thanks,
-Phil
On 11/2/21 1:49 PM, Philip Davis wrote:
> Glad that’s working now.
>
> It is the put_wait events, and “dur” is the right field. Those units
> are microseconds.
>
>
>> On Nov 2, 2021, at 1:12 PM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Thanks Philip, the "= {0};" initialization of that struct got me going.
>>
>> I can run the test case now and it is producing output in the client
>> and server perf dirs. Just to sanity check what to look for, I think
>> the problem should be exhibited in the "put_wait" or maybe "do_put"
>> trace events on the client? For example on my laptop I see this:
>>
>> carns-x1-7g ~/w/d/d/m/client.perf [SIGINT]> grep do_put trace_events.0.json
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352591977.250000,"dur":350.464053,"args":{"GUID":2,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593065.000000,"dur":36.858053,"args":{"GUID":4,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593617.000000,"dur":32.954053,"args":{"GUID":6,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594193.000000,"dur":36.026053,"args":{"GUID":8,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594850.750000,"dur":34.404053,"args":{"GUID":10,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595400.750000,"dur":33.524053,"args":{"GUID":12,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595927.500000,"dur":34.390053,"args":{"GUID":14,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596416.000000,"dur":37.922053,"args":{"GUID":16,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596870.000000,"dur":35.506053,"args":{"GUID":18,"Parent
>> GUID":0}},
>> {"name":"do_put","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597287.500000,"dur":34.774053,"args":{"GUID":20,"Parent
>> GUID":0}},
>> carns-x1-7g ~/w/d/d/m/client.perf> grep put_wait trace_events.0.json
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352592427.750000,"dur":570.428053,"args":{"GUID":3,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593122.750000,"dur":429.156053,"args":{"GUID":5,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352593671.250000,"dur":465.616053,"args":{"GUID":7,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594248.500000,"dur":547.054053,"args":{"GUID":9,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352594906.750000,"dur":428.964053,"args":{"GUID":11,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595455.750000,"dur":416.796053,"args":{"GUID":13,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352595981.250000,"dur":371.040053,"args":{"GUID":15,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596485.500000,"dur":334.758053,"args":{"GUID":17,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352596934.250000,"dur":298.168053,"args":{"GUID":19,"Parent
>> GUID":0}},
>> {"name":"put_wait","cat":"CPU","ph":"X","pid":0,"tid":0,"ts":1635872352597342.250000,"dur":389.624053,"args":{"GUID":21,"Parent
>> GUID":0}},
>>
>> I should look at the "dur" field right? What are the units on that?
>>
>> I'll see if I can run this on a "real" system shortly.
>>
>> thanks!
>>
>> -Phil
>>
>> On 11/2/21 12:11 PM, Philip Davis wrote:
>>> Hi Phil,
>>>
>>> Sorry the data structures are like that; I wanted to preserve as
>>> much of the RPC size and ordering as possible in case it ended up
>>> being important.
>>>
>>> I’m surprised in.odsc.size is troublesome, as I set in.odsc.size
>>> with the line `in.odsc.size = sizeof(odsc);`. I’m not sure what
>>> could be corrupting that value in the meantime.
>>>
>>> I don’t set in.odsc.gdim_size (which was an oversight, since that’s
>>> non-zero normally), so I’m less surprised that’s an issue. I thought
>>> I initialized `in` to zero, but I see I didn’t do that after all.
>>>
>>> Maybe change the line `bulk_gdim_t in;` to `bulk_gdim_t in = {0};`
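>>>
>>> For illustration, a minimal sketch of the intended pattern (field
>>> names are taken from the snippets in this thread, so treat this as a
>>> sketch rather than the actual client.c):
>>>
>>> ```
>>> bulk_gdim_t in = {0};        /* zero-initialize every member first */
>>> in.odsc.size = sizeof(odsc); /* then set only the fields we need;  */
>>>                              /* untouched members (gdim_size       */
>>>                              /* included) now encode as zero       */
>>>                              /* instead of stack garbage           */
>>> ```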
>>>
>>>
>>>
>>>> On Nov 2, 2021, at 11:48 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>
>>>> Awesome, thanks Philip. It came through fine.
>>>>
>>>> I started by modifying the job script slightly to just run it on my
>>>> laptop with sm (I wanted to make sure I understood the test case,
>>>> and how to use apex, before trying elsewhere). Does in.size need
>>>> to be set in client.c? For me there is a random value in that field
>>>> and it is causing the encoder on the forward to attempt a very
>>>> large allocation. The same might be true of gdim_size if it got
>>>> past that step. I started to alter them but then I wasn't sure
>>>> what the implications were.
>>>>
>>>> (fwiw I needed to include stdlib.h in common.h, but I've hit that a
>>>> couple of times recently on codes that didn't previously generate
>>>> warnings; I think something in Ubuntu has gotten strict about that
>>>> recently).
>>>>
>>>> thanks,
>>>>
>>>> -Phil
>>>>
>>>>
>>>> On 11/1/21 4:51 PM, Philip Davis wrote:
>>>>> Hi Phil,
>>>>>
>>>>> I’ve attached the reproducer. I see the 4th and 8th issue on
>>>>> Frontera, but not Summit. Hopefully it will build and run without
>>>>> too much modification. Let me know if there are any issues with
>>>>> running it (or if the anl listserv eats the tarball, which I kind
>>>>> of expect).
>>>>>
>>>>> Thanks,
>>>>> Philip
>>>>>
>>>>>
>>>>>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>> (FYI I think the first image didn't come through in your email,
>>>>>> but I think the others are sufficient to get across what you are
>>>>>> seeing)
>>>>>>
>>>>>> I don't have any idea what would cause that. The recently
>>>>>> released libfabric 1.13.2 (available in spack from the
>>>>>> mochi-spack-packages repo) includes some fixes to the rxm
>>>>>> provider that could be relevant to Frontera and Summit, but
>>>>>> nothing that aligns with what you are observing.
>>>>>>
>>>>>> If it were later in the sequence (much later) I would speculate
>>>>>> that memory allocation/deallocation cycles were eventually
>>>>>> causing a hiccup. We've seen something like that in the past,
>>>>>> and it's a theory that we could then test with alternative
>>>>>> allocators like jemalloc. That's not memory allocation jitter
>>>>>> that early in the run though.
>>>>>>
>>>>>> Please do share your reproducer if you don't mind! We can try a
>>>>>> few systems here to at least isolate if it is something peculiar
>>>>>> to the InfiniBand path or if there is a more general problem in
>>>>>> Margo.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I apologize in advance for the winding nature of this email. I’m
>>>>>>> not sure how to ask my question without explaining the story of
>>>>>>> my results some.
>>>>>>>
>>>>>>> I’m doing some characterization of our server performance under
>>>>>>> load, and I have a quirk of performance that I wanted to run by
>>>>>>> people to see if they make sense. My testing so far has been to
>>>>>>> iteratively send batches of RPCs using margo_iforward, and then
>>>>>>> measure the wait time until they are all complete. On the server
>>>>>>> side, handling the RPC includes a margo_bulk_transfer as a pull
>>>>>>> initiated on the server to pull (for now) 8 bytes. The payload
>>>>>>> of the RPC request is about 500 bytes, and the response payload
>>>>>>> is 4 bytes.
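>>>>>>>
>>>>>>> To make the measurement concrete, here is a rough sketch of that
>>>>>>> client-side pattern (names like put_in_t and the fixed-size arrays
>>>>>>> are placeholders, not the actual benchmark code):
>>>>>>>
>>>>>>> ```
>>>>>>> #include <margo.h>
>>>>>>>
>>>>>>> /* one timestep: issue nrpcs puts with margo_iforward(), then wait
>>>>>>>  * for every response; the elapsed time is the measured wait time */
>>>>>>> static double do_timestep(margo_instance_id mid, hg_addr_t svr_addr,
>>>>>>>                           hg_id_t put_id, int nrpcs)
>>>>>>> {
>>>>>>>     hg_handle_t handles[64];     /* assumes nrpcs <= 64 */
>>>>>>>     margo_request reqs[64];
>>>>>>>     put_in_t in = {0};           /* ~500-byte request payload */
>>>>>>>     double start = ABT_get_wtime();
>>>>>>>
>>>>>>>     for (int i = 0; i < nrpcs; i++) {
>>>>>>>         margo_create(mid, svr_addr, put_id, &handles[i]);
>>>>>>>         margo_iforward(handles[i], &in, &reqs[i]);  /* non-blocking */
>>>>>>>     }
>>>>>>>     for (int i = 0; i < nrpcs; i++) {
>>>>>>>         margo_wait(reqs[i]);     /* block until RPC i completes */
>>>>>>>         margo_destroy(handles[i]);
>>>>>>>     }
>>>>>>>     return ABT_get_wtime() - start;
>>>>>>> }
>>>>>>> ```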
>>>>>>>
>>>>>>> I’ve isolated my results down to one server rank and one client
>>>>>>> rank, because it’s an easier starting point to reason from.
>>>>>>> Below is a graph of some of my initial results. These results
>>>>>>> are from Frontera. The median times are good (a single RPC takes
>>>>>>> on the order of 10s of microseconds, which seems fantastic).
>>>>>>> However, the outliers are fairly high (note the log scale of the
>>>>>>> y-axis). With only one RPC per timestep, for example, there is a
>>>>>>> 30x spread between the median and the max.
>>>>>>>
>>>>>>>
>>>>>>> I was hoping (expecting) the first timestep would be where the
>>>>>>> long replies resided, but that turned out not to be the case.
>>>>>>> Below are traces from the 1 RPC (blue) and 2 RPC (orange) per
>>>>>>> timestep cases, 5 trials of 10 timesteps for each case
>>>>>>> (normalized to fit the same x-axis):
>>>>>>>
>>>>>>> <PastedGraphic-6.png>
>>>>>>>
>>>>>>> What strikes me is how consistent these results are across
>>>>>>> trials. For the 1 RPC per timestep case, the 3rd and 7th
>>>>>>> timesteps are consistently slow (and the rest are fast). For the
>>>>>>> 2 RPC per timestep case, the 2nd and 4th timesteps are always
>>>>>>> slow and sometimes the 10th is. These results are repeatable
>>>>>>> with very rare variation.
>>>>>>>
>>>>>>> For the single RPC case, I recorded some timers on the server
>>>>>>> side, and attempted to overlay them with the client side (there
>>>>>>> is some unknown offset, but probably on the order of 10s of
>>>>>>> microseconds at worst, given the pattern):
>>>>>>>
>>>>>>> <PastedGraphic-7.png>
>>>>>>>
>>>>>>> I blew up the first few timesteps of one of the trials:
>>>>>>> <PastedGraphic-8.png>
>>>>>>>
>>>>>>> The different colors are different segments of the handler, but
>>>>>>> there doesn’t seem to be anything too interesting going on
>>>>>>> inside the handler. So it looks like the time is being
>>>>>>> introduced before the 3rd RPC handler starts, based on where
>>>>>>> the gap appears on the server side.
>>>>>>>
>>>>>>> To try and isolate any dataspaces-specific behavior, I created a
>>>>>>> pure Margo test case that just sends a single RPC of the same
>>>>>>> size as DataSpaces iteratively, where the server side does an
>>>>>>> 8-byte bulk transfer initiated by the server, and sends a
>>>>>>> response. The results are similar, except that it is now the 4th
>>>>>>> and 8th timesteps that are slow (and the first timestep is VERY
>>>>>>> long, presumably because rxm communication state is being
>>>>>>> established. DataSpaces has an earlier RPC in its init that was
>>>>>>> absorbing this latency).
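>>>>>>>
>>>>>>> For reference, a minimal sketch of what such a handler looks like
>>>>>>> with the Margo API (struct and field names like put_in_t and
>>>>>>> in.bulk are placeholders, not the real test case; error checking
>>>>>>> omitted):
>>>>>>>
>>>>>>> ```
>>>>>>> #include <margo.h>
>>>>>>>
>>>>>>> /* pull 8 bytes from the client, then send a small response */
>>>>>>> static void put_rpc_ult(hg_handle_t handle)
>>>>>>> {
>>>>>>>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>>>>>>>     const struct hg_info *info = margo_get_info(handle);
>>>>>>>     put_in_t in;                   /* placeholder input struct */
>>>>>>>     put_out_t out = {0};           /* placeholder 4-byte response */
>>>>>>>     char data[8] = {0};
>>>>>>>     void *buf = data;
>>>>>>>     hg_size_t size = sizeof(data); /* 8-byte pull */
>>>>>>>     hg_bulk_t local_bulk;
>>>>>>>
>>>>>>>     margo_get_input(handle, &in);
>>>>>>>
>>>>>>>     /* expose a local buffer and pull the client's data into it */
>>>>>>>     margo_bulk_create(mid, 1, &buf, &size, HG_BULK_WRITE_ONLY,
>>>>>>>                       &local_bulk);
>>>>>>>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk, 0,
>>>>>>>                         local_bulk, 0, size);
>>>>>>>
>>>>>>>     margo_respond(handle, &out);
>>>>>>>
>>>>>>>     margo_bulk_free(local_bulk);
>>>>>>>     margo_free_input(handle, &in);
>>>>>>>     margo_destroy(handle);
>>>>>>> }
>>>>>>> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>>>>>>> ```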
>>>>>>>
>>>>>>> I got margo profiling results for this test case:
>>>>>>>
>>>>>>> ```
>>>>>>> 3
>>>>>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>>>>>> 0xa2a1,term_rpc
>>>>>>> 0x27b5,put_rpc
>>>>>>> 0xd320,__shutdown__
>>>>>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>>>>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>>>>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>>>>>> ```
>>>>>>>
>>>>>>> So I guess my question at this point is, is there any sensible
>>>>>>> reason why the 4th and 8th RPC sent would have a long response
>>>>>>> time? I think I’ve cleared my code on the client side and server
>>>>>>> side, so it appears to be latency being introduced by Margo,
>>>>>>> LibFabric, Argobots, or the underlying OS. I do see long
>>>>>>> timesteps occasionally after this (perhaps every 20-30
>>>>>>> timesteps) but these are not consistent.
>>>>>>>
>>>>>>> One last detail: this does not happen on Summit. On Summit, I
>>>>>>> see about 5-7x worse single-RPC performance (250-350
>>>>>>> microseconds per RPC), but without the intermittent long timesteps.
>>>>>>>
>>>>>>> I can provide the minimal test case if it would be helpful. I am
>>>>>>> using APEX for timing results, and the following dependencies
>>>>>>> with Spack:
>>>>>>>
>>>>>>> argobots@1.1 json-c@0.15 libfabric@1.13.1 mercury@2.0.1
>>>>>>> mochi-margo@0.9.5 rdma-core@20
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Philip
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mochi-devel mailing list
>>>>>>> mochi-devel at lists.mcs.anl.gov
>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>>>> _______________________________________________
>>>>>> mochi-devel mailing list
>>>>>> mochi-devel at lists.mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>>> https://www.mcs.anl.gov/research/projects/mochi
>>>>>
>>>
>