[Mochi-devel] RPC latency question
Phil Carns
carns at mcs.anl.gov
Tue Nov 2 10:48:57 CDT 2021
Awesome, thanks Philip. It came through fine.
I started by modifying the job script slightly to just run it on my
laptop with sm (I wanted to make sure I understood the test case, and
how to use APEX, before trying elsewhere). Does in.size need to be set
in client.c? For me that field contains a random value, which causes the
encoder on the forward to attempt a very large allocation. The same might
be true of gdim_size if it got past that step. I started to alter them,
but then I wasn't sure what the implications were.
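Something along these lines is what I started to try before forwarding,
but I wanted to check with you first (just a sketch; put_in_t, handle,
and the field names are placeholders for whatever the reproducer
actually uses):

```
#include <string.h>   /* memset */

/* Sketch: zero the input struct so the Mercury encoder never sees an
 * uninitialized length, then set only the fields the test needs.
 * put_in_t, handle, and req are placeholders from the reproducer. */
put_in_t in;
memset(&in, 0, sizeof(in));
in.size = 8;        /* bytes the server will pull via bulk transfer */
in.gdim_size = 0;   /* no global dimensions for this test */

margo_request req;
margo_iforward(handle, &in, &req);
```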
(fwiw I needed to include stdlib.h in common.h, but I've hit that a
couple of times recently in code that didn't previously generate
warnings; I think something in Ubuntu has gotten stricter about that
recently).
thanks,
-Phil
On 11/1/21 4:51 PM, Philip Davis wrote:
> Hi Phil,
>
> I’ve attached the reproducer. I see the 4th and 8th issue on Frontera,
> but not Summit. Hopefully it will build and run without too much
> modification. Let me know if there are any issues with running it (or
> if the anl listserv eats the tarball, which I kind of expect).
>
> Thanks,
> Philip
>
>
>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Hi Philip,
>>
>> (FYI I think the first image didn't come through in your email, but I
>> think the others are sufficient to get across what you are seeing)
>>
>> I don't have any idea what would cause that. The recently released
>> libfabric 1.13.2 (available in spack from the mochi-spack-packages
>> repo) includes some fixes to the rxm provider that could be relevant
>> to Frontera and Summit, but nothing that aligns with what you are
>> observing.
>>
>> If it were later in the sequence (much later) I would speculate that
>> memory allocation/deallocation cycles were eventually causing a
>> hiccup. We've seen something like that in the past, and it's a
>> theory that we could then test with alternative allocators like
>> jemalloc. That's not memory allocation jitter that early in the run
>> though.
>>
>> Please do share your reproducer if you don't mind! We can try a few
>> systems here to at least isolate if it is something peculiar to the
>> InfiniBand path or if there is a more general problem in Margo.
>>
>> thanks,
>>
>> -Phil
>>
>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>> Hello,
>>>
>>> I apologize in advance for the winding nature of this email. I’m not
>>> sure how to ask my question without explaining the story of my
>>> results some.
>>>
>>> I’m doing some characterization of our server performance under
>>> load, and I have a performance quirk that I wanted to run by
>>> people to see if it makes sense. My testing so far has been to
>>> iteratively send batches of RPCs using margo_iforward, and then
>>> measure the wait time until they are all complete. On the server
>>> side, handling the RPC includes a margo_bulk_transfer as a pull
>>> initiated on the server to pull (for now) 8 bytes. The payload of
>>> the RPC request is about 500 bytes, and the response payload is 4
>>> bytes.
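>>>
>>> For concreteness, the client-side pattern per timestep looks roughly
>>> like this (a sketch with placeholder names, not the exact code; BATCH,
>>> put_rpc_id, put_in_t, and put_out_t stand in for the real ones):
>>>
>>> ```
>>> /* One timestep: issue a batch of iforwards, then time how long it
>>>  * takes for all of them to complete. */
>>> hg_handle_t handles[BATCH];
>>> margo_request reqs[BATCH];
>>>
>>> double start = ABT_get_wtime();
>>> for (int i = 0; i < BATCH; i++) {
>>>     margo_create(mid, server_addr, put_rpc_id, &handles[i]);
>>>     margo_iforward(handles[i], &in, &reqs[i]);
>>> }
>>> for (int i = 0; i < BATCH; i++) {
>>>     put_out_t out;
>>>     margo_wait(reqs[i]);                  /* block until RPC i completes */
>>>     margo_get_output(handles[i], &out);
>>>     margo_free_output(handles[i], &out);
>>>     margo_destroy(handles[i]);
>>> }
>>> double wait_time = ABT_get_wtime() - start;
>>> ```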
>>>
>>> I’ve isolated my results down to one server rank and one client
>>> rank, because it’s an easier starting point to reason from. Below is
>>> a graph of some of my initial results. These results are from
>>> Frontera. The median times are good (a single RPC takes on the order
>>> of 10s of microseconds, which seems fantastic). However, the
>>> outliers are fairly high (note the log scale of the y-axis). With
>>> only one RPC per timestep, for example, there is a 30x spread
>>> between the median and the max.
>>>
>>>
>>> I was hoping (expecting) the first timestep would be where the long
>>> replies resided, but that turned out not to be the case. Below are
>>> traces from the 1 RPC (blue) and 2 RPC (orange) per timestep cases,
>>> 5 trials of 10 timesteps for each case (normalized to fit the same
>>> x-axis):
>>>
>>> <PastedGraphic-6.png>
>>>
>>> What strikes me is how consistent these results are across trials.
>>> For the 1 RPC per timestep case, the 3rd and 7th timesteps are
>>> consistently slow (and the rest are fast). For the 2 RPC per
>>> timestep case, the 2nd and 4th timesteps are always slow and
>>> sometimes the 10th is. These results are repeatable with very rare
>>> variation.
>>>
>>> For the single RPC case, I recorded some timers on the server side,
>>> and attempted to overlay them with the client side (there is some
>>> unknown offset, but probably on the order of 10s of microseconds at
>>> worst, given the pattern):
>>>
>>> <PastedGraphic-7.png>
>>>
>>> I blew up the first few timesteps of one of the trials:
>>> <PastedGraphic-8.png>
>>>
>>> The different colors are different segments of the handler, but
>>> there doesn’t seem to be anything too interesting going on inside
>>> the handler. So it looks like the time is being introduced before
>>> the 3rd RPC handler starts, based on where the gap appears on
>>> the server side.
>>>
>>> To try and isolate any dataspaces-specific behavior, I created a
>>> pure Margo test case that just sends a single RPC of the same size
>>> as dataspaces iteratively, where the server side does an 8-byte bulk
>>> transfer initiated by the server and sends a response. The results
>>> are similar, except that it is now the 4th and 8th timesteps that are
>>> slow (and the first timestep is VERY long, presumably because rxm
>>> communication state is being established; DataSpaces has an earlier
>>> RPC in its init that was absorbing this latency).
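>>>
>>> The handler in that pure Margo test looks roughly like this (a sketch
>>> with placeholder names; in.bulk and the output type are assumptions,
>>> not the exact reproducer code):
>>>
>>> ```
>>> /* Sketch of the server-side handler: pull 8 bytes from the client's
>>>  * exposed buffer, then send a small response. */
>>> static void put_rpc_ult(hg_handle_t handle)
>>> {
>>>     put_in_t in;
>>>     int32_t resp = 0;
>>>     char buf[8];
>>>     void *ptr = buf;
>>>     hg_size_t sz = sizeof(buf);
>>>
>>>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>>>     const struct hg_info *info = margo_get_info(handle);
>>>     margo_get_input(handle, &in);
>>>
>>>     hg_bulk_t local_bulk;
>>>     margo_bulk_create(mid, 1, &ptr, &sz, HG_BULK_WRITE_ONLY, &local_bulk);
>>>     /* server-initiated pull from the client's bulk handle */
>>>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr,
>>>                         in.bulk, 0, local_bulk, 0, sz);
>>>
>>>     margo_respond(handle, &resp);
>>>
>>>     margo_bulk_free(local_bulk);
>>>     margo_free_input(handle, &in);
>>>     margo_destroy(handle);
>>> }
>>> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>>> ```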
>>>
>>> I got margo profiling results for this test case:
>>>
>>> ```
>>> 3
>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>> 0xa2a1,term_rpc
>>> 0x27b5,put_rpc
>>> 0xd320,__shutdown__
>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>> ```
>>>
>>> So I guess my question at this point is, is there any sensible
>>> reason why the 4th and 8th RPC sent would have a long response time?
>>> I think I've ruled out my own code on both the client and server side,
>>> so it appears the latency is being introduced by Margo, libfabric,
>>> Argobots, or the underlying OS. I do see long timesteps occasionally
>>> after this (perhaps every 20-30 timesteps) but these are not consistent.
>>>
>>> One last detail: this does not happen on Summit. On Summit, I see
>>> about 5-7x worse single-RPC performance (250-350 microseconds per
>>> RPC), but without the intermittent long timesteps.
>>>
>>> I can provide the minimal test case if it would be helpful. I am
>>> using APEX for timing results, and the following dependencies with
>>> Spack:
>>>
>>> argobots@1.1  json-c@0.15  libfabric@1.13.1  mercury@2.0.1
>>> mochi-margo@0.9.5  rdma-core@20
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>>
>>> _______________________________________________
>>> mochi-devel mailing list
>>> mochi-devel at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>> https://www.mcs.anl.gov/research/projects/mochi
>> _______________________________________________
>> mochi-devel mailing list
>> mochi-devel at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>> https://www.mcs.anl.gov/research/projects/mochi
>