[Mochi-devel] RPC latency question

Phil Carns carns at mcs.anl.gov
Tue Nov 2 10:48:57 CDT 2021


Awesome, thanks Philip.  It came through fine.

I started by modifying the job script slightly to just run it on my 
laptop with sm (I wanted to make sure I understood the test case, and 
how to use APEX, before trying elsewhere).  Does in.size need to be set 
in client.c?  For me that field contains a random value, and it is 
causing the encoder on the forward to attempt a very large allocation.  
The same might be true of gdim_size if it got past that step.  I started 
to alter them, but then I wasn't sure what the implications were.
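
For reference, this is roughly the initialization I was expecting to 
see in client.c before the forward (just a sketch on my part, not code 
from your reproducer; put_in_t, payload_size, and handle are 
placeholder names):

```
/* sketch only: zero the input struct and set the size fields explicitly
 * so the encoder never sees uninitialized values
 * (put_in_t, payload_size, and handle are placeholders) */
put_in_t in;
memset(&in, 0, sizeof(in));   /* <string.h> */
in.size = payload_size;       /* the intended payload size */
in.gdim_size = 0;             /* or the real gdim length, if one is intended */

margo_request req;
hg_return_t hret = margo_iforward(handle, &in, &req);
assert(hret == HG_SUCCESS);   /* <assert.h> */
```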

(FWIW, I needed to include stdlib.h in common.h, but I've hit that a 
couple of times recently on codes that didn't previously generate 
warnings; I think something in Ubuntu has gotten stricter about that.)
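
In case it is useful, the only change that required was at the top of 
common.h:

```
/* common.h */
#include <stdlib.h>   /* previously picked up transitively; newer Ubuntu
                         toolchains seem stricter about this */
```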

thanks,

-Phil


On 11/1/21 4:51 PM, Philip Davis wrote:
> Hi Phil,
>
> I’ve attached the reproducer. I see the 4th and 8th timestep issue on 
> Frontera, but not on Summit. Hopefully it will build and run without 
> too much modification. Let me know if there are any issues running it 
> (or if the ANL listserv eats the tarball, which I kind of expect).
>
> Thanks,
> Philip
>
>
>> On Nov 1, 2021, at 11:14 AM, Phil Carns <carns at mcs.anl.gov> wrote:
>>
>> Hi Philip,
>>
>> (FYI I think the first image didn't come through in your email, but I 
>> think the others are sufficient to get across what you are seeing)
>>
>> I don't have any idea what would cause that.  The recently released 
>> libfabric 1.13.2 (available in spack from the mochi-spack-packages 
>> repo) includes some fixes to the rxm provider that could be relevant 
>> to Frontera and Summit, but nothing that aligns with what you are 
>> observing.
>>
>> If it were later in the sequence (much later) I would speculate that 
>> memory allocation/deallocation cycles were eventually causing a 
>> hiccup.  We've seen something like that in the past, and it's a 
>> theory we could test with alternative allocators like jemalloc.  But 
>> that doesn't look like allocation jitter, that early in the run.
>>
>> Please do share your reproducer if you don't mind!  We can try a few 
>> systems here to at least isolate if it is something peculiar to the 
>> InfiniBand path or if there is a more general problem in Margo.
>>
>> thanks,
>>
>> -Phil
>>
>> On 10/29/21 3:20 PM, Philip Davis wrote:
>>> Hello,
>>>
>>> I apologize in advance for the winding nature of this email. I’m not 
>>> sure how to ask my question without explaining the story of my 
>>> results some.
>>>
>>> I’m doing some characterization of our server performance under 
>>> load, and I have a quirk of performance that I wanted to run by 
>>> people to see if they make sense. My testing so far has been to 
>>> iteratively send batches of RPCs using margo_iforward, and then 
>>> measure the wait time until they are all complete. On the server 
>>> side, handling the RPC includes a margo_bulk_transfer initiated on 
>>> the server to pull (for now) 8 bytes. The payload of 
>>> the RPC request is about 500 bytes, and the response payload is 4 
>>> bytes.
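>>>
>>> In rough terms, each client timestep looks like the sketch below 
>>> (simplified and not the actual test code; BATCH, put_rpc_id, mid, 
>>> server_addr, and the input struct in are placeholders):
>>>
>>> ```
>>> /* sketch of one client timestep: issue a batch of RPCs with
>>>  * margo_iforward() and time how long the batch takes to complete */
>>> hg_handle_t handles[BATCH];
>>> margo_request reqs[BATCH];
>>>
>>> double t0 = ABT_get_wtime();
>>> for (int i = 0; i < BATCH; i++) {
>>>     margo_create(mid, server_addr, put_rpc_id, &handles[i]);
>>>     margo_iforward(handles[i], &in, &reqs[i]);  /* ~500-byte request */
>>> }
>>> for (int i = 0; i < BATCH; i++) {
>>>     margo_wait(reqs[i]);                        /* ~4-byte response  */
>>>     margo_destroy(handles[i]);
>>> }
>>> double wait_time = ABT_get_wtime() - t0;
>>> ```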
>>>
>>> I’ve isolated my results down to one server rank and one client 
>>> rank, because it’s an easier starting point to reason from. Below is 
>>> a graph of some of my initial results. These results are from 
>>> Frontera. The median times are good (a single RPC takes on the order 
>>> of 10s of microseconds, which seems fantastic). However, the 
>>> outliers are fairly high (note the log scale of the y-axis). With 
>>> only one RPC per timestep, for example, there is a 30x spread 
>>> between the median and the max.
>>>
>>>
>>> I was hoping (expecting) that the first timestep would be where the 
>>> long replies resided, but that turned out not to be the case. Below 
>>> are traces from the 1 RPC (blue) and 2 RPC (orange) per timestep 
>>> cases, 5 trials of 10 timesteps for each case (normalized to fit the 
>>> same x-axis):
>>>
>>> <PastedGraphic-6.png>
>>>
>>> What strikes me is how consistent these results are across trials. 
>>> For the 1 RPC per timestep case, the 3rd and 7th timestep are 
>>> consistently slow (and the rest are fast). For the 2 RPC per 
>>> timestep case, the 2nd and 4th timestep are always slow and 
>>> sometimes the 10th is. These results are repeatable with very rare 
>>> variation.
>>>
>>> For the single RPC case, I recorded some timers on the server side, 
>>> and attempted to overlay them with the client side (there is some 
>>> unknown offset, but probably on the order of 10s of microseconds at 
>>> worst, given the pattern):
>>>
>>> <PastedGraphic-7.png>
>>>
>>> Zooming in on the first few timesteps of one of the trials:
>>> <PastedGraphic-8.png>
>>>
>>> The different colors are different segments of the handler, but 
>>> there doesn’t seem to be anything too interesting going on inside 
>>> the handler. So it looks like the time is being introduced before 
>>> the 3rd RPC handler starts, based on where the gap appears on 
>>> the server side.
>>>
>>> To try and isolate any dataspaces-specific behavior, I created a 
>>> pure Margo test case that iteratively sends a single RPC of the same 
>>> size as DataSpaces, where the server side does an 8-byte bulk 
>>> transfer initiated by the server and sends a response. The results 
>>> are similar, except that it is now the 4th and 8th timestep that are 
>>> slow (and the first timestep is VERY long, presumably because rxm 
>>> communication state is being established. DataSpaces has an earlier 
>>> RPC in its init that was absorbing this latency).
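>>>
>>> The handler in that test case has roughly the following shape (a 
>>> trimmed sketch rather than the exact reproducer code; put_in_t, 
>>> put_out_t, and the in.bulk field are placeholders for the registered 
>>> types):
>>>
>>> ```
>>> /* sketch of the put_rpc handler: server-initiated 8-byte pull, then a
>>>  * small response (error handling omitted; requires <margo.h>) */
>>> static void put_rpc_ult(hg_handle_t handle)
>>> {
>>>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>>>     const struct hg_info *info = margo_get_info(handle);
>>>
>>>     put_in_t in;                   /* placeholder input type */
>>>     margo_get_input(handle, &in);
>>>
>>>     char buf[8];
>>>     void *buf_ptr = buf;
>>>     hg_size_t buf_size = sizeof(buf);
>>>     hg_bulk_t local_bulk;
>>>     margo_bulk_create(mid, 1, &buf_ptr, &buf_size, HG_BULK_WRITE_ONLY,
>>>                       &local_bulk);
>>>
>>>     /* pull 8 bytes from the region the client exposed in in.bulk */
>>>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk, 0,
>>>                         local_bulk, 0, buf_size);
>>>
>>>     put_out_t out = {0};           /* ~4-byte response payload */
>>>     margo_respond(handle, &out);
>>>
>>>     margo_bulk_free(local_bulk);
>>>     margo_free_input(handle, &in);
>>>     margo_destroy(handle);
>>> }
>>> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>>> ```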
>>>
>>> I got margo profiling results for this test case:
>>>
>>> ```
>>> 3
>>> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
>>> 0xa2a1,term_rpc
>>> 0x27b5,put_rpc
>>> 0xd320,__shutdown__
>>> 0x27b5 ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>> 0x27b5 ,0;0.041241646,200.000000000, 0;
>>> 0xa2a1 ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
>>> 0xa2a1 ,0;0.000009298,1.000000000, 0;
>>> ```
>>>
>>> So I guess my question at this point is: is there any sensible 
>>> reason why the 4th and 8th RPC sent would have a long response time? 
>>> I think I’ve ruled out my own code on both the client and server 
>>> side, so the latency appears to be introduced by Margo, libfabric, 
>>> Argobots, or the underlying OS. I do see long timesteps occasionally 
>>> after this (perhaps every 20-30 timesteps), but those are not consistent.
>>>
>>> One last detail: this does not happen on Summit. There, I see about 
>>> 5-7x worse single-RPC performance (250-350 microseconds per RPC), 
>>> but without the intermittent long timesteps.
>>>
>>> I can provide the minimal test case if it would be helpful. I am 
>>> using APEX for timing results, and the following dependencies with 
>>> Spack:
>>>
>>> argobots@1.1  json-c@0.15  libfabric@1.13.1  mercury@2.0.1  
>>> mochi-margo@0.9.5  rdma-core@20
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>>
>>> _______________________________________________
>>> mochi-devel mailing list
>>> mochi-devel at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>> https://www.mcs.anl.gov/research/projects/mochi
>> _______________________________________________
>> mochi-devel mailing list
>> mochi-devel at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>> https://www.mcs.anl.gov/research/projects/mochi
>