[Mochi-devel] RPC latency question

Phil Carns carns at mcs.anl.gov
Mon Nov 1 10:14:52 CDT 2021


Hi Philip,

(FYI, I think the first image didn't come through in your email, but the 
others are sufficient to get across what you are seeing.)

I don't have any idea what would cause that.  The recently released 
libfabric 1.13.2 (available in spack from the mochi-spack-packages repo) 
includes some fixes to the rxm provider that could be relevant to 
Frontera and Summit, but nothing that aligns with what you are observing.

If it were later in the sequence (much later) I would speculate that 
memory allocation/deallocation cycles were eventually causing a hiccup.  
We've seen something like that in the past, and it's a theory that we 
could then test with alternative allocators like jemalloc.  But jitter 
that early in the run seems unlikely to be caused by memory allocation.

Please do share your reproducer if you don't mind!  We can try a few 
systems here to at least isolate if it is something peculiar to the 
InfiniBand path or if there is a more general problem in Margo.

thanks,

-Phil

On 10/29/21 3:20 PM, Philip Davis wrote:
> Hello,
>
> I apologize in advance for the winding nature of this email. I’m not 
> sure how to ask my question without explaining the story of my results 
> some.
>
> I’m doing some characterization of our server performance under load, 
> and I have a performance quirk that I wanted to run by people to see 
> if it makes sense. My testing so far has been to iteratively send 
> batches of RPCs using margo_iforward, and then measure the wait time 
> until they are all complete. On the server side, handling the RPC 
> includes a margo_bulk_transfer, a server-initiated pull of (for now) 
> 8 bytes. The payload of the RPC request is about 500 bytes, and the 
> response payload is 4 bytes.
>
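> To make the setup concrete, here is a stripped-down sketch of what the 
> client does each timestep (a minimal sketch only; the RPC name, types, 
> and metadata contents are illustrative placeholders rather than the 
> actual DataSpaces code):
>
> ```
> /* sketch_client.c -- simplified stand-in for the benchmark loop.
>  * Build against margo, e.g.: cc sketch_client.c $(pkg-config --cflags --libs margo) */
> #include <stdio.h>
> #include <stdint.h>
> #include <margo.h>
> #include <mercury_macros.h>
> #include <mercury_proc_string.h>
>
> /* request carries a bulk handle plus metadata (~500 bytes on the wire in
>  * the real code); response is a single 4-byte return code */
> MERCURY_GEN_PROC(put_in_t, ((hg_bulk_t)(bulk))((hg_string_t)(meta)))
> MERCURY_GEN_PROC(put_out_t, ((int32_t)(ret)))
>
> int main(int argc, char **argv)
> {
>     if (argc < 2) return -1;
>     const char *svr_str = argv[1];  /* server address, e.g. "ofi+verbs;ofi_rxm://..." */
>     int rpcs_per_ts = 1, timesteps = 10;
>
>     margo_instance_id mid = margo_init("ofi+verbs;ofi_rxm", MARGO_CLIENT_MODE, 0, 0);
>     hg_id_t put_id = MARGO_REGISTER(mid, "put_rpc", put_in_t, put_out_t, NULL);
>     hg_addr_t svr_addr;
>     margo_addr_lookup(mid, svr_str, &svr_addr);
>
>     /* 8-byte region that the server will pull from */
>     uint64_t data = 0;
>     void *buf = &data;
>     hg_size_t len = sizeof(data);
>     put_in_t in = {0};
>     margo_bulk_create(mid, 1, &buf, &len, HG_BULK_READ_ONLY, &in.bulk);
>     in.meta = (char *)"placeholder metadata";
>
>     for (int ts = 0; ts < timesteps; ts++) {
>         hg_handle_t  handles[rpcs_per_ts];
>         margo_request reqs[rpcs_per_ts];
>
>         double t0 = ABT_get_wtime();
>         /* issue the whole batch without blocking... */
>         for (int i = 0; i < rpcs_per_ts; i++) {
>             margo_create(mid, svr_addr, put_id, &handles[i]);
>             margo_iforward(handles[i], &in, &reqs[i]);
>         }
>         /* ...then wait for all of the responses */
>         for (int i = 0; i < rpcs_per_ts; i++) {
>             put_out_t out;
>             margo_wait(reqs[i]);
>             margo_get_output(handles[i], &out);
>             margo_free_output(handles[i], &out);
>             margo_destroy(handles[i]);
>         }
>         printf("timestep %d wait: %.6f s\n", ts, ABT_get_wtime() - t0);
>     }
>
>     margo_bulk_free(in.bulk);
>     margo_addr_free(mid, svr_addr);
>     margo_finalize(mid);
>     return 0;
> }
> ```
>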
> I’ve isolated my results down to one server rank and one client rank, 
> because it’s an easier starting point to reason from. Below is a graph 
> of some of my initial results. These results are from Frontera. The 
> median times are good (a single RPC takes on the order of 10s of 
> microseconds, which seems fantastic). However, the outliers are fairly 
> high (note the log scale of the y-axis). With only one RPC per 
> timestep, for example, there is a 30x spread between the median and 
> the max.
>
>
> I was hoping (expecting) the first timestep would be where the long 
> replies resided, but that turned out not to be the case. Below are 
> traces from the 1 RPC (blue) and 2 RPC (orange) per timestep cases, 5 
> trials of 10 timesteps for each case (normalized to fit the same x-axis):
>
>
> What strikes me is how consistent these results are across trials. For 
> the 1 RPC per timestep case, the 3rd and 7th timesteps are 
> consistently slow (and the rest are fast). For the 2 RPC per timestep 
> case, the 2nd and 4th timesteps are always slow, and sometimes the 10th 
> is. These results are repeatable, with very rare variation.
>
> For the single RPC case, I recorded some timers on the server side 
> and attempted to overlay them with the client-side timeline (there is 
> some unknown clock offset between the two, but it is probably on the 
> order of 10s of microseconds at worst, given the pattern):
>
>
> I blew up the first few timesteps of one of the trials:
>
> The different colors are different segments of the handler, but there 
> doesn’t seem to be anything too interesting going on inside the 
> handler. So it looks like the time is being introduced before the 3rd 
> RPC handler starts, based on where the gap appears on the server side.
>
> To try to isolate any DataSpaces-specific behavior, I created a pure 
> Margo test case that just sends a single RPC of the same size as 
> DataSpaces iteratively, where the server side does an 8-byte bulk 
> transfer initiated by the server, and sends a response. The results 
> are similar, except that it is now the 4th and 8th timestep that are 
> slow (and the first timestep is VERY long, presumably because rxm 
> communication state is being established; DataSpaces has an earlier 
> RPC in its init that was absorbing this latency).
>
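> For reference, the server-side handler in that pure Margo test is 
> essentially the following (again a sketch with illustrative names; it 
> just pulls 8 bytes from the client and responds):
>
> ```
> /* Server-side handler sketch: pull 8 bytes from the region exposed by
>  * the client, then send the 4-byte response. */
> #include <stdint.h>
> #include <margo.h>
> #include <mercury_macros.h>
> #include <mercury_proc_string.h>
>
> /* same request/response types as the client sketch earlier in this message */
> MERCURY_GEN_PROC(put_in_t, ((hg_bulk_t)(bulk))((hg_string_t)(meta)))
> MERCURY_GEN_PROC(put_out_t, ((int32_t)(ret)))
>
> static void put_rpc_ult(hg_handle_t handle)
> {
>     margo_instance_id mid = margo_hg_handle_get_instance(handle);
>     const struct hg_info *info = margo_get_info(handle);
>
>     put_in_t in;
>     margo_get_input(handle, &in);
>
>     /* local 8-byte buffer and bulk handle for the server-initiated pull */
>     uint64_t data = 0;
>     void *buf = &data;
>     hg_size_t len = sizeof(data);
>     hg_bulk_t local_bulk;
>     margo_bulk_create(mid, 1, &buf, &len, HG_BULK_WRITE_ONLY, &local_bulk);
>
>     /* pull 8 bytes from the client (origin) into the local buffer */
>     margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk, 0,
>                         local_bulk, 0, len);
>
>     put_out_t out = { .ret = 0 };   /* 4-byte response payload */
>     margo_respond(handle, &out);
>
>     margo_bulk_free(local_bulk);
>     margo_free_input(handle, &in);
>     margo_destroy(handle);
> }
> DEFINE_MARGO_RPC_HANDLER(put_rpc_ult)
>
> /* registered at server startup with something like:
>  *   MARGO_REGISTER(mid, "put_rpc", put_in_t, put_out_t, put_rpc_ult);
>  */
> ```
>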
> I got margo profiling results for this test case:
>
> ```
> 3
> 18446744025556676964,ofi+verbs;ofi_rxm://192.168.72.245:39573
> 0xa2a1,term_rpc
> 0x27b5,put_rpc
> 0xd320,__shutdown__
> 0x27b5 
> ,0.000206208,10165,18446744027256353016,0,0.041241646,0.000045538,0.025733232,200,18446744073709551615,286331153,0,18446744073709551615,286331153,0
> 0x27b5 ,0;0.041241646,200.000000000, 0;
> 0xa2a1 
> ,0.000009298,41633,18446744027256353016,0,0.000009298,0.000009298,0.000009298,1,18446744073709551615,286331153,0,18446744073709551615,286331153,0
> 0xa2a1 ,0;0.000009298,1.000000000, 0;
> ```
>
> So I guess my question at this point is: is there any sensible reason 
> why the 4th and 8th RPC sent would have a long response time? I think 
> I’ve ruled out my own code on both the client side and the server side, 
> so the latency appears to be introduced by Margo, LibFabric, Argobots, 
> or the underlying OS. I do see long timesteps occasionally after this 
> (perhaps every 20-30 timesteps), but those are not consistent.
>
> One last detail: this does not happen on Summit. On Summit, I see 
> about 5-7x worse single-RPC performance (250-350 microseconds per 
> RPC), but without the intermittent long timesteps.
>
> I can provide the minimal test case if it would be helpful. I am using 
> APEX for timing results, and the following dependencies with Spack:
>
> argobots@1.1  json-c@0.15  libfabric@1.13.1  mercury@2.0.1 
> mochi-margo@0.9.5  rdma-core@20
>
> Thanks,
> Philip
>
>
>
> _______________________________________________
> mochi-devel mailing list
> mochi-devel at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
> https://www.mcs.anl.gov/research/projects/mochi

[The list archive scrubbed the attachments from this message: an HTML copy 
and the figures referenced above.]
  PastedGraphic-6.png (58798 bytes): <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20211101/40a2abea/attachment-0003.png>
  PastedGraphic-7.png (65334 bytes): <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20211101/40a2abea/attachment-0004.png>
  PastedGraphic-8.png (58142 bytes): <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20211101/40a2abea/attachment-0005.png>
  HTML copy: <http://lists.mcs.anl.gov/pipermail/mochi-devel/attachments/20211101/40a2abea/attachment-0001.html>

