[Mochi-devel] Margo + sockets provider inexplicably slow

Clement Barthelemy clement.barthelemy at inria.fr
Wed Sep 8 12:47:18 CDT 2021


Hi Phil,

I'll probably run some more tests with libfabric 1.13 to see if I can get more useful information to the libfabric developers.

In the meantime, I think your suggestion is reasonable; I've seen at least one other project using Margo that pins a specific libfabric version on installation.

Thanks,

Clément


----- Original Message -----
> From: "Phil Carns" <carns at mcs.anl.gov>
> To: "Clement Barthelemy" <clement.barthelemy at inria.fr>
> Cc: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
> Sent: Wednesday, September 8, 2021 19:27:14
> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow

> Thanks Clément (both for confirming the behavior and for reaching out to
> the libfabric developers).
> 
> I was kind of waiting to see if we got a response there but no such luck
> yet :)
> 
> We are kind of stuck here because my impression is that the sockets
> provider is at this point intended mostly as a reference implementation
> (and thus won't get much tuning on fundamental things like threading and
> wait object performance).  On the other hand the obvious replacement is
> the tcp/rxm provider stack, but the tcp provider has some notable
> deadlock bugs in the 1.13.x libfabric releases.
> 
> I know this isn't a very satisfying answer, but I think I would probably
> recommend using tcp/rxm with libfabric 1.12.x for now if you need TCP/IP
> support.
> 
> To our knowledge the other providers in the 1.13.x releases are fine.
> 
> thanks,
> 
> -Phil
> 
> 
> On 9/2/21 10:16 AM, Clement Barthelemy wrote:
>> Ok that did the trick, I'm now reaching the same order of magnitude with the
>> sockets provider busy-looping: 5.90e-5 s/RPC.
>>
>> Is there anything I can do to help fix this? Do you need a bug report or
>> something?
>>
>> Clément
>>
>>
>> ----- Original Message -----
>>> From: "Clement Barthelemy" <clement.barthelemy at inria.fr>
>>> To: "Phil Carns" <carns at mcs.anl.gov>
>>> Cc: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
>>> Sent: Thursday, September 2, 2021 15:53:22
>>> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow
>>> Hi Phil,
>>>
>>> I wrote a test with a simple RPC that takes an integer argument and returns an
>>> integer. The client sends it 100,000 times and I simply use the Unix time
>>> command on it. My reasoning was that this would not saturate the bandwidth and
>>> I'd be able to see the latency.
>>>
>>> Thanks for the advice, I'll try the busy-polling and report back.
>>>
>>> Clément
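A minimal sketch of the measurement described above: time N back-to-back calls and divide the elapsed time by N. It uses no Margo or Mercury API. `fake_rpc` is a hypothetical stand-in for the real integer-in, integer-out RPC, and the other function names are illustrative as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Mean latency in seconds per call, given the total elapsed time. */
static double seconds_per_rpc(double total_seconds, long n_rpcs)
{
    assert(n_rpcs > 0);
    return total_seconds / (double)n_rpcs;
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for one RPC round trip (int in, int out). */
static int fake_rpc(int x)
{
    return x + 1;
}

/* Issue n_rpcs sequential calls and return the mean latency per call. */
static double measure_latency(long n_rpcs)
{
    volatile int sink = 0; /* keeps the compiler from dropping the loop */
    double start = now_seconds();
    for (long i = 0; i < n_rpcs; i++)
        sink = fake_rpc((int)i);
    (void)sink;
    return seconds_per_rpc(now_seconds() - start, n_rpcs);
}
```

As an illustration of the arithmetic only, a run of 100,000 calls that takes 5.9 s in total works out to 5.9e-5 s per call. Timing inside the program, as sketched here, leaves out process startup and initialization, which the time command would include.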
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Phil Carns" <carns at mcs.anl.gov>
>>>> To: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
>>>> Sent: Thursday, September 2, 2021 15:15:35
>>>> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow
>>>> Hi Clément,
>>>> What benchmark are you using to generate these numbers?
>>>> My first guess would be a difference in polling strategy (how frequently
>>>> HG_Progress() is being called, and with what timeout value), and how well the
>>>> provider handles that.
>>>> One quick and dirty way to test this theory would be to set
>>>> margo_init_info -> hg_init_info -> na_init_info -> progress_mode
>>>> to NA_NO_BLOCK before initializing margo with margo_init_ext().
>>>> There is an example that does this here:
>>>> https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-latency.c#L105
>>>> That's tedious from an API perspective, but the reason it might be informative
>>>> as a quick hack is that it will force Mercury to busy poll on the underlying
>>>> transport no matter what margo or other callers are actually asking it to do.
>>>> It effectively short circuits any higher level polling strategy decisions.
>>>> Depending on what that tells us, we can go from there. I suspect that the
>>>> sockets provider mechanism for waiting for events (as opposed to just polling
>>>> for events) might be problematic.
>>>> thanks,
>>>> -Phil
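The difference between waiting for events and busy polling for them can be illustrated with plain POSIX poll(2) on a file descriptor. This is a generic sketch, not how Mercury or the libfabric sockets provider implement progress; the function names `poll_readable` and `busy_poll_readable` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Check whether fd is readable, sleeping in the kernel for at most
 * timeout_ms milliseconds (0 means return immediately).
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int poll_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Busy polling: repeat the zero-timeout check without ever sleeping,
 * giving up after max_spins attempts.
 * Returns 1 if readable, 0 if the spin budget ran out, -1 on error. */
static int busy_poll_readable(int fd, long max_spins)
{
    for (long i = 0; i < max_spins; i++) {
        int rc = poll_readable(fd, 0);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

Calling `poll_readable(fd, 100)` lets the caller sleep until data arrives, so its latency depends on how quickly the wait mechanism wakes it up. `busy_poll_readable` trades CPU time for never depending on that wake-up path, which is the trade-off the NA_NO_BLOCK experiment is meant to expose.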
>>>> On 9/2/21 8:29 AM, Clement Barthelemy wrote:
>>>>> Hello all,
>>>>> I did some latency measurement to compare Mercury and Margo with different
>>>>> providers, the results are below:
>>>>>                    Mercury (s/RPC)  Margo (s/RPC)
>>>>> ofi+psm2            6.21e-5          5.01e-5
>>>>> ofi+tcp;ofi_rxm     8.20e-5          9.55e-5
>>>>> ofi+sockets         7.54e-5          2.08e-2 (!)
>>>>> As you can see, Margo with the sockets provider is about 250 times slower than the
>>>>> rest. I first suspected libfabric, but Mercury does not have the problem. Do
>>>>> you know what could be causing this?
>>>>> I've tested with Margo 0.9.5, Mercury 2.0.1 and libfabric 1.12.1 & 1.13.1.
>>>>> Thanks,
>>>>> Clément
>>>>> _______________________________________________
>>>>> mochi-devel mailing list
>>>>> mochi-devel at lists.mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>> https://www.mcs.anl.gov/research/projects/mochi

