[Mochi-devel] Margo + sockets provider inexplicably slow

Phil Carns carns at mcs.anl.gov
Wed Sep 8 13:15:12 CDT 2021


OK.  If you dig further into the newer libfabric releases, it might also be
worth testing origin/main, in case there have been any relevant changes
since 1.13.1.

thanks,

-Phil

On 9/8/21 1:47 PM, Clement Barthelemy wrote:
> Hi Phil,
>
> I'll probably run some more tests with libfabric 1.13 to see if I can get some more useful information for the libfabric developers.
>
> In the meantime, I think your suggestion is reasonable. I've seen at least one other project using Margo that pins a specific libfabric version at installation.
>
> Thanks,
>
> Clément
>
>
> ----- Original Message -----
>> From: "Phil Carns" <carns at mcs.anl.gov>
>> To: "Clement Barthelemy" <clement.barthelemy at inria.fr>
>> Cc: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
>> Sent: Wednesday, September 8, 2021 19:27:14
>> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow
>> Thanks Clément (both for confirming the behavior and for reaching out to
>> the libfabric developers).
>>
>> I was kind of waiting to see if we got a response there but no such luck
>> yet :)
>>
>> We are kind of stuck here because my impression is that the sockets
>> provider is at this point intended mostly as a reference implementation
>> (and thus won't get much tuning on fundamental things like threading and
>> wait object performance).  On the other hand, the obvious replacement is
>> the tcp/rxm provider stack, but the tcp provider has some notable
>> deadlock bugs in the 1.13.x libfabric releases.
>>
>> I know this isn't a very satisfying answer, but I think I would probably
>> recommend using tcp/rxm with libfabric 1.12.x for now if you need TCP/IP
>> support.
>>
>> To our knowledge the other providers in the 1.13.x releases are fine.
>>
>> thanks,
>>
>> -Phil
>>
>>
>> On 9/2/21 10:16 AM, Clement Barthelemy wrote:
>>> OK, that did the trick: with busy-polling, the sockets provider now reaches
>>> the same order of magnitude as the others, at 5.90e-5 s/RPC.
>>>
>>> Is there anything I can do to help fix this? Do you need a bug report or
>>> something?
>>>
>>> Clément
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Clement Barthelemy" <clement.barthelemy at inria.fr>
>>>> To: "Phil Carns" <carns at mcs.anl.gov>
>>>> Cc: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
>>>> Sent: Thursday, September 2, 2021 15:53:22
>>>> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow
>>>> Hi Phil,
>>>>
>>>> I wrote a test with a simple RPC that takes an integer argument and returns an
>>>> integer. The client calls it 100,000 times, and I simply run the client under
>>>> the Unix "time" command. My reasoning was that this would not saturate the
>>>> bandwidth, so I'd be able to see the latency.
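
A minimal sketch of this kind of measurement, assuming the timing is done inside the client instead of with the "time" command (which also counts process startup and initialization). The names seconds_per_call, rpc_fn and count_call are hypothetical; a real client would pass a function that forwards one RPC and waits for its response.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical callback type standing in for "issue one RPC and wait". */
typedef void (*rpc_fn)(void *arg);

/* Call fn(arg) n times and return the mean wall-clock seconds per call. */
static double seconds_per_call(rpc_fn fn, void *arg, long n)
{
    struct timespec start, end;
    assert(fn != NULL && n > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < n; i++)
        fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / (double)n;
}

/* Trivial stand-in for an RPC: counts how many times it was invoked. */
static void count_call(void *arg)
{
    long *counter = arg;
    (*counter)++;
}
```

Calling seconds_per_call(count_call, &counter, 100000) returns a value in the same s/RPC unit used in this thread; with a real RPC in place of count_call, each call is a full round trip, so the mean approximates latency.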
>>>>
>>>> Thanks for the advice, I'll try the busy-polling and report back.
>>>>
>>>> Clément
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Phil Carns" <carns at mcs.anl.gov>
>>>>> To: "mochi-devel" <mochi-devel at lists.mcs.anl.gov>
>>>>> Sent: Thursday, September 2, 2021 15:15:35
>>>>> Subject: Re: [Mochi-devel] Margo + sockets provider inexplicably slow
>>>>> Hi Clément,
>>>>> What benchmark are you using to generate these numbers?
>>>>> My first guess would be a difference in polling strategy (how frequently
>>>>> HG_Progress() is being called, and with what timeout value), and how well the
>>>>> provider handles that.
>>>>> One quick and dirty way to test this theory would be to set
>>>>> margo_init_info -> hg_init_info -> na_init_info -> progress_mode to
>>>>> NA_NO_BLOCK before initializing margo with margo_init_ext(). There is an
>>>>> example that does this here:
>>>>> https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-latency.c#L105
>>>>> That's tedious from an API perspective, but the reason it might be informative
>>>>> as a quick hack is that it will force Mercury to busy poll on the underlying
>>>>> transport no matter what margo or other callers are actually asking it to do.
>>>>> It effectively short circuits any higher level polling strategy decisions.
>>>>> Depending on what that tells us, we can go from there. I suspect that the
>>>>> sockets provider mechanism for waiting for events (as opposed to just polling
>>>>> for events) might be problematic.
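
The field nesting described above can be sketched as follows. This is an illustration, not the code from the linked example: request_busy_poll is a hypothetical helper, and USE_REAL_MARGO is a made-up macro. Unless it is defined, the block uses stand-in struct definitions that contain only the fields named in the message, so that it compiles without Margo or Mercury installed; the stand-in value of NA_NO_BLOCK is a placeholder.

```c
#include <assert.h>
#include <string.h>

#ifdef USE_REAL_MARGO
#include <margo.h> /* real definitions of the structs and NA_NO_BLOCK */
#else
/* Stand-ins (NOT the real headers): just enough structure to show the
 * nesting named in the message. The value of NA_NO_BLOCK is a placeholder. */
#define NA_NO_BLOCK 0x01
struct na_init_info { int progress_mode; };
struct hg_init_info { struct na_init_info na_init_info; };
struct margo_init_info { struct hg_init_info *hg_init_info; };
#endif

/* Hypothetical helper: zero hii, ask for busy polling, and point mii at it.
 * Both structs must stay valid until the later margo_init_ext() call. */
static void request_busy_poll(struct margo_init_info *mii,
                              struct hg_init_info *hii)
{
    assert(mii != NULL && hii != NULL);
    memset(hii, 0, sizeof(*hii));
    hii->na_init_info.progress_mode = NA_NO_BLOCK;
    mii->hg_init_info = hii;
}
```

With the real headers, a caller would zero-initialize a struct margo_init_info, call request_busy_poll(&mii, &hii), and then pass &mii as the last argument of margo_init_ext().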
>>>>> thanks,
>>>>> -Phil
>>>>> On 9/2/21 8:29 AM, Clement Barthelemy wrote:
>>>>>> Hello all,
>>>>>> I did some latency measurement to compare Mercury and Margo with different
>>>>>> providers, the results are below:
>>>>>>                     Mercury (s/RPC)  Margo (s/RPC)
>>>>>> ofi+psm2            6.21e-5          5.01e-5
>>>>>> ofi+tcp;ofi_rxm     8.20e-5          9.55e-5
>>>>>> ofi+sockets         7.54e-5          2.08e-2 (!)
>>>>>> As you can see, Margo with the sockets provider is roughly 250 times slower
>>>>>> than the rest. I first suspected libfabric, but Mercury on its own does not
>>>>>> have the problem. Do you know what could be causing this?
>>>>>> I've tested with Margo 0.9.5, Mercury 2.0.1 and libfabric 1.12.1 & 1.13.1.
>>>>>> Thanks,
>>>>>> Clément
>>>>>> _______________________________________________
>>>>>> mochi-devel mailing list
>>>>>> mochi-devel at lists.mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mochi-devel
>>>>>> https://www.mcs.anl.gov/research/projects/mochi

