[mpich2-dev] MPI_Alloc_mem ignores info argument and fails to register memory

Jeff Hammond jhammond at alcf.anl.gov
Wed Sep 7 15:51:41 CDT 2011


I originally wrote this to Darius but it might be worth sharing and
contains nothing proprietary.

DCMF requires memory registration for DMA, but registration is trivial
thanks to CNK's lack of virtual memory.  Nevertheless, BGP-MPICH2 still
just calls MPIU_Malloc inside MPI_Alloc_mem.  The missing registration
has no impact on one-sided latency because MPI_Win_create already
creates a vector of DCMF_Memregions, one entry per rank in win->comm,
although this might be considered a non-scalable solution (ARMCI does
the same thing).

So basically, this is a non-issue on BGP because of the window design
and because of a failure to exploit BGP-specific behavior of DCMF (on
BGP one can trivially register the entire address space and just use
offsets, but doing so would break the theoretical portability of DCMF
to other networks).
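
For concreteness, the trivial registration on CNK looks roughly like
the following; I am writing the DCMF names from memory rather than
checking dcmf.h, so treat it as a sketch:

   #include <dcmf.h>

   /* Sketch: CNK has no virtual memory, so a single registration
      covering the whole (flat) address space suffices; remote
      one-sided targets then reduce to plain offsets. */
   DCMF_Memregion_t memregion;
   size_t bytes_out;  /* how much was actually registered */
   DCMF_Result rc = DCMF_Memregion_create(&memregion, &bytes_out,
                                          (size_t)-1, /* ask for everything */
                                          NULL,       /* base address */
                                          0);         /* options */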

My primary interest in this is IB and Gemini, where active-message
support is lacking and registration is bloody expensive.  One reason
to enable registration is that on Cray Gemini there are cases where
one might want to use MPI_Alloc_mem with an info argument instructing
it to use the symmetric heap (which _requires_ a symmetric allocation
request on every rank and is mildly dangerous if a user failed to
RTFM).  In exchange, this would allow very efficient one-sided
communication on a window created from these buffers (MPI-3 fixes
this with MPI_Win_allocate, but that won't be available on Gemini for
a while, if ever).

If I wanted to improve Cray MPICH2 one-sided, I would add info
arguments for both alloc_mem and win_create that would allow me to
make a window on the symmetric heap (win_create would just verify
that I did alloc_mem symmetrically; if the user screws up, it could
free the symmetric-heap allocation and fall back to a non-symmetric
implementation).  This would make a DMAPP implementation very easy
and extremely scalable, because windows on dupes of comm_world could
be O(1) data structures.
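
To make that concrete, the user-facing side might look like this; the
info key is hypothetical (nothing honors it today), and the
requirement that every rank make a matching call is exactly the RTFM
hazard above:

   MPI_Aint nbytes = 1<<20;
   MPI_Info info;
   MPI_Info_create(&info);
   /* hypothetical key requesting symmetric-heap placement */
   MPI_Info_set(info, "alloc_symmetric", "true");

   void *base;
   MPI_Alloc_mem(nbytes, info, &base);  /* must be symmetric across ranks */
   MPI_Info_free(&info);

   MPI_Win win;
   /* win_create could verify symmetric placement and fall back to a
      non-symmetric implementation otherwise */
   MPI_Win_create(base, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);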

On IB, I think this could improve performance for O(KB) messages,
which are too large for the obviously-buffered path but not large
enough that the cost of an on-the-fly ibv_reg_mr call becomes
irrelevant.  This is just a guess, though.
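
As a sketch of what an info-aware Alloc_mem could do on IB (the
protection domain is assumed to come from the device layer, and the
registration cache is hand-waved):

   #include <stdlib.h>
   #include <infiniband/verbs.h>

   /* register at allocation time instead of at first RMA use */
   void *alloc_mem_registered(struct ibv_pd *pd, size_t nbytes,
                              struct ibv_mr **mr_out)
   {
       void *buf = malloc(nbytes);
       if (buf == NULL) return NULL;
       *mr_out = ibv_reg_mr(pd, buf, nbytes,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
       /* a real implementation would cache (buf -> mr) so later RMA
          on this buffer skips ibv_reg_mr entirely */
       return buf;
   }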

Best,

Jeff

On Wed, Sep 7, 2011 at 7:49 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> <off-list>
>
> Hey Jeff,  I'm just curious what you need Alloc_mem to register memory for.  In general, yeah, it would be good, but doesn't BG already deal with this?
>
> -d
>
> On Sep 6, 2011, at 11:53 PM, Jeff Hammond wrote:
>
>> Hi Dave,
>>
>> Thanks for the explanation.  I am in a state of optimism about my
>> ability to contribute right now, which could be dangerous :-)
>>
>>> It should be pretty easy if you look at it for a little while.  I think you just want to convert MPIDI_Alloc_mem in CH3 to be a function pointer instead of a fully defined function.  Then rename the current MPIDI_Alloc_mem to be MPIDI_Alloc_mem_default and use that as the static initializer for the MPIDI_Alloc_mem function pointer.
>>
>> That's reasonable enough.
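>>
>> To check my understanding, here's a minimal sketch (reusing the
>> existing MPIDI_Alloc_mem signature; the default and netmod names
>> below are my guesses, not from the tree):
>>
>>   /* rename the current definition to be the default */
>>   void *MPIDI_Alloc_mem_default(size_t size, MPID_Info *info_ptr);
>>
>>   /* device-level function pointer, statically initialized */
>>   void *(*MPIDI_Alloc_mem)(size_t size, MPID_Info *info_ptr)
>>       = MPIDI_Alloc_mem_default;
>>
>>   void *MPID_nem_ib_alloc_mem(size_t size, MPID_Info *info_ptr);
>>
>>   /* hypothetical netmod init installing custom registration */
>>   void MPID_nem_ib_init(void)
>>   {
>>       MPIDI_Alloc_mem = MPID_nem_ib_alloc_mem;
>>   }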
>>
>>> Then in nemesis you'll need to create an implementation function to be installed as the new MPIDI_Alloc_mem pointer value.  If you want to support custom registration on a per-netmod basis (which we probably do) then you'll need to add a new routine to the netmod API to forward this call down the stack.  This last bit will be the most annoying part because of all of the new boilerplate in each of the netmods, but it shouldn't actually be difficult.
>>
>> On BGP registration is trivial, I loosely understand IBV, and I've
>> at least read about DMAPP.  It would help to have more background on
>> the performance and limitations of such routines before committing to
>> a design decision.  It seems I need to follow the Bill Gropp MPI 0.1
>> approach and just resolve all debates by producing an implementation
>> that can be tested.  I have a few different ideas right now that I
>> think I can implement.
>>
>>> http://wiki.mcs.anl.gov/mpich2/index.php/Nemesis_Network_Module_API
>>
>> I will look at it.
>>
>>> Also, anyone can put a feature request into the trac system.  It doesn't require special privileges, I think it just doesn't permit anonymous ticket creation/commenting.
>>
>> While logged into Trac, I tried to claim the test ticket you
>> encouraged me to submit a patch for, but I couldn't.  I'll try again.
>>
>> Thanks,
>>
>> Jeff
>>
>>> On Sep 6, 2011, at 6:42 PM CDT, Jeff Hammond wrote:
>>>
>>>> I don't think I have Trac developer rights on MPICH2 yet.
>>>>
>>>> In any case, I am reasonably confident I can implement a patch at the
>>>> device level for BG but that isn't useful here.  Someone with more
>>>> experience inside of ch3 will have to figure out where to setup an
>>>> interface for registration.  Maybe your efforts will be informative in
>>>> this regard.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Sep 7, 2011 at 1:30 AM, Howard Pritchard <howardp at cray.com> wrote:
>>>>> Hi Jeff,
>>>>>
>>>>> You'll be happy to know that Cray already has an internal RFE
>>>>> filed against Cray MPICH2 for exactly this kind of support, so
>>>>> it's in the queue.
>>>>>
>>>>> Maybe you should open a Trac ticket against Argonne MPICH2?
>>>>>
>>>>> Howard
>>>>>
>>>>> Jeff Hammond wrote:
>>>>>> I would like to be able to use an info argument to instruct
>>>>>> MPI_Alloc_mem to register (pin) buffers in order to maximize RMA
>>>>>> performance on networks that support or require registration.
>>>>>> Currently, no MPICH2-derived implementation I have investigated
>>>>>> (MPICH2, MVAPICH2, BGP-MPI) even considers the info argument, and
>>>>>> therefore none has any opportunity to optimize RMA using
>>>>>> RMA-oriented buffers.  Instead, the first RMA call with any given
>>>>>> buffer pays the registration overhead, which Jim Dinan has shown
>>>>>> to noticeably hurt performance relative to both ARMCI and a
>>>>>> simulation of what would happen if MPI_Alloc_mem did what I
>>>>>> consider the right thing, namely return pre-registered buffers.
>>>>>>
>>>>>> On the other hand, the ultra-modern and extremely well-designed
>>>>>> OpenMPI parses the info argument and provides an implementation of
>>>>>> preregistration when it is desired.  Note that this comment is only an
>>>>>> attempt to troll Pavan and should not be taken too seriously, although
>>>>>> I do think that OpenMPI is doing the right thing by providing the user
>>>>>> the option of helping MPI make an intelligent decision internally.
>>>>>>
>>>>>> The following are the comparative call paths of the two MPI
>>>>>> implementations under consideration:
>>>>>>
>>>>>> MPICH2 trunk:
>>>>>>
>>>>>> int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>>>>> void *MPID_Alloc_mem( size_t size, MPID_Info *info_ptr )
>>>>>> void *MPIDI_Alloc_mem( size_t size, MPID_Info *info_ptr )
>>>>>> MPIU_Malloc(size);
>>>>>>
>>>>>> OpenMPI 1.4.3:
>>>>>>
>>>>>> int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>>>>> void *mca_mpool_base_alloc(size_t size, ompi_info_t *info)
>>>>>> <stuff that actually does memory registration in appropriate cases>
>>>>>>
>>>>>> On a related subject, at the EPFL-CECAM workshop I participated in
>>>>>> this week, a CP2K developer commented that MPI RMA performance would
>>>>>> be better if MPICH2-derived implementations like Cray MPI for Gemini
>>>>>> took an info argument, as IBM MPI does, that allowed the user to
>>>>>> request immediate firing of, e.g., Put, rather than the
>>>>>> wait-until-the-last-minute-and-pack-it approach currently employed in
>>>>>> CH3 (I haven't read the source, but multiple MPICH2 developers have
>>>>>> said this is the case).  Modern networks are very unlike Ethernet in
>>>>>> their ability to handle rapid injection of many small packets (Cray
>>>>>> Gemini is a perfect example), so RMA should be flexible enough to
>>>>>> accommodate an implementation tuned for an Ethernot network.  I know
>>>>>> from a direct implementation of noncontiguous operations in DCMF
>>>>>> that packing is unsuitable in many cases, particularly when the user
>>>>>> wants true passive-target progress without user interrupts.  This is
>>>>>> actually the use case of my collaborator at Juelich.
>>>>>>
>>>>>> Anyways, neither of my points is particularly new information to Jim
>>>>>> and Pavan, but I wanted to summarize it all here now that I have more
>>>>>> specific information to add, particularly the apparent superiority of
>>>>>> OpenMPI to MPICH2 in one particular instance :-)
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Howard Pritchard
>>>>> Software Engineering
>>>>> Cray, Inc.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/index.php/User:Jhammond
>>>
>>>
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/index.php/User:Jhammond
>
>





-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond

