[mpich2-dev] MPI_Alloc_mem ignores info argument and fails to register memory

Jeff Hammond jhammond at alcf.anl.gov
Tue Sep 6 23:53:09 CDT 2011


Hi Dave,

Thanks for the explanation.  I am in a state of optimism about my
ability to contribute right now, which could be dangerous :-)

> It should be pretty easy if you look at it for a little while.  I think you just want to convert MPIDI_Alloc_mem in CH3 to be a function pointer instead of a fully defined function.  Then rename the current MPIDI_Alloc_mem to be MPIDI_Alloc_mem_default and use that as the static initializer for the MPIDI_Alloc_mem function pointer.

That's reasonable enough.

> Then in nemesis you'll need to create an implementation function to be installed as the new MPIDI_Alloc_mem pointer value.  If you want to support custom registration on a per-netmod basis (which we probably do) then you'll need to add a new routine to the netmod API to forward this call down the stack.  This last bit will be the most annoying part because of all of the new boilerplate in each of the netmods, but it shouldn't actually be difficult.

BGP requires trivial registration; I loosely understand IBV, and I've
at least read about DMAPP.  It would help to have more background on
the performance and limitations of such routines before finalizing the
design decision.  It seems I need to follow the Bill Gropp MPI 0.1
approach and just resolve all debates by producing an implementation
that can be tested.  I have a few different ideas right now that I
think I can implement.

> http://wiki.mcs.anl.gov/mpich2/index.php/Nemesis_Network_Module_API

I will look at it.

> Also, anyone can put a feature request into the trac system.  It doesn't require special privileges, I think it just doesn't permit anonymous ticket creation/commenting.

While logged into Trac, I tried to claim the test ticket for which you
encouraged me to submit a patch, but couldn't.  I'll try again.

Thanks,

Jeff

> On Sep 6, 2011, at 6:42 PM CDT, Jeff Hammond wrote:
>
>> I don't think I have Trac developer rights on MPICH2 yet.
>>
>> In any case, I am reasonably confident I can implement a patch at the
>> device level for BG but that isn't useful here.  Someone with more
>> experience inside of ch3 will have to figure out where to setup an
>> interface for registration.  Maybe your efforts will be informative in
>> this regard.
>>
>> Jeff
>>
>> On Wed, Sep 7, 2011 at 1:30 AM, Howard Pritchard <howardp at cray.com> wrote:
>>> Hi Jeff,
>>>
>>> You'll be happy to know that cray already has an internal RFE filed
>>> against the cray mpich2 for exactly this kind of support.
>>> So it's in the queue.
>>>
>>> Maybe you should open a Trac ticket against Argonne MPICH2?
>>>
>>> Howard
>>>
>>> Jeff Hammond wrote:
>>>> I would like to be able to use an info argument to instruct
>>>> MPI_Alloc_mem to register pinned buffers in order to maximize
>>>> performance of RMA on networks that support/require this.  Currently,
>>>> no MPICH2-derived implementation I have investigated (MPICH2,
>>>> MVAPICH2, BGP-MPI) even considers the info argument, and therefore has
>>>> no opportunity to optimize RMA using RMA-oriented buffers.  Rather,
>>>> the first RMA call with any buffer incurs registration overhead,
>>>> which Jim Dinan has demonstrated to have a noticeable impact on
>>>> performance relative to ARMCI, as well as to a simulation of what
>>>> would happen if MPI_Alloc_mem did what I consider the right thing,
>>>> namely pre-registering buffers.
>>>>
>>>> On the other hand, the ultra-modern and extremely well-designed
>>>> OpenMPI parses the info argument and provides an implementation of
>>>> preregistration when it is desired.  Note that this comment is only an
>>>> attempt to troll Pavan and should not be taken too seriously, although
>>>> I do think that OpenMPI is doing the right thing by providing the user
>>>> the option of helping MPI make an intelligent decision internally.
>>>>
>>>> The following are the comparative call paths of the two MPI
>>>> implementations under consideration:
>>>>
>>>> MPICH2 trunk:
>>>>
>>>> int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>>> void *MPID_Alloc_mem( size_t size, MPID_Info *info_ptr )
>>>> void *MPIDI_Alloc_mem( size_t size, MPID_Info *info_ptr )
>>>> MPIU_Malloc(size);
>>>>
>>>> OpenMPI 1.4.3:
>>>>
>>>> int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>>> void *mca_mpool_base_alloc(size_t size, ompi_info_t *info)
>>>> <stuff that actually does memory registration in appropriate cases>
>>>>
>>>> On a related subject, at the EPFL-CECAM workshop I participated in
>>>> this week, a CP2K developer commented that MPI RMA performance would
>>>> be better if, like IBM-MPI, MPICH2-derived implementations like
>>>> CrayMPI for Gemini took an info argument that allowed the user to
>>>> request immediate firing of e.g. Put, rather than the
>>>> wait-until-the-last-minute-and-pack-it approach currently employed in
>>>> CH3 (I haven't read the source but multiple MPICH2 developers have
>>>> said that this is the case).  Modern networks are very unlike Ethernet
>>>> in their ability to handle rapid injection of many small packets (Cray
>>>> Gemini is a perfect example), and therefore RMA should be flexible
>>>> enough to permit an implementation tuned for an Ethernot network.  I
>>>> know from a direct implementation of noncontiguous operations in DCMF
>>>> that packing is unsuitable in many cases, particularly when the user
>>>> wants true passive-target progress without user interrupts.  This is
>>>> actually the use case of my collaborator at Juelich.
>>>>
>>>> Anyways, neither of my points is particularly new information to Jim
>>>> and Pavan, but I wanted to summarize it all here now that I have more
>>>> specific information to add, particularly the apparent superiority of
>>>> OpenMPI to MPICH2 in one particular instance :-)
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>
>>>
>>> --
>>> Howard Pritchard
>>> Software Engineering
>>> Cray, Inc.
>>>
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/index.php/User:Jhammond
>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond

