[mpich2-dev] MPI_Alloc_mem ignores info argument and fails to register memory

Jeff Hammond jhammond at alcf.anl.gov
Tue Sep 6 17:44:53 CDT 2011


I would like to be able to use an info argument to instruct
MPI_Alloc_mem to register pinned buffers in order to maximize
performance of RMA on networks that support/require this.  Currently,
no MPICH2-derived implementation I have investigated (MPICH2,
MVAPICH2, BGP-MPI) even considers the info argument, and therefore has
no opportunity to optimize RMA using RMA-oriented buffers.  Rather,
the first RMA call with any given buffer pays the registration
overhead, which Jim Dinan has demonstrated to have a noticeable
performance impact relative to both ARMCI and a simulation of what
would happen if MPI_Alloc_mem did what I consider the right thing,
namely returned pre-registered buffers.

On the other hand, the ultra-modern and extremely well-designed
OpenMPI parses the info argument and provides an implementation of
preregistration when it is desired.  Note that this comment is only an
attempt to troll Pavan and should not be taken too seriously, although
I do think that OpenMPI is doing the right thing by providing the user
the option of helping MPI make an intelligent decision internally.

The following are the comparative call paths of the two MPI
implementations under consideration:

MPICH2 trunk:

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
void *MPID_Alloc_mem( size_t size, MPID_Info *info_ptr )
void *MPIDI_Alloc_mem( size_t size, MPID_Info *info_ptr )
MPIU_Malloc(size);

OpenMPI 1.4.3:

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
void *mca_mpool_base_alloc(size_t size, ompi_info_t *info)
<stuff that actually does memory registration in appropriate cases>

On a related subject, at the EPFL-CECAM workshop I participated in
this week, a CP2K developer commented that MPI RMA performance would
be better if, like IBM-MPI, MPICH2-derived implementations like
CrayMPI for Gemini took an info argument that allowed the user to
request immediate firing of e.g. Put, rather than the
wait-until-the-last-minute-and-pack-it approach currently employed in
CH3 (I haven't read the source but multiple MPICH2 developers have
said that this is the case).  Modern networks are very unlike Ethernet
in their ability to handle rapid injection of many small packets (Cray
Gemini is a perfect example), and therefore the RMA implementation
should be flexible enough to exploit such Ethernot networks.  I
know from a direct implementation of noncontiguous operations in DCMF
that packing is unsuitable in many cases, particularly when the user
wants true passive-target progress without user interrupts.  This is
actually the use case of my collaborator at Juelich.

Anyway, neither of my points is particularly new information to Jim
and Pavan, but I wanted to summarize it all here now that I have more
specific information to add, particularly the apparent superiority of
OpenMPI to MPICH2 in one particular instance :-)

Best,

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond
