[mpich-discuss] Why do predefined MPI_Ops function elementwise in MPI_Accumulate, but not in MPI-1 routines?

Jim Dinan dinan at mcs.anl.gov
Tue Apr 24 14:35:06 CDT 2012


On 4/24/12 12:57 PM, Dave Goodell wrote:
> On Apr 24, 2012, at 9:37 AM CDT, Jed Brown wrote:
>
>> If the above is a reasonable thing for a user to request, then either (a) MPI_Accumulate must accept user-defined MPI_Ops, (b) MPI_SUM must operate on the base elements of a derived type for collectives (instead of just for MPI_Accumulate), or (c) the issue is delayed by adding a non-deprecated predefined std::complex<double> type.
>>
>> I think that both (a) and (b) should be done because then we could do __float128 or quaternions without having to change the standard. (I cannot currently use __float128 with one-sided, so any time I use one-sided because it's a better algorithmic fit, I will also have to implement the algorithm using MPI-1 so that __float128 works.)
>
> (a) is potentially doable, but you'll need additional functionality to register operations that doesn't currently exist in MPI.  Also, it would be difficult to pass in the MPI Forum.

We had a lot of discussion about user-defined ops in MPI_Accumulate as 
part of the MPI-3 RMA work.  It was a potential way to sneak in some 
active-message-like functionality.  Ultimately, it wasn't incorporated 
into the RMA proposal.

One of the primary problems with supporting user-defined operations in 
accumulate is that an efficient implementation essentially requires an 
asynchronous agent at the target.  Because it's an accumulate, the agent 
would have to ensure that the operation is applied atomically (w.r.t. 
MPI_Accumulate and the new MPI 3.0 MPI_Get_accumulate, MPI_Fetch_and_op, 
MPI_Compare_and_swap, etc.).  Serializing around an arbitrary user 
function would negatively impact all the other atomic operations that we 
can do directly in hardware.

Even if we are willing to accept a performance penalty or find a way 
around it, creating an MPI op is a local operation.  We would need to 
extend MPI with some way of locating the op at the target (if we require 
both source and target to register the same op) and executing it there.

> (b) is much more feasible, assuming you apply the same restrictions about having only a single basic datatype.

This is doable, but tricky from a standardization point-of-view.  It 
would be straightforward to support as an extension.

The proposed reduction extension would generate an output element with a 
different datatype than the input.  The output datatype isn't specified, 
but it can be inferred from the base type of the input derived datatype; 
we would also need to check that all processes agree on that base type. 
I'm also not crazy about defining different behavior for built-in 
operations based on the datatype argument.  User-defined ops would, 
presumably, never be able to have this per-base-datatype behavior.

  ~Jim.
