We currently have Vec{Norm,Dot,TDot,MDot,MTDot}{Begin,End}(), which allow reductions to be aggregated, but the reduction itself is only triggered lazily (by the first XXEnd() call) and always uses MPI_Allreduce(). MPICH2 has implemented MPI_Iallreduce() (named MPIX_Iallreduce() until MPI-3 is finalized). I suggest adding PetscCommSplitReductionBegin(MPI_Comm), which would start any currently queued split reductions. A latency-tolerant algorithm could then be written as<div>

<br></div><div>VecNormBegin(...,&nrm);</div><div>VecDotBegin(...,&dot);</div><div>PetscCommSplitReductionBegin(comm);</div><div>MatMult(...); // Or residual evaluation, etc.</div><div>VecNormEnd(...,&nrm);</div>

<div>VecDotEnd(...,&dot);</div><div><br></div><div><br></div><div>PetscCommSplitReductionBegin() would start the split reduction, leaving behind an MPI_Request that would be waited on by the first XXEnd().</div><div><br>

</div><div>If MPIX_Iallreduce() is not available, the current semantics would be used. If you don't call PetscCommSplitReductionBegin(), the present semantics would also be used.</div><div><br></div><div>Is this a good API?<br>

</div><div><br></div><div><br></div><div>I'd also like to propose a design change in which the PetscSplitReduction is stored in _p_Vec when it is retrieved from the MPI_Comm. The MPI_Comm would continue to hold a reference until PetscCommSplitReductionBegin(), at which point it drops its reference, so the only remaining references are those held by the Vecs participating in the split reduction. That would allow other objects using the same communicator to start a new split reduction before all participants have collected the results of the previous one.</div>
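<div><br></div><div>To make the proposed ownership handoff concrete, here is a minimal single-process mock in plain C. It is only a sketch under stated assumptions: every Sketch* name is hypothetical (none of them is PETSc API), the "reduction" is just a buffer copy standing in for a posted MPIX_Iallreduce(), and the reference counting is my guess at one way the design could look, not an actual implementation.</div>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins only; none of these are PETSc types or functions. */
#define SKETCH_MAX 8

typedef struct {
  int    refct;              /* holders: the communicator and participating vectors */
  int    started;            /* set once the (mock) reduction has been posted */
  int    n;                  /* number of queued entries */
  double send[SKETCH_MAX];   /* locally queued contributions */
  double recv[SKETCH_MAX];   /* reduced results, valid once started */
} SketchSR;

typedef struct { SketchSR *sr; } SketchComm;          /* role of the MPI_Comm attribute */
typedef struct { SketchSR *sr; int slot; } SketchVec; /* role of the field in _p_Vec */

static int sketch_live = 0;  /* live SketchSR objects, used to check for leaks */

static void SketchSRDeref(SketchSR **sr)
{
  if (*sr && --(*sr)->refct == 0) { free(*sr); sketch_live--; }
  *sr = NULL;
}

/* Role of XXBegin(): queue a value and keep a reference in the vector. */
static void SketchVecBegin(SketchComm *comm, SketchVec *v, double local)
{
  if (!comm->sr) {                        /* no open reduction: create one */
    comm->sr = calloc(1, sizeof(SketchSR));
    if (!comm->sr) abort();
    comm->sr->refct = 1;                  /* reference held by the communicator */
    sketch_live++;
  }
  assert(comm->sr->n < SKETCH_MAX);
  v->slot = comm->sr->n++;
  comm->sr->send[v->slot] = local;
  comm->sr->refct++;                      /* reference held by the vector */
  v->sr = comm->sr;
}

/* Role of PetscCommSplitReductionBegin(): start the reduction, then let go. */
static void SketchCommBegin(SketchComm *comm)
{
  if (!comm->sr) return;                  /* nothing queued */
  /* Mock of posting the nonblocking reduction: on one process it is a copy. */
  memcpy(comm->sr->recv, comm->sr->send, sizeof(comm->sr->recv));
  comm->sr->started = 1;
  SketchSRDeref(&comm->sr);               /* communicator drops its reference */
}

/* Role of XXEnd(): start lazily if nobody did, read the result, drop the ref. */
static double SketchVecEnd(SketchComm *comm, SketchVec *v)
{
  double result;
  if (!v->sr->started) {                  /* present (lazy) semantics */
    assert(comm->sr == v->sr);
    SketchCommBegin(comm);
  }
  result = v->sr->recv[v->slot];
  SketchSRDeref(&v->sr);
  return result;
}

/* Demo: a second reduction is queued before the first one is collected.
   Returns the number of reduction objects still alive at the end. */
int SketchDemo(void)
{
  SketchComm comm = {NULL};
  SketchVec  nrm = {NULL, 0}, dot = {NULL, 0}, other = {NULL, 0};

  SketchVecBegin(&comm, &nrm, 3.0);
  SketchVecBegin(&comm, &dot, 4.0);
  SketchCommBegin(&comm);                 /* only nrm and dot hold it now */
  SketchVecBegin(&comm, &other, 5.0);     /* new reduction on the same comm */
  assert(other.sr != nrm.sr);
  (void)SketchVecEnd(&comm, &nrm);
  (void)SketchVecEnd(&comm, &dot);        /* last holder frees the first one */
  (void)SketchVecEnd(&comm, &other);      /* never begun: lazy fallback path */
  return sketch_live;
}
```

<div>In this mock, once SketchCommBegin() has run, the communicator no longer points at the in-flight reduction, so the next SketchVecBegin() opens a fresh one while the earlier results are still uncollected. A reduction that is never explicitly begun falls back to being started by its first End call, which mirrors the present semantics.</div>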