To present a &quot;nice&quot; API, I have a choice between reusing windows internally, which will usually involve a local memcpy(), and creating new windows that address the user&#39;s memory directly (after checking that it is suitably aligned). I foresee caching and reusing windows for small operations and creating new windows for sufficiently large ones. In the MPICH2 implementation, I see an MPI_Allgather() of 3 MPI_Aint for ch3 and of a 32-byte struct for DCMF.<div>

<br></div><div>1. Is the implementation likely to always perform this synchronizing, not strictly memory-scalable MPI_Allgather()?</div><div><br></div><div>2. How expensive should I consider this operation to be? Are micro-benchmark results at 10k-100k cores available anywhere?</div>

<div><br></div><div>I&#39;ll end up implementing both versions, but it would be nice to know how urgent it is likely to be and have a guess for where to place the threshold.</div>
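<div><br></div><div>For concreteness, here is a rough sketch of the dispatch I have in mind. The names choose_strategy, win_strategy, REQUIRED_ALIGNMENT and SIZE_THRESHOLD are made up for illustration and are not part of MPI or MPICH2; the 16-byte alignment and 64 KiB threshold are placeholders, not measured values.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here is an MPI or MPICH2 API. */
typedef enum {
    USE_CACHED_WINDOW, /* memcpy() into a window that already exists */
    CREATE_NEW_WINDOW  /* expose the user's buffer through a fresh window */
} win_strategy;

/* Placeholder values (assumptions), to be replaced by measurements. */
#define REQUIRED_ALIGNMENT ((uintptr_t)16)
#define SIZE_THRESHOLD     ((size_t)(64 * 1024))

/* Pick a strategy for one operation on buf[0 .. nbytes). */
static win_strategy choose_strategy(const void *buf, size_t nbytes)
{
    assert(buf != NULL);

    /* A new window is only an option if the user's memory is aligned. */
    int aligned = ((uintptr_t)buf % REQUIRED_ALIGNMENT) == 0;

    /* Large, aligned buffers pay for window creation to skip the copy;
       everything else is copied into a cached window. */
    if (aligned && nbytes >= SIZE_THRESHOLD)
        return CREATE_NEW_WINDOW;
    return USE_CACHED_WINDOW;
}
```

<div>The answers to the two questions above would mostly determine what SIZE_THRESHOLD should be, and whether it needs to grow with the number of processes.</div>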