[mpich-discuss] How expensive is MPI_Win_create() compared to memcpy()?

Tue Dec 20 16:51:28 CST 2011

On Dec 20, 2011, at 3:10 PM CST, Jed Brown wrote:

> To present a "nice" API, I have a choice between trying to reuse windows internally, which will usually involve a local memcpy(), and creating new windows to address user's memory (after checking that it is suitably aligned). I foresee caching and reusing windows for small operations and creating new windows for sufficiently large operations. In the MPICH2 implementation, I see an MPI_Allgather() with 3 MPI_Aint for ch3 and with a 32-byte struct for DCMF.
> 
> 1. Is the implementation likely to always perform this synchronizing and not strictly memory-scalable MPI_Allgather()?

No, although it's not likely to be fixed within the next few months either.  Some of this can become a bit better with MPI-3 RMA, although possibly not for your use case.

I'm not sure that all of the synchronization can be eliminated.  But given enough software development effort, we can almost certainly fix all of the memory scalability issues.

> 2. How expensive should I consider this operation to be? Are there micro-benchmark results scaling out to 10k-100k cores somewhere?

Take a look at page 20 of this IBM slide deck that I found with a little bit of googling: http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/MPI%20Collective%20Communications%20on%20The%20Blue%20Gene%20P.pdf

It shows microbenchmark performance on Blue Gene/P for a variety of message sizes.  I'm guessing it's 16k processes, based on the labels from the other plots.  Unfortunately it doesn't show Allgather performance for the worst case: a non-MPI_COMM_WORLD, non-rectangular communicator.  I'd say, though, that you are looking at something on the order of 1000 to 10000 us for this data size.

I'm sure that someone at ALCF could point us towards some more useful data for BG/P and BG/Q.  They would also have hard numbers on memcpy performance.

> I'll end up implementing both versions, but it would be nice to know how urgent it is likely to be and have a guess for where to place the threshold.

I think it will be hard for us to predict with any certainty, and it will depend substantially on the relative performance of the system's network and its memcpy.  A Blue Gene system will have a very different ratio than an Intel cluster, for example.
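
As a rough way to place the threshold, one could measure local memcpy throughput and compare the copy time with an assumed window-creation cost.  The sketch below is plain C with no MPI calls.  The function names memcpy_bytes_per_us and break_even_bytes are made up for illustration, and the creation cost passed in is a guess (such as the 1000 to 10000 us estimate above), not a measurement.  It also ignores any second copy out of the reused window and any synchronization effects beyond the creation cost itself.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: estimate local memcpy throughput in bytes per
 * microsecond by timing `reps` copies of `nbytes` with clock().
 * Returns -1.0 if allocation fails or the run is too short to time. */
double memcpy_bytes_per_us(size_t nbytes, int reps)
{
    if (nbytes == 0 || reps <= 0)
        return -1.0;

    unsigned char *src = malloc(nbytes);
    unsigned char *dst = malloc(nbytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, nbytes);
    memset(dst, 0, nbytes);

    /* The volatile sink and the changing source byte discourage the
     * compiler from dropping the copies as dead stores. */
    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        src[0] = (unsigned char)i;
        memcpy(dst, src, nbytes);
        sink ^= dst[nbytes - 1];
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);

    double elapsed_us = (double)(stop - start) * 1e6 / CLOCKS_PER_SEC;
    if (elapsed_us <= 0.0)
        return -1.0;
    return (double)nbytes * (double)reps / elapsed_us;
}

/* Hypothetical helper: message size (in bytes) at which one local copy
 * takes as long as the assumed window-creation cost.  Below this size,
 * copying into a cached window is estimated to be cheaper; above it,
 * creating a new window is.  Returns 0 for non-positive inputs. */
size_t break_even_bytes(double win_create_us, double bytes_per_us)
{
    if (win_create_us <= 0.0 || bytes_per_us <= 0.0)
        return 0;
    return (size_t)(win_create_us * bytes_per_us);
}
```

For example, at an assumed memcpy rate of 5000 bytes/us (about 5 GB/s), a 1000 us creation cost breaks even near 5 MB and a 10000 us cost near 50 MB.  The real numbers would have to come from measurements on the target machine.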

-Dave