<div class="gmail_quote">On Tue, Dec 20, 2011 at 16:51, Dave Goodell <span dir="ltr"><<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div id=":2ei">No, although it's not likely to be fixed within the next few months either. Some of this can become a bit better with MPI-3 RMA, although possibly not for your use case.<br>
<br>
I'm not sure if 100% of the synchronization will be able to be eliminated. But we can almost certainly fix all of the memory scalability issues, given enough software development effort.<br></div></blockquote><div><br>
If there were a way to create a window, post, and start without imposing a
hard synchronization, it would be useful. Alternatively, it would help if we
could "re-seat" an existing window by giving it different memory.
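To be concrete, the pattern I'd like to avoid repeating looks roughly like
the sketch below (function and argument names are mine, purely for
illustration). The hard synchronization comes from MPI_Win_create being
collective over the communicator, and it has to be reissued because there is
no way to point an existing window at new memory:

  #include <mpi.h>

  /* Sketch: one data exchange with freshly supplied memory.  Because MPI-2
   * has no way to re-seat an existing window, the window must be created
   * (and freed) around every exchange. */
  static void exchange_once(void *buf, MPI_Aint bytes,
                            MPI_Group origins, MPI_Group targets,
                            MPI_Comm comm)
  {
    MPI_Win win;
    MPI_Win_create(buf, bytes, 1, MPI_INFO_NULL, comm, &win); /* collective */
    MPI_Win_post(origins, 0, win);   /* expose local memory to origin group */
    MPI_Win_start(targets, 0, win);  /* open access epoch at target group   */
    /* ... MPI_Put / MPI_Get against win ... */
    MPI_Win_complete(win);
    MPI_Win_wait(win);
    MPI_Win_free(win);               /* collective again */
  }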
>> 2. How expensive should I consider this operation to be? Are there
>> micro-benchmark results scaling out to 10k-100k cores somewhere?
>
> Take a look at page 20 of this IBM slide deck that I found with a little
> bit of googling:
> http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/MPI%20Collective%20Communications%20on%20The%20Blue%20Gene%20P.pdf
>
> It does show microbenchmark performance for Blue Gene/P for a variety of
> message sizes. I'm guessing it's 16k processes based on the labels from
> the other plots. Unfortunately it doesn't show Allgather performance for
> the worst case, a non-MPI_COMM_WORLD, non-rectangular communicator. I'd
> say though that you are looking at something on the order of 1000 to
> 10000 us for this data size.
Thanks. That is about the time it takes to memcpy a few megabytes, a
reasonably typical data volume. I won't stress over getting both variants
implemented right away, but I think persistent windows with a copy will be
used more frequently. That variant also gives me the opportunity to do my
own packing (hidden from the user).
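Roughly what I have in mind for that variant is sketched below (names and
the plain-memcpy packing step are illustrative only): create the window once
over a long-lived buffer, then pack the user's data into that buffer before
each epoch, so no window creation appears in the repeated path.

  #include <mpi.h>
  #include <string.h>

  /* Persistent window + copy: the window is created once; each exchange
   * only packs data into the window memory and runs post/start. */
  typedef struct {
    MPI_Win  win;
    char    *buf;    /* window memory, owned by us  */
    MPI_Aint bytes;  /* size of the window in bytes */
  } PersistentWin;

  static void pwin_create(PersistentWin *p, MPI_Aint bytes, MPI_Comm comm)
  {
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &p->buf);
    p->bytes = bytes;
    MPI_Win_create(p->buf, bytes, 1, MPI_INFO_NULL, comm, &p->win); /* once */
  }

  static void pwin_exchange(PersistentWin *p, const void *data, size_t len,
                            MPI_Group origins, MPI_Group targets)
  {
    memcpy(p->buf, data, len);     /* packing, hidden from the user; could
                                      also scatter noncontiguous pieces   */
    MPI_Win_post(origins, 0, p->win);
    MPI_Win_start(targets, 0, p->win);
    /* ... MPI_Put / MPI_Get against p->win ... */
    MPI_Win_complete(p->win);
    MPI_Win_wait(p->win);
  }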
> I'm sure that someone at ALCF could point us towards some more useful
> data for BG/P and BG/Q. They would also have hard numbers on memcpy
> performance.

Memcpy is easy to estimate because it's essentially STREAM copy.
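That is, turning a published STREAM copy bandwidth into a time estimate is
one line of arithmetic; the bandwidth below is just a placeholder, not a
measured figure for any particular machine:

  #include <stdio.h>

  int main(void)
  {
    double bytes = 4e6;  /* "a few megabytes"                            */
    double bw    = 3e9;  /* placeholder STREAM copy bandwidth in bytes/s */
    printf("copy time ~ %.0f us\n", 1e6 * bytes / bw);  /* ~1333 us here */
    return 0;
  }

That lands in the same order of magnitude as the 1000 to 10000 us Allgather
figure above.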
>> I'll end up implementing both versions, but it would be nice to know how
>> urgent it is likely to be and have a guess for where to place the
>> threshold.
>
> I think it will be hard for us to predict with any certainty; it will
> depend substantially on the relative performance between the system's
> network and memcpy. A Blue Gene system will have a very different ratio
> than an Intel cluster, for example.

Yup, but it's still nice to know which orders of magnitude to look at.