<div class="gmail_quote">On Tue, Dec 20, 2011 at 16:51, Dave Goodell <span dir="ltr">&lt;<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":2ei">No, although it&#39;s not likely to be fixed within the next few months either.  Some of this can become a bit better with MPI-3 RMA, although possibly not for your use case.<br>

<br>

I&#39;m not sure if 100% of the synchronization will be able to be eliminated.  But we can almost certainly fix all of the memory scalability issues, given enough software development effort.<br></div></blockquote><div><br>

</div><div>If there were a way to create a window, post, and start without imposing a hard synchronization, it would be useful. Alternatively, if we could &quot;re-seat&quot; a window by giving it different memory...</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":2ei">

<div class="im"><br>

&gt; 2. How expensive should I consider this operation to be? Are there micro-benchmark results scaling out to 10k-100k cores somewhere?<br>

<br>

</div>Take a look at page 20 of this IBM slide deck that I found with a little bit of googling: <a href="http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/MPI%20Collective%20Communications%20on%20The%20Blue%20Gene%20P.pdf" target="_blank">http://www.scc.acad.bg/articles/library/BLue%20Gene%20P/MPI%20Collective%20Communications%20on%20The%20Blue%20Gene%20P.pdf</a><br>


<br>

It does show microbenchmark performance for Blue Gene/P for a variety of message sizes.  I&#39;m guessing it&#39;s 16k processes based on the labels from the other plots.  Unfortunately it doesn&#39;t show Allgather performance for the worst case, a non-MPI_COMM_WORLD, non-rectangular communicator.  I&#39;d say though that you are looking at something on the order of 1000 to 10000 us for this data size.<br>

</div></blockquote><div><br></div><div>Thanks. That is roughly the time it takes to memcpy a few megabytes, which is a reasonably typical data volume. I won&#39;t stress over getting both variants implemented right away, but I think persistent windows with a copy will be used more frequently. That variant also gives me the opportunity to do my own packing (hidden from the user).</div>
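<div>A rough sketch of the arithmetic behind that comparison. The setup times are the 1000 to 10000 us estimate above; the memcpy bandwidths (1 and 4 GB/s) are assumed placeholders rather than measurements, and break_even_bytes is a hypothetical helper name.</div>

```python
def break_even_bytes(setup_us, memcpy_gb_per_s):
    """Bytes that memcpy can move in the time one window setup takes."""
    seconds = setup_us * 1e-6
    return seconds * memcpy_gb_per_s * 1e9


# Setup costs come from the estimate above; the bandwidths are assumptions.
for setup_us in (1000, 10000):
    for bandwidth in (1.0, 4.0):
        megabytes = break_even_bytes(setup_us, bandwidth) / 1e6
        print(f"{setup_us:6d} us at {bandwidth:.0f} GB/s: {megabytes:.0f} MB")
```

<div>Under these assumptions, payloads smaller than the break-even size cost less to copy into a persistent window than one window setup would, which is one way to guess where the threshold sits.</div>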

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":2ei">

I&#39;m sure that someone at ALCF could point us towards some more useful data for BG/P and BG/Q.  They would also have hard numbers on memcpy performance.<br></div></blockquote><div><br></div><div>Memcpy numbers are easy to come by because memcpy is essentially the STREAM copy kernel.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":2ei">

<div class="im"><br>

&gt; I&#39;ll end up implementing both versions, but it would be nice to know how urgent it is likely to be and have a guess for where to place the threshold.<br>

<br>

</div>I think it will be hard for us to predict with any certainty and will depend substantially on the relative performance between the system&#39;s network and memcpy.  A Blue Gene system will have a very different ratio than an Intel cluster, for example.</div>

</blockquote></div><br><div>Yup, but it&#39;s still nice to know which orders of magnitude to look at.</div>