<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Jan 9, 2015 at 8:21 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> writes:<br>

<br>

>      Just say: "I support the pure MPI model on multicore systems<br>

>      including KNL" if that is the case or say what you do support;<br>

<br>

</span>I support that (with neighborhood collectives and perhaps<br>

MPI_Win_allocate_shared) if they provide a decent MPI implementation.  I<br>

have yet to see a performance model showing why this can't perform at<br>

least as well as any MPI+thread combination.<br></blockquote><div><br></div><div>I think what I liked about MPI is that there seemed to be a modicum of competition.</div><div>Isn't there someone in the MPI universe with a good neighborhood collective</div><div>impl? If we can show good perf with some MPI version, people will switch, or</div><div>vendors will be shamed into submission.</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The threads might be easier for some existing applications to use.  That<br>

could be important enough to justify work on threading, but it doesn't<br>

mean we should *advocate* threading.<br>

<span class=""><br>

>    Now what about "hardware threads" and pure MPI? Since Intel HSW<br>

>    seems to have 2 (or more?) hardware threads per core should there<br>

>    be 2 MPI processes per core to utilize them both? Should the "extra"<br>

>    hardware threads be ignored by us? (Maybe MPI implementation can<br>

>    utilize them)? Or should we use two threads per MPI process (and<br>

>    one MPI process per core) to utilize them? Or something else?<br>

<br>

</span>Hard to say.  Even for embarrassingly parallel operations, using<br>

multiple threads per core is not a slam dunk because you slice up all<br>

your caches.  The main benefit of hardware threads is that you get more<br>

registers and can cover more latency from poor prefetch.  Sharing cache<br>

between coordinated hardware threads is exotic and special-purpose, but<br>

a good last-step optimization.  Can it be done nearly as well with<br>

MPI_Win_allocate_shared?  Maybe; that has not been tested.<br>

<span class=""><br>

>    Back when we were actively developing the PETSc thread stuff you<br>

>    supported using threads because with large domains<br>

<br>

</span>Doesn't matter with large domains unless you are coordinating threads to<br>

share L1 cache.<br>

<span class=""><br>

>    due to fewer MPI processes there are (potentially) a lot less ghost<br>

>    points needed.<br>

<br>

</span>Surface-to-volume ratio is big for small subdomains.  If you already<br>

share caches with another process/thread, it's lower overhead to access<br>

it directly instead of copying out into separate blocks with ghosts.<br>

This is the argument for using threads or MPI_Win_allocate_shared<br>

between hardware threads sharing L1.  But if you don't stay coordinated,<br>

you're actually worse off because your working set is non-contiguous and<br>

doesn't line up with cache lines.  This will lead to erratic performance<br>

as problem size/configuration is changed.<br>
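
<br>

A back-of-the-envelope sketch of the surface-to-volume point above.  The<br>

assumptions here are illustrative and not from this thread: a cubic global<br>

grid split into equal cubic subdomains, a halo one point wide, and periodic<br>

boundaries so every subdomain carries a full halo.  The function names are<br>

made up for the example.<br>

```python
def ghost_fraction(n, width=1):
    """Ghost points per owned point for an n x n x n subdomain.

    Assumes a halo of `width` points on every face, edge, and corner.
    """
    owned = n ** 3
    with_ghosts = (n + 2 * width) ** 3
    return (with_ghosts - owned) / owned


def total_ghosts(global_n, parts_per_dim, width=1):
    """Total ghost points over all subdomains of a global_n^3 grid.

    The grid is cut into parts_per_dim^3 equal cubes; global_n is assumed
    to be divisible by parts_per_dim.
    """
    n = global_n // parts_per_dim
    per_subdomain = (n + 2 * width) ** 3 - n ** 3
    return per_subdomain * parts_per_dim ** 3


if __name__ == "__main__":
    # Same 128^3 problem, cut into more and more (smaller) subdomains.
    for parts in (2, 4, 8, 16):
        n = 128 // parts
        print(
            f"{parts ** 3:5d} subdomains of {n}^3: "
            f"{total_ghosts(128, parts):8d} ghosts, "
            f"{ghost_fraction(n):.3f} ghosts per owned point"
        )
```

Under those assumptions, cutting a 128^3 grid into 512 subdomains instead of<br>

64 takes the total ghost count from 418304 to 888832, and the ghost-to-owned<br>

ratio from about 0.20 to about 0.42.  It says nothing about whether the<br>

copies are cheaper or dearer than uncoordinated direct access.<br>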

<br>

To my knowledge, the vendors have not provided super low-overhead<br>

primitives for synchronizing between hardware threads that share a core.<br>

So for example, you still need memory fences to prevent stores from<br>

being reordered after loads.  But memory fences get more expensive as the<br>

number of cores on the system goes up.  John Gunnels coordinates threads<br>

in BGQ-HPL using cooperative prefetch.  That is basically a side-channel<br>

technique that is non-portable and if everything doesn't match up<br>

perfectly, you silently get bad performance.<br>

<br>

Once again, shared-nothing looks like a good default.<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>