[petsc-dev] PETSc and threads

Matthew Knepley knepley at gmail.com
Fri Jan 9 21:57:08 CST 2015


On Fri, Jan 9, 2015 at 8:21 PM, Jed Brown <jed at jedbrown.org> wrote:

> Barry Smith <bsmith at mcs.anl.gov> writes:
>
> >      Just say: "I support the pure MPI model on multicore systems
> >      including KNL" if that is the case or say what you do support;
>
> I support that (with neighborhood collectives and perhaps
> MPI_Win_allocate_shared) if they provide a decent MPI implementation.  I
> have yet to see a performance model showing why this can't perform at
> least as well as any MPI+thread combination.
>
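
A minimal sketch of the kind of neighborhood-collective ghost exchange being
discussed here.  It is only an illustration, not PETSc code: the 1-D periodic
neighbor lists, the one-value-per-neighbor buffers, and the communicator name
are placeholders.

/* Sketch: ghost exchange with a neighborhood collective (illustrative only). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1-D periodic decomposition: each rank exchanges with left and right. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };

    MPI_Comm nbrcomm;   /* graph communicator encoding who talks to whom */
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,  /* sources      */
                                   2, nbrs, MPI_UNWEIGHTED,  /* destinations */
                                   MPI_INFO_NULL, 0, &nbrcomm);

    double sendghost[2] = { (double)rank, (double)rank };
    double recvghost[2] = { -1.0, -1.0 };

    /* One call moves all ghost values for this rank's neighborhood. */
    MPI_Neighbor_alltoall(sendghost, 1, MPI_DOUBLE,
                          recvghost, 1, MPI_DOUBLE, nbrcomm);

    MPI_Comm_free(&nbrcomm);
    MPI_Finalize();
    return 0;
}

Because a single MPI_Neighbor_alltoall call hands the whole exchange pattern
to the library, a good implementation is free to map it onto shared memory
on-node or onto the network hardware, which is where the hoped-for
performance would come from.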

I think what I liked about MPI is that there seemed to be a modicum of
competition. Isn't there someone in the MPI universe with a good
neighborhood collective implementation? If we can show good performance
with some MPI version, people will switch, or vendors will be shamed
into submission.

   Matt


> The threads might be easier for some existing applications to use.  That
> could be important enough to justify work on threading, but it doesn't
> mean we should *advocate* threading.
>
> >    Now what about "hardware threads" and pure MPI? Since Intel HSW
> >    seems to have 2 (or more?) hardware threads per core, should there
> >    be 2 MPI processes per core to utilize them both? Should the "extra"
> >    hardware threads be ignored by us? (Maybe the MPI implementation can
> >    utilize them?) Or should we use two threads per MPI process (and
> >    one MPI process per core) to utilize them? Or something else?
>
> Hard to say.  Even for embarrassingly parallel operations, using
> multiple threads per core is not a slam dunk because you slice up all
> your caches.  The main benefit of hardware threads is that you get more
> registers and can cover more latency from poor prefetch.  Sharing cache
> between coordinated hardware threads is exotic and special-purpose, but
> a good last-step optimization.  Can it be done nearly as well with
> MPI_Win_allocate_shared?  Maybe; that has not been tested.
>
> >    Back when we were actively developing the PETSc thread stuff you
> >    supported using threads because with large domains
>
> Doesn't matter with large domains unless you are coordinating threads to
> share L1 cache.
>
> >    due to fewer MPI processes there are (potentially) a lot fewer ghost
> >    points needed.
>
> Surface-to-volume ratio is big for small subdomains.  If you already
> share caches with another process/thread, it's lower overhead to access
> it directly instead of copying out into separate blocks with ghosts.
> This is the argument for using threads or MPI_Win_allocate_shared
> between hardware threads sharing L1.  But if you don't stay coordinated,
> you're actually worse off because your working set is non-contiguous and
> doesn't line up with cache lines.  This will lead to erratic performance
> as problem size/configuration is changed.
>
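
A rough sketch of the direct-access alternative described above, under the
assumption of one flat array per rank; the array size and the node
communicator are placeholders.  Ranks on the same node put their subdomain
data in one shared window and read a neighbor's boundary entry in place
through MPI_Win_shared_query instead of copying it into a ghost buffer.

/* Sketch: on-node neighbors read each other's boundary values directly
   from a shared window instead of copying ghost points (illustrative only). */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024   /* placeholder subdomain size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of the ranks that can share memory (one per node). */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int nrank;
    MPI_Comm_rank(nodecomm, &nrank);

    /* Each rank's subdomain lives in one window shared by the whole node. */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(NLOCAL * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    for (int i = 0; i < NLOCAL; i++) mine[i] = nrank;
    MPI_Win_sync(win);        /* publish local stores */
    MPI_Barrier(nodecomm);    /* everyone has filled their piece */
    MPI_Win_sync(win);        /* see the neighbors' stores */

    /* Read the left neighbor's last entry in place: no ghost copy. */
    if (nrank > 0) {
        MPI_Aint sz; int disp; double *left;
        MPI_Win_shared_query(win, nrank - 1, &sz, &disp, &left);
        printf("rank %d sees left boundary value %g\n", nrank, left[NLOCAL - 1]);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}

Whether this kind of in-place access actually beats copying into ghost
buffers is exactly the untested question raised above; it only pays off if
the accesses stay coordinated with the cache layout.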
> To my knowledge, the vendors have not provided super low-overhead
> primitives for synchronizing between hardware threads that share a core.
> So, for example, you still need memory fences to prevent stores from
> being reordered after subsequent loads.  But memory fences are expensive
> as the number of cores on the system goes up.  John Gunnels coordinates
> threads in BG/Q HPL using cooperative prefetch.  That is basically a
> side-channel technique that is non-portable, and if everything doesn't
> match up perfectly, you silently get bad performance.
>
> Once again, shared-nothing looks like a good default.
>
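
To make the fence point concrete, here is a bare-bones C11 illustration (a
deliberately simplified sketch, not code from this thread): two threads each
store a flag and then load the other's.  Without the full fence, the store
may be reordered after the load and both threads can observe 0; it is this
kind of fence whose cost grows with the size of the machine.

/* Sketch: the store-load reordering that a full memory fence prevents
   (C11 atomics + pthreads, illustrative only). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int flag_a = 0, flag_b = 0;
int saw_a, saw_b;

void *thread_a(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    /* Without this fence the store above may be reordered after the
       load below, and both threads can end up seeing 0. */
    atomic_thread_fence(memory_order_seq_cst);
    saw_b = atomic_load_explicit(&flag_b, memory_order_relaxed);
    return NULL;
}

void *thread_b(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    saw_a = atomic_load_explicit(&flag_a, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    /* With the fences, at least one thread must see the other's store. */
    printf("saw_a=%d saw_b=%d\n", saw_a, saw_b);
    return 0;
}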



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener