[petsc-dev] PETSc and threads

Jed Brown jed at jedbrown.org
Fri Jan 9 20:21:02 CST 2015


Barry Smith <bsmith at mcs.anl.gov> writes:

>      Just say: "I support the pure MPI model on multicore systems
>      including KNL" if that is the case or say what you do support; 

I support that (with neighborhood collectives and perhaps
MPI_Win_allocate_shared) if they provide a decent MPI implementation.  I
have yet to see a performance model showing why this can't perform at
least as well as any MPI+thread combination.
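
For concreteness, a halo exchange in that model could look roughly like
the sketch below (not PETSc code; the 1-D periodic decomposition and the
buffer names are just illustrative): the communication pattern is
declared once as an adjacent distributed graph and the exchange itself
is a single neighborhood collective.

  #include <mpi.h>

  /* Sketch only: exchange one boundary value with each of two neighbors
   * in a 1-D periodic decomposition using a neighborhood collective. */
  void halo_exchange_1d(MPI_Comm comm, double own_left, double own_right,
                        double *ghost_left, double *ghost_right)
  {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int nbrs[2] = {(rank - 1 + size) % size, (rank + 1) % size};

    /* Declare the pattern up front; reorder=0 keeps ranks in place. */
    MPI_Comm gcomm;
    MPI_Dist_graph_create_adjacent(comm, 2, nbrs, MPI_UNWEIGHTED,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &gcomm);

    double sendbuf[2] = {own_left, own_right};
    double recvbuf[2];
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_DOUBLE,
                          recvbuf, 1, MPI_DOUBLE, gcomm);

    *ghost_left  = recvbuf[0];  /* from left neighbor  */
    *ghost_right = recvbuf[1];  /* from right neighbor */
    MPI_Comm_free(&gcomm);
  }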

The threads might be easier for some existing applications to use.  That
could be important enough to justify work on threading, but it doesn't
mean we should *advocate* threading.

>    Now what about "hardware threads" and pure MPI? Since Intel HSW
>    seems to have 2 (or more?) hardware threads per core should there
>    be 2 MPI process per core to utilize them both? Should the "extra"
>    hardware threads be ignored by us? (Maybe MPI implementation can
>    utilize them)? Or should we use two threads per MPI process (and
>    one MPI process per core) to utilize them? Or something else?

Hard to say.  Even for embarrassingly parallel operations, using
multiple threads per core is not a slam dunk because you slice up all
your caches.  The main benefit of hardware threads is that you get more
registers and can cover more latency from poor prefetch.  Sharing cache
between coordinated hardware threads is exotic and special-purpose, but
a good last-step optimization.  Can it be done nearly as well with
MPI_Win_allocate_shared?  Maybe; that has not been tested.
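
If someone wanted to test it, the allocation side is straightforward (a
sketch under the assumption that the launcher binds two ranks to the
hardware threads of one core; alloc_shared and nlocal are illustrative
names, not existing PETSc interfaces):

  #include <mpi.h>

  /* Sketch: ranks that can share memory get one window; each rank owns
   * an nlocal-double segment, which the implementation may lay out
   * contiguously across ranks. */
  double *alloc_shared(MPI_Comm comm, MPI_Aint nlocal,
                       MPI_Comm *nodecomm, MPI_Win *win)
  {
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, nodecomm);
    double *base;
    MPI_Win_allocate_shared(nlocal * (MPI_Aint)sizeof(double), sizeof(double),
                            MPI_INFO_NULL, *nodecomm, &base, win);
    return base;  /* pointer to this rank's own segment */
  }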

>    Back when we were actively developing the PETSc thread stuff you
>    supported using threads because with large domains 

That doesn't matter with large domains unless you are coordinating
threads to share L1 cache.

>    due to fewer MPI processes there are (potentially) a lot less ghost
>    points needed. 

Surface-to-volume ratio is big for small subdomains.  If you already
share caches with another process/thread, it's lower overhead to access
the neighbor's data directly instead of copying it out into separate
blocks with ghost points.
This is the argument for using threads or MPI_Win_allocate_shared
between hardware threads sharing L1.  But if you don't stay coordinated,
you're actually worse off because your working set is non-contiguous and
doesn't line up with cache lines.  This will lead to erratic performance
as problem size/configuration is changed.
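
With a window like the one sketched above, the ghost copy can in
principle be skipped: a neighboring rank's segment is directly
addressable, with the usual caveat that the synchronization still has
to be right (only hinted at below; read_neighbor_boundary is an
illustrative name):

  #include <mpi.h>

  /* Sketch: read a neighbor's boundary value in place instead of
   * copying it into a ghost block.  A real code needs matching
   * synchronization on the writer's side. */
  double read_neighbor_boundary(MPI_Win win, int nbr_rank)
  {
    MPI_Aint nbr_size;
    int disp_unit;
    double *nbr_base;
    /* Base address of nbr_rank's segment in this process's address space. */
    MPI_Win_shared_query(win, nbr_rank, &nbr_size, &disp_unit, &nbr_base);

    MPI_Win_lock(MPI_LOCK_SHARED, nbr_rank, 0, win);
    double value = nbr_base[0];  /* e.g. the neighbor's first entry */
    MPI_Win_unlock(nbr_rank, win);
    return value;
  }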

To my knowledge, the vendors have not provided super low-overhead
primitives for synchronizing between hardware threads that share a core.
So for example, you still need memory fences to keep stores from being
reordered with subsequent loads, and those fences get more expensive as
the number of cores in the system goes up.  John Gunnels coordinates
threads in the BG/Q HPL code using cooperative prefetch.  That is
basically a side-channel technique; it is non-portable, and if
everything doesn't match up perfectly, you silently get bad performance.
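
For reference, the kind of fence meant above, in generic C11 rather than
any vendor-specific primitive: without the seq_cst fence the store can
be reordered with the subsequent load, which breaks exactly the
flag-based handshakes that coordinated hardware threads would rely on.

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Generic C11 sketch of a store->load ordering point between two
   * threads (e.g. two hardware threads on one core).  The seq_cst
   * fence is what provides the ordering, and it is also the expensive
   * part. */
  atomic_bool my_flag, their_flag;

  bool handshake(void)
  {
    atomic_store_explicit(&my_flag, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* store-load fence */
    return atomic_load_explicit(&their_flag, memory_order_relaxed);
  }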

Once again, shared-nothing looks like a good default.