[petsc-dev] PETSc and threads

Matthew Knepley knepley at gmail.com
Wed Mar 4 07:04:32 CST 2015


On Wed, Mar 4, 2015 at 1:17 AM, Richard Mills <rtm at utk.edu> wrote:

> Resurrecting old thread here:
>
> I realize that I haven't contributed any code to PETSc in about 1.5 years,
> and this makes me sad, *especially* with PETSc's 20th birthday coming up.
> If it is not going to step on anyone's toes, I'd like to start doing some
> of the implementation work that Barry outlines below (and try to learn
> something about MPI 3 in the process, of which I've read some and coded
> none).  Any suggestions on where I should start, guys?  Maybe VecScatter
> because it already has support for a bunch of different back-ends: start
> adding -vecscatter_neighbor_alltoall?
>

With respect to this, what about:

  1) Making a PetscSF implementation of VecScatter

  2) Making a neighbor collective implementation for PetscSF

  3) Some scalability tests, runnable by anyone, comparing the normal
VecScatter to the alternatives
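
For 2), the core of the neighbor-collective path might look roughly like
the sketch below. This is not the actual PetscSF API; the neighbor lists,
counts, and displacements stand in for whatever pattern information
PetscSF already computes during setup.

  #include <mpi.h>

  /* Sketch: exchange ghost values using MPI 3 neighborhood collectives.
     nfrom/from and nto/to describe incoming/outgoing neighbor ranks;
     sendcounts/sdispls and recvcounts/rdispls partition the buffers. */
  static void NeighborScatter(MPI_Comm comm,
                              int nfrom, const int *from,
                              int nto, const int *to,
                              const double *sendbuf,
                              const int *sendcounts, const int *sdispls,
                              double *recvbuf,
                              const int *recvcounts, const int *rdispls)
  {
    MPI_Comm nbrcomm;

    /* Encode the fixed communication graph in a communicator; the MPI
       implementation can use this to optimize message scheduling. */
    MPI_Dist_graph_create_adjacent(comm,
                                   nfrom, from, MPI_UNWEIGHTED,
                                   nto, to, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nbrcomm);

    /* One call replaces the usual MPI_Isend/MPI_Irecv/MPI_Waitall loop. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           nbrcomm);

    MPI_Comm_free(&nbrcomm);
  }

In a real implementation the graph communicator would of course be created
once in the scatter setup phase and reused for every scatter, not rebuilt
per call as in this sketch.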

   Thanks,

     Matt


> --Richard
>
> On Fri, Jan 9, 2015 at 7:11 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>> [...]
>>
>>
>>    Our plan to improve PETSc SOLVERS performance for KNL systems is to
>> provide implementations of VecScatter, PetscSF, VecAssembly communication,
>> and MatAssembly communication using MPI 3 neighborhood collectives. This
>> will also improve PETSc performance on Intel HSW systems. Thus PETSc
>> application codes and frameworks such as MOOSE would port to KNL systems
>> without major, or even minor, rewrites by the application developers.
>>
>>   The use of MPI 3 neighborhood collectives will allow PETSc to take
>> advantage of the local memories available in the KNL and will avoid the
>> large overhead of OpenMP parallel region launches ... blah blah blah, with
>> specific data about Intel not doing better than MPI-only and timings for
>> OpenMP regions ...
>>
>>    Make sense? Or do I still not understand what you are saying?
>>
>>   Barry
>>
>> Say they don't implement the neighborhood collectives, and perhaps
>> MPI_Win_allocate_shared, properly; in fact, say they don't implement them
>> at all, or only with terrible performance. What about using the
>> "lightweight" MPI model of the RPI guys? Have a single MPI process per
>> node, but spawn all the threads at the beginning and have each of these
>> threads run like a "lightweight" MPI process from the user perspective,
>> able to make (a subset of) MPI calls (so PETSc code and application code
>> would look pretty much the same as today, except that we add use of
>> neighborhood collectives in VecScatter() and PetscSF, etc.)? Could the RPI
>> guys' code be a starting point for this, or would it just be too damn hard
>> for an outside group to do it properly on KNL? (Essentially this means
>> writing ourselves, with others, the parts of MPI 3 we need to work well on
>> such systems.)
>>
>>
>> > I
>> > have yet to see a performance model showing why this can't perform at
>> > least as well as any MPI+thread combination.
>> >
>> > The threads might be easier for some existing applications to use.  That
>> > could be important enough to justify work on threading, but it doesn't
>> > mean we should *advocate* threading.
>> >
>> >>   Now what about "hardware threads" and pure MPI? Since Intel HSW
>> >>   seems to have 2 (or more?) hardware threads per core, should there
>> >>   be 2 MPI processes per core to utilize them both? Should the "extra"
>> >>   hardware threads be ignored by us? (Maybe the MPI implementation can
>> >>   utilize them?) Or should we use two threads per MPI process (and
>> >>   one MPI process per core) to utilize them? Or something else?
>> >
>> > Hard to say.  Even for embarrassingly parallel operations, using
>> > multiple threads per core is not a slam dunk because you slice up all
>> > your caches.  The main benefit of hardware threads is that you get more
>> > registers and can cover more latency from poor prefetch.  Sharing cache
>> > between coordinated hardware threads is exotic and special-purpose, but
>> > a good last-step optimization.  Can it be done nearly as well with
>> > MPI_Win_allocate_shared?  Maybe; that has not been tested.
>> >
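
For anyone who wants to test that, here is a minimal sketch of the
MPI_Win_allocate_shared approach (names and sizes are illustrative): split
the communicator by node, allocate one shared segment, and query a direct
pointer into another rank's slice.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    MPI_Comm  nodecomm;
    MPI_Win   win;
    double   *mine, *theirs;
    MPI_Aint  qsize;
    int       qdisp, nrank;

    MPI_Init(&argc, &argv);

    /* Group the ranks that share physical memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &nrank);

    /* Each rank contributes a slice of one node-wide shared segment. */
    MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mine, &win);

    /* Passive-target epoch so plain loads/stores are legal. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    mine[0] = (double)nrank;

    /* Direct pointer into rank 0's slice; loads and stores through it
       touch that memory with no copy -- the "access it directly instead
       of copying out ghosts" case. */
    MPI_Win_shared_query(win, 0, &qsize, &qdisp, &theirs);

    MPI_Win_sync(win);        /* make my store visible */
    MPI_Barrier(nodecomm);    /* everyone has written */
    MPI_Win_sync(win);        /* make others' stores visible here */
    /* theirs[0] now holds rank 0's value. */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
  }
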
>> >>   Back when we were actively developing the PETSc thread stuff you
>> >>   supported using threads because with large domains
>> >
>> > Doesn't matter with large domains unless you are coordinating threads to
>> > share L1 cache.
>> >
>> >>   due to fewer MPI processes there are (potentially) a lot less ghost
>> >>   points needed.
>> >
>> > The surface-to-volume ratio is big for small subdomains.  If you already
>> > share caches with another process/thread, it's lower overhead to access
>> > the shared data directly instead of copying out into separate blocks
>> > with ghosts.
>> > This is the argument for using threads or MPI_Win_allocate_shared
>> > between hardware threads sharing L1.  But if you don't stay coordinated,
>> > you're actually worse off because your working set is non-contiguous and
>> > doesn't line up with cache lines.  This will lead to erratic performance
>> > as problem size/configuration is changed.
>> >
>> > To my knowledge, the vendors have not provided super low-overhead
>> > primitives for synchronizing between hardware threads that share a core.
>> > So, for example, you still need memory fences to prevent stores from
>> > being reordered after loads.  But memory fences get more expensive as
>> > the number of cores on the system goes up.  John Gunnels coordinates
>> > threads in BG/Q HPL using cooperative prefetch.  That is basically a
>> > side-channel technique that is non-portable, and if everything doesn't
>> > match up perfectly, you silently get bad performance.
>> >
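
The store-to-load case described above is the textbook one. A hypothetical
two-thread handoff in C11 atomics (not PETSc code) shows exactly where the
full fence is needed:

  #include <stdatomic.h>

  static atomic_int x, y;   /* both initially 0 */
  static int r0, r1;

  /* Without the fences the hardware may let each load run ahead of the
     (relaxed) store, so both threads can observe 0.  The seq_cst fence
     forbids that store->load reordering -- and it is exactly this full
     fence whose cost grows with the number of cores. */

  void thread0(void)
  {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
  }

  void thread1(void)
  {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
  }

  /* With the fences, at least one of r0, r1 must end up 1. */
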
>> > Once again, shared-nothing looks like a good default.
>>
>>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener