[petsc-dev] PETSc and threads

Jed Brown jed at jedbrown.org
Mon Jan 19 20:38:15 CST 2015


Dave Nystrom <Dave.Nystrom at tachyonlogic.com> writes:

> When you say getting good performance with threads is hard, do you mean for
> complicated preconditioners like multigrid and incomplete factorization
> methods?  Or do you mean that it is hard to write a good cg solver with
> simple preconditioning methods like jacobi and block jacobi?

Define good.  If you mean "runs faster than well-designed MPI-only across
a range of parameters", then it's a tough challenge even for a bogus
algorithm like CG/Jacobi.

> You say the reason is because "MPI+OpenMP is a crappy programming model".
> What about MPI+Pthreads?

Same issues, but at least it's idiomatic to use lower-level primitives
instead of the crappy ones OpenMP provides.  The message
packing/unpacking problems remain -- either poor bandwidth, poor
latency, or both -- unless you abandon the idea that the parallelism is
contained within the public API in favor of calling the library
thread-collectively, a programming model that no other libraries use.

>  > I cite HPGMG-FV as an example because Sam understands hardware well and
>  > conceived that code from the ground up for threads, yet it executes
>  > faster with MPI on most machines at all problem sizes.
>  > 
>  > I posit that most examples of threads making a PDE solver faster are due
>  > to poor use of MPI, poor choice of algorithm, or contrived
>  > configuration.  I want to make the science and engineering that matters
>  > faster, not check a box saying that we "do threads".
>
> Well, so do I - to your last sentence.  But is it really possible to run 300+
> MPI ranks on a single node as efficiently as running a single rank on a node
> plus 300+ threads - where the threads are pthreads or perhaps a special
> lightweight thread?  That is an honest question, not a rhetorical one because
> I don't really know how lightweight a vendor could make MPI ranks on a node
> versus threads on a node.  And it seems that the number of threads or MPI
> processes per node is going to continue to get larger and larger as the march
> to exascale continues.  A lot of people seem to think we will need MPI+X with
> MPI just being used between nodes.  

A lot of people say whatever makes their product sell (be it a research
program, hardware, or software).  That doesn't make it correct, or even
a good prediction of what will happen.  Current NICs have hardware support for a
number of software contexts.  Usually those contexts are message queues
of some sort that the hardware polls, so software does not have to lock
or otherwise serialize to those contexts.  Current implementations
typically have one NIC context per MPI process.  I think it would be
difficult to have a semantically correct MPI implementation that uses
multiple NIC contexts per process.  Anyway, if you only use one context
per node, you have to serialize all "threads" in software.  That's slow
as hell, especially on the new throughput architectures that do
everything badly, but are even worse at synchronization latency.
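
To make the serialization point concrete, here is a minimal sketch (my
illustration, not PETSc code) of threads driving MPI under
MPI_THREAD_MULTIPLE; with one NIC context per process, every one of
these calls funnels through whatever lock or queue protects that
context inside the implementation:

  #include <mpi.h>
  #include <omp.h>

  /* Run with two MPI ranks; each OpenMP thread exchanges one message
     with the same-numbered thread on the peer rank. */
  int main(int argc, char **argv)
  {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
      int tag = omp_get_thread_num(), peer = 1 - rank;
      double send = tag, recv;
      MPI_Request req[2];
      /* With one hardware context per process, these concurrent calls
         are serialized in software inside the MPI library. */
      MPI_Irecv(&recv, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req[0]);
      MPI_Isend(&send, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    MPI_Finalize();
    return 0;
  }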

So what else can be done to cut latency further or improve bandwidth?
Over-decomposition (one subdomain per hardware thread) results in more
messages or the need to coalesce.  By using fewer processes, we could
have threads work together to pack coalesced, deduplicated buffers.
That would mean fewer messages and less bandwidth, so maybe it's a good
idea.  But packing represents a sizable fraction of messaging cost, so
doing it in serial is non-scalable, and if you use omp parallel to pack,
you've just incurred a latency cost much larger than the MPI messaging
latency.

  http://mid.mail-archive.com/87oaq7ech3.fsf@jedbrown.org
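
Concretely, the pack is just a gather (illustrative names below, not
PETSc's VecScatter internals), and the two obvious ways to do it both
lose:

  /* Serial pack: one thread does every copy, a non-scalable fraction
     of each exchange. */
  void pack_serial(int nghost, const int *ghost_idx,
                   const double *x, double *sendbuf)
  {
    for (int i = 0; i < nghost; i++) sendbuf[i] = x[ghost_idx[i]];
  }

  /* Threaded pack: the copies run in parallel, but the fork/join of the
     parallel region costs more than the MPI message latency it was
     supposed to hide. */
  void pack_omp(int nghost, const int *ghost_idx,
                const double *x, double *sendbuf)
  {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nghost; i++) sendbuf[i] = x[ghost_idx[i]];
  }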

If all you want is message coalescing, it's possible to do scalably in
software with neighborhood collectives.  The implementations don't do
this now, but they could at similar cost to the best generic threaded
implementations.  Deduplication would also be possible with "w" versions
and persistent neighborhood collectives (not currently part of the
standard), but the analysis is nontrivial.
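
For a sketch of what that coalescing looks like through the existing
interface (MPI-3 neighborhood collectives; the neighbor lists and counts
are assumed to come from the mesh decomposition, and in practice the
graph communicator would be created once, not per exchange):

  #include <mpi.h>

  void coalesced_ghost_exchange(MPI_Comm comm, int nneigh,
                                const int *neighbors,
                                const double *sendbuf, const int *sendcounts,
                                const int *sdispls, double *recvbuf,
                                const int *recvcounts, const int *rdispls)
  {
    MPI_Comm nbrcomm;
    MPI_Dist_graph_create_adjacent(comm, nneigh, neighbors, MPI_UNWEIGHTED,
                                   nneigh, neighbors, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nbrcomm);
    /* Everything destined for a given neighbor sits contiguously in
       sendbuf, so each neighbor gets exactly one message regardless of
       how many fields or subdomains it represents. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE, nbrcomm);
    MPI_Comm_free(&nbrcomm);
  }

The "w" variant (MPI_Neighbor_alltoallw) is where deduplication via
datatypes would come in, and that is the analysis that is nontrivial.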

So you want threads to pack deduplicated coalesced buffers in parallel,
but you can't afford to use omp parallel or omp barrier, so you need
thread-collective interfaces with fine-grained coordination that does
not use the crude OpenMP primitives.  But nobody writes libraries that
way and it's not really what you're asking for.
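
For what it's worth, "fine-grained coordination" could look something
like this sketch (C11 atomics and persistent threads; none of it is an
existing PETSc or vendor interface): already-running threads claim
chunks of the pack with one atomic increment each, so no parallel region
and no barrier is paid per exchange.

  #include <stdatomic.h>

  enum {CHUNK = 256};

  typedef struct {
    atomic_int    next;       /* next unclaimed chunk */
    int           nghost;
    const int    *ghost_idx;
    const double *x;
    double       *sendbuf;
  } PackCtx;

  /* Called by every (already running) thread; returns when the caller
     finds no more chunks to claim.  Completion detection, knowing when
     all chunks are packed, still needs a lightweight count or flag. */
  void pack_collective(PackCtx *ctx)
  {
    for (;;) {
      int c  = atomic_fetch_add_explicit(&ctx->next, 1, memory_order_relaxed);
      int lo = c * CHUNK;
      if (lo >= ctx->nghost) return;
      int hi = lo + CHUNK < ctx->nghost ? lo + CHUNK : ctx->nghost;
      for (int i = lo; i < hi; i++)
        ctx->sendbuf[i] = ctx->x[ctx->ghost_idx[i]];
    }
  }

Even then you still need completion detection before posting the send,
and every library in the stack would have to agree to be called this
way, which is exactly what nobody does.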

Private is a better default.