[petsc-dev] PETSc and threads

Jed Brown jed at jedbrown.org
Fri Jan 9 14:31:53 CST 2015


Barry Smith <bsmith at mcs.anl.gov> writes:

>   Whenever one speaks about MPI+threads there is a question of how many threads t per MPI process one is talking about. (An equivalent way of stating this is how many MPI processes one has per node.)
>
>    Is t 2?
>
>    Is t 4?
>
>    Is it the number of hardware threads on a "single memory socket"?
>
>    Is it the number of hardware threads on a "CPU"?
>
>    Is it the number of hardware threads on the entire node?
>
>    Depending on this, one has very different challenges for getting the best performance.

True, but these are all terrible.  We prefer to express hierarchical
solvers as multi-level methods because 2-level methods tend to have
relatively narrow performance sweet spots.  As you scale the problem
size or machine size, you get bitten by either the nonscalable subdomain
solver or the nonscalable coarse solver.  MPI+threads is in the same
boat [1], balancing two parts that don't scale properly.  For a particular problem
at a particular scale, it can be better than either on its own, but it's
not a general solution and surely not worth writing code just to delay
the inevitable realization that we need something actually scalable.

So what is actually scalable?  Threading an application at a coarse
level and using MPI_THREAD_MULTIPLE has all the same competition
problems as processes and is a lot more error-prone, so let's move on.

Packing buffers is a large fraction of communication time, so doing it
sequentially is bogus.  Since the MPI stack only has access to the
calling thread, relying on MPI_THREAD_FUNNELED and datatypes is not viable.  That
means we need to recruit threads to do the packing in user space.  The
software typically looks something like this:

  loop:
    omp parallel:
      unpack recv buffers
    local_function_that_uses_omp_parallel()
    omp parallel:
      pack send buffers
    MPI sends and receives

Now we have at least three "omp parallel" regions inside the loop and
each of those is staggeringly expensive.  For example, "omp parallel" on
Xeon Phi costs about 20k cycles, which is similar to MPI_Allreduce on a
million cores of BG/Q or Cray Aries.  If we don't want to increase
messaging latency by such a huge factor, we end up with something like
this:

omp parallel:
  loop:
    omp_barrier()
    unpack my recv buffers
    thread_collective_function_called_inside_omp_parallel()
    pack my part of send buffers
    omp_barrier()
    if omp_get_thread_num() == 0:
      MPI sends and receives

but this requires a concept of thread-collective local functions.  If we
want to get decent strong scalability using threads, I think this is the
only choice, but nobody is building scientific software this way so few
people would hail it as the threaded interface they had in mind.
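
To make that pattern concrete, here is a minimal C sketch of the second
structure above.  It assumes MPI initialized with MPI_THREAD_FUNNELED, a
single exchange partner, flat send/recv buffers, and a hypothetical
thread_collective_kernel(); none of these names are an existing or
proposed PETSc interface.

  #include <mpi.h>
  #include <omp.h>

  /* Hypothetical thread-collective kernel: every thread of the team enters
     it and updates its own slice of the owned entries of u. */
  void thread_collective_kernel(double *u, int nowned);

  /* One subdomain's time loop.  'partner', the flat buffers, and the pack
     layout are illustrative; assumes MPI_Init_thread(..., MPI_THREAD_FUNNELED,
     ...) and ghost values for the first iteration already in recvbuf. */
  void halo_loop(double *u, int nowned, int nghost,
                 double *sendbuf, double *recvbuf,
                 int partner, MPI_Comm comm, int niters)
  {
    #pragma omp parallel
    {
      for (int it = 0; it < niters; it++) {
        #pragma omp barrier              /* previous iteration's messages have arrived */
        #pragma omp for                  /* each thread unpacks a share of the ghosts */
        for (int i = 0; i < nghost; i++) u[nowned + i] = recvbuf[i];

        thread_collective_kernel(u, nowned);   /* called inside omp parallel */

        #pragma omp for                  /* implicit barrier: packing done before MPI */
        for (int i = 0; i < nghost; i++) sendbuf[i] = u[i];

        if (omp_get_thread_num() == 0) { /* FUNNELED: only the main thread calls MPI */
          MPI_Request req[2];
          MPI_Irecv(recvbuf, nghost, MPI_DOUBLE, partner, 0, comm, &req[0]);
          MPI_Isend(sendbuf, nghost, MPI_DOUBLE, partner, 0, comm, &req[1]);
          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
      }
    }
  }

The point is that threads never leave the single "omp parallel" region,
so each iteration costs a couple of barriers rather than three
fork/joins, but everything called from the loop must be
thread-collective.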

Now what exactly does this threaded interface offer over MPI
neighborhood collectives?  Neighborhood collectives coordinate access to
the shared resource, so it's possible to get contiguous buffers of
length N packed and unpacked in N/P+log(P) time, same as the best
possible threaded implementation.  Better, by using MPI datatypes, the
network can start moving bytes before all the packing has completed,
enabling better performance for large message sizes and avoiding the
need for huge contiguous send/recv buffers.  (Doing this with either of
the thread packing schemes above implies more complicated control flow
and more synchronization overhead.)  This approach is safer, it's easier
to debug correctness and performance, and it works equally well with no
code changes on non-coherent architectures [2].
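
For comparison, here is a rough C sketch of the neighborhood-collective
version.  The data layout (owned entries first, ghosts from each
neighbor stored contiguously after them) and all names are assumptions
for illustration, not PETSc or any proposed API; real code would also
build the graph communicator and datatypes once during setup rather
than per exchange.

  #include <stdlib.h>
  #include <mpi.h>

  /* Sketch: ghost exchange via MPI_Neighbor_alltoallw with derived
     datatypes, so the MPI library does (or avoids) the packing and can
     overlap it with network transfer. */
  void ghost_update(double *x, int nowned, MPI_Comm comm,
                    int nneigh, const int *neighbors,
                    const int *nsend, int **sendidx, const int *nrecv)
  {
    MPI_Comm      nbrcomm;
    MPI_Datatype *stype  = malloc(nneigh * sizeof *stype);
    MPI_Datatype *rtype  = malloc(nneigh * sizeof *rtype);
    int          *ones   = malloc(nneigh * sizeof *ones);
    MPI_Aint     *sdispl = malloc(nneigh * sizeof *sdispl);
    MPI_Aint     *rdispl = malloc(nneigh * sizeof *rdispl);
    MPI_Aint      offset = 0;

    /* Symmetric neighborhood: the same ranks are sources and destinations. */
    MPI_Dist_graph_create_adjacent(comm, nneigh, neighbors, MPI_UNWEIGHTED,
                                   nneigh, neighbors, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nbrcomm);

    for (int n = 0; n < nneigh; n++) {
      /* Send datatype selects the scattered owned entries neighbor n needs;
         no user-space pack loop and no "omp parallel" regions. */
      MPI_Type_create_indexed_block(nsend[n], 1, sendidx[n], MPI_DOUBLE, &stype[n]);
      MPI_Type_commit(&stype[n]);
      ones[n]   = 1;                     /* one instance of each send datatype */
      sdispl[n] = 0;                     /* datatypes index from x itself */
      /* Ghosts from neighbor n land contiguously after the owned entries,
         so plain MPI_DOUBLE with a byte displacement is enough. */
      rtype[n]  = MPI_DOUBLE;
      rdispl[n] = offset * (MPI_Aint)sizeof(double);
      offset   += nrecv[n];
    }

    /* Send and receive regions are disjoint: sends touch only x[0..nowned),
       receives only the ghost segment starting at x + nowned. */
    MPI_Neighbor_alltoallw(x,          ones,  sdispl, stype,
                           x + nowned, nrecv, rdispl, rtype, nbrcomm);

    for (int n = 0; n < nneigh; n++) MPI_Type_free(&stype[n]);
    MPI_Comm_free(&nbrcomm);
    free(stype); free(rtype); free(ones); free(sdispl); free(rdispl);
  }

Because the datatypes describe exactly which entries move, the
implementation is free to start moving bytes before all the "packing"
has happened, which is the overlap argued for above.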

If you really want to share something, you can do it via
MPI_Win_allocate_shared.  For example, some people use large stencils
and claim that copying those ghost cells is not viable in the case of small
subdomains.  So they want threads to operate on chunks of a larger
shared multi-dimensional array.  Of course this tends to be less cache
efficient because the chunks are non-contiguous, resulting in ragged
cache lines and more TLB misses.  In fact, copying chunks into
contiguous tiles is often considered a memory optimization and even done
in serial, but if we take the assertions at face value, you're still
welcome to share via MPI_Win_allocate_shared.
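
For completeness, a minimal sketch of that sharing path using only
standard MPI-3 calls.  The chunk size is made up and the fence-based
synchronization is just one of several legal ways to order load/store
access to the shared window.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    MPI_Comm  nodecomm;
    MPI_Win   win;
    MPI_Aint  localsize, qsize;
    int       rank, noderank, disp_unit;
    double   *mine, *base;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks that can share memory, i.e. one communicator per node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &noderank);

    /* Each rank contributes a chunk of one node-wide array; by default the
       chunks are laid out contiguously in a single shared segment. */
    localsize = 1000000 * (MPI_Aint)sizeof(double);   /* illustrative size */
    MPI_Win_allocate_shared(localsize, (int)sizeof(double), MPI_INFO_NULL,
                            nodecomm, &mine, &win);

    /* Base address of rank 0's chunk = start of the whole shared array. */
    MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &base);

    /* Any rank on the node can now address any part of base[], e.g. operate
       on its chunk of a larger shared multi-dimensional array.  Fences are
       one way to order the direct load/store accesses. */
    MPI_Win_fence(0, win);
    mine[0] = (double)noderank;          /* write my own chunk */
    MPI_Win_fence(0, win);
    /* base[0] now holds rank 0's value and is visible to every node rank. */

    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
  }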


MPI is certainly not perfect, but if we're going to get excited about
threading or any other programming model, we need to understand exactly
what we're giving up and what we're getting in return.

1. Are we doing it because it's the best known solution to a
   well-defined technical problem?  (What is that problem?)

2. Are we doing it because we don't like MPI as a dependency?

3. Are we doing it because vendors refuse to implement neighborhood
   collectives, shared libraries, or some other technology that is good
   when we have many processes?

4. Are we doing it because legacy applications made choices (good or
   bad; it's not our business) such that it would be easier for them to
   use PETSc if we supported threads (via some interface)?

5. Are we doing it to check a box drawn by management?


>    Are "PETSc" solvers like GAMG supposed to deliver great performance across the whole range?

It seems more likely that MPI+threads will perform worse for all problem
sizes, unless vendors make no effort to provide on-node MPI and we don't
compensate by using MPI_Win_allocate_shared (in which case performance
should be similar to the best you could expect with MPI+threads).

>    Jed seems to be hinting at a relatively small t, possibly small
>    compared to the total number of hardware threads on the node.  Is
>    this correct, Jed?  Could we assume in PETSc that t is always small
>    (and thus some of the performance challenges are gone)?

This is the crappy 2-level assumption.  If we don't want a 3-level (and
4-level, etc.) programming model, we need to make at least one level
scalable.  And at that point, why have more than one level?  One
argument is that exotic/side-channel coordination between hardware
threads sharing L1 cache would be easier if the address spaces match.
This might be true and would imply a small t, but this is an advanced
special-purpose technique that is not widely used.


[1] Assuming no coordination between processes, as in MPI_Isend/Irecv or
MPI_Put/Get.  This results in P processes competing for a shared
resource (the network device) and acquiring it sequentially.  That's a
length-P critical path and we want log(P).

[2] Relaxing coherence is one of few ways to reduce latency for on-node
data dependencies.  I'd hate to see a bunch of people write code that
would prevent such architectures from appearing.