[petsc-dev] PETSc and threads

Barry Smith bsmith at mcs.anl.gov
Fri Jan 9 19:35:36 CST 2015


   Jed,

    You are not doing a good job communicating. It now sounds like you are advocating the pure MPI model; is that the case? If so, that is fine, but you have not actually said it. All you've said is that KNL should support neighbor collectives, OpenMP parallel regions suck, x sucks, y sucks, ...

     Just say: "I support the pure MPI model on multicore systems including KNL" if that is the case or say what you do support; by process of elimination it seems you support nothing else so you must support the pure MPI model but you've wasted a lot of time by not just saying it.  The only reason I sent the mail before and the previous rounds of email is not because I am advocating the t threads per MPI process or whatever it is because I am trying to read between your lines and understand what you are advocating. It's like someone has attached a fog maker to your Mailer.

   Now what about "hardware threads" and pure MPI? Since Intel HSW seems to have 2 (or more?) hardware threads per core, should there be two MPI processes per core to utilize them both? Should the "extra" hardware threads be ignored by us (maybe the MPI implementation can utilize them)? Or should we use two threads per MPI process (and one MPI process per core) to utilize them? Or something else?

   Back when we were actively developing the PETSc thread stuff, you supported using threads because, with fewer MPI processes and hence larger domains, there are (potentially) far fewer ghost points needed. Did you change your mind about that? Or would you handle it via MPI_Win_allocate_shared?

  Barry

> On Jan 9, 2015, at 2:31 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Barry Smith <bsmith at mcs.anl.gov> writes:
> 
>>  Whenever one speaks about MPI+threads there is a question of how many threads t per MPI process one is talking about. (An equivalent way of stating this is how many MPI processes one has per node.)
>> 
>>   Is t 2?
>> 
>>   Is t 4?
>> 
>>   Is it the number of hardware threads on a "single memory socket"?
>> 
>>   Is it the number of hardware threads on a "CPU"?
>> 
>>   Is it the number of hardware threads on the entire node?
>> 
>>   Depending on this, one has very different challenges for getting the best performance.
> 
> True, but these are all terrible.  We prefer to express hierarchical
> solvers as multi-level methods because 2-level methods tend to have
> relatively narrow performance sweet spots.  As you scale the problem
> size or machine size, you get bitten by either the nonscalable subdomain
> solver or nonscalable coarse solver.  MPI+threads is in the same boat [1]
> balancing two parts that don't scale properly.  For a particular problem
> at a particular scale, it can be better than either on its own, but it's
> not a general solution and surely not worth writing code just to delay
> the inevitable realization that we need something actually scalable.
> 
> So what is actually scalable?  Threading an application at a coarse
> level and using MPI_THREAD_MULTIPLE has all the same competition
> problems as processes and is a lot more error-prone, so let's move on.
> 
> Packing buffers is a large fraction of communication time, so doing it
> sequentially is bogus.  Since the MPI stack only has access to the
> calling thread, the combination of MPI_THREAD_FUNNELED and datatypes is
> not viable.  That
> means we need to recruit threads to do the packing in user space.  The
> software typically looks something like this:
> 
>  loop:
>    omp parallel:
>      unpack recv buffers
>    local_function_that_uses_omp_parallel()
>    omp parallel:
>      pack send buffers
>    MPI sends and receives
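> 
> To make that concrete, a minimal C sketch of one such iteration might look
> like the following (the pack/unpack and compute routines are placeholders
> for application code, and persistent requests are assumed):
> 
>   #include <mpi.h>
> 
>   /* Placeholders for application code; the names are illustrative only. */
>   void unpack_recv_buffer(int i);
>   void pack_send_buffer(int i);
>   void local_function_that_uses_omp_parallel(void);
> 
>   /* One iteration of the loop above, assuming persistent requests were
>      set up earlier with MPI_Send_init/MPI_Recv_init. */
>   void iterate(int nreq, MPI_Request reqs[], int nrecv, int nsend)
>   {
>     #pragma omp parallel for                    /* parallel region #1 */
>     for (int i = 0; i < nrecv; i++) unpack_recv_buffer(i);
>     local_function_that_uses_omp_parallel();    /* more regions inside */
>     #pragma omp parallel for                    /* parallel region #2 */
>     for (int i = 0; i < nsend; i++) pack_send_buffer(i);
>     MPI_Startall(nreq, reqs);                   /* MPI sends and receives */
>     MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
>   }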
> 
> Now we have at least three "omp parallel" regions inside the loop and
> each of those is staggeringly expensive.  For example, "omp parallel" on
> Xeon Phi costs about 20k cycles, which is similar to MPI_Allreduce on a
> million cores of BG/Q or Cray Aries.  If we don't want to increase
> messaging latency by such a huge factor, we end up with something like
> this:
> 
> omp parallel:
>  loop:
>    omp_barrier()
>    unpack my recv buffers
>    thread_collective_function_called_inside_omp_parallel()
>    pack my part of send buffers
>    omp_barrier()
>    if omp_get_thread_num() == 0:
>      MPI sends and receives
> 
> but this requires a concept of thread-collective local functions.  If we
> want to get decent strong scalability using threads, I think this is the
> only choice, but nobody is building scientific software this way so few
> people would hail it as the threaded interface they had in mind.
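> 
> Concretely, a rough C sketch of that structure (placeholder routine names;
> MPI_THREAD_FUNNELED and persistent requests assumed) could look like:
> 
>   #include <mpi.h>
>   #include <omp.h>
> 
>   /* Placeholders for thread-collective application code. */
>   void unpack_my_recv_buffers(int tid);
>   void pack_my_send_buffers(int tid);
>   void thread_collective_function(int tid);
> 
>   /* The whole loop lives inside a single "omp parallel" region; only
>      thread 0 touches MPI. */
>   void run(int niter, int nreq, MPI_Request reqs[])
>   {
>     #pragma omp parallel
>     {
>       int tid = omp_get_thread_num();
>       for (int it = 0; it < niter; it++) {
>         #pragma omp barrier                 /* previous communication done */
>         unpack_my_recv_buffers(tid);
>         thread_collective_function(tid);    /* called inside the region */
>         pack_my_send_buffers(tid);
>         #pragma omp barrier                 /* all packing done */
>         if (tid == 0) {
>           MPI_Startall(nreq, reqs);
>           MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
>         }
>       }
>     }
>   }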
> 
> Now what exactly does this threaded interface offer over MPI
> neighborhood collectives?  Neighborhood collectives coordinate access to
> the shared resource, so it's possible to get contiguous buffers of
> length N packed and unpacked in N/P+log(P) time, same as the best
> possible threaded implementation.  Better, by using MPI datatypes, the
> network can start moving bytes before all the packing has completed,
> enabling better performance for large message sizes and avoiding the
> need for huge contiguous send/recv buffers.  (Doing this with either of
> the thread packing schemes above implies more complicated control flow
> and more synchronization overhead.)  This approach is safer, it's easier
> to debug correctness and performance, and it works equally well with no
> code changes on non-coherent architectures [2].
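> 
> For reference, a minimal sketch of such a neighborhood exchange (neighbor
> lists, counts, and displacements are application-specific placeholders)
> might be:
> 
>   #include <mpi.h>
> 
>   /* One ghost exchange driven entirely by the MPI library; in real code
>      the graph communicator would be created once and reused. */
>   void ghost_exchange(MPI_Comm comm, int nnbr, const int nbrs[],
>                       const double *sendbuf, const int scounts[],
>                       const int sdispls[], double *recvbuf,
>                       const int rcounts[], const int rdispls[])
>   {
>     MPI_Comm nbr_comm;
>     MPI_Dist_graph_create_adjacent(comm, nnbr, nbrs, MPI_UNWEIGHTED,
>                                    nnbr, nbrs, MPI_UNWEIGHTED,
>                                    MPI_INFO_NULL, 0, &nbr_comm);
>     /* With MPI_Neighbor_alltoallw and derived datatypes, even the explicit
>        packing into sendbuf can be pushed into the library. */
>     MPI_Neighbor_alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
>                            recvbuf, rcounts, rdispls, MPI_DOUBLE, nbr_comm);
>     MPI_Comm_free(&nbr_comm);
>   }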
> 
> If you really want to share something, you can do it via
> MPI_Win_allocate_shared.  For example, some people use large stencils
> and claim that copying those ghost cells is not viable in the case of small
> subdomains.  So they want threads to operate on chunks of a larger
> shared multi-dimensional array.  Of course this tends to be less cache
> efficient because the chunks are non-contiguous, resulting in ragged
> cache lines and more TLB misses.  In fact, copying chunks into
> contiguous tiles is often considered a memory optimization and even done
> in serial, but if we take the assertions at face value, you're still
> welcome to share via MPI_Win_allocate_shared.
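> 
> A bare-bones sketch of that sharing pattern (sizes and names illustrative
> only) would be something like:
> 
>   #include <mpi.h>
> 
>   /* Every rank on the node gets a pointer to one array owned by node-rank
>      0; access is synchronized with window sync calls or barriers. */
>   double *node_shared_array(MPI_Aint nelems, MPI_Win *win)
>   {
>     MPI_Comm nodecomm;                    /* ranks sharing this node */
>     MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>                         MPI_INFO_NULL, &nodecomm);
>     int noderank;
>     MPI_Comm_rank(nodecomm, &noderank);
>     MPI_Aint bytes = (noderank == 0) ? nelems * (MPI_Aint)sizeof(double) : 0;
>     double *base;
>     MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
>                             nodecomm, &base, win);
>     MPI_Aint qsize; int qdisp;
>     MPI_Win_shared_query(*win, 0, &qsize, &qdisp, &base); /* rank 0's pointer */
>     /* nodecomm is kept for the lifetime of the window in this sketch. */
>     return base;
>   }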
> 
> 
> MPI is certainly not perfect, but if we're going to get excited about
> threading or any other programming model, we need to understand exactly
> what we're giving up and what we're getting in return.
> 
> 1. Are we doing it because it's the best known solution to a
>   well-defined technical problem?  (What is that problem?)
> 
> 2. Are we doing it because we don't like MPI as a dependency?
> 
> 3. Are we doing it because vendors refuse to implement neighborhood
>   collectives, shared libraries, or some other technology that is good
>   when we have many processes?
> 
> 4. Are we doing it because legacy applications made choices (good or
>   bad; it's not our business) such that it would be easier for them to
>   use PETSc if we supported threads (via some interface)?
> 
> 5. Are we doing it to check a box drawn by management?
> 
> 
>>   Are "PETSc" solvers like GAMG suppose to deliver great performance across the whole range? 
> 
> It seems more likely that MPI+threads will perform worse for all problem
> sizes, unless vendors make no effort to provide on-node MPI and we don't
> compensate by using MPI_Win_allocate_shared (in which case performance
> should be similar to the best you could expect with MPI+threads).
> 
>>   Jed seems to be hinting at having a relatively small t, possibly
>>   compared to the total number of hardware threads on the node. Is
>>   this correct, Jed?  Could we assume in PETSc that it is always small
>>   (and thus some of the performance challenges are gone)?
> 
> This is the crappy 2-level assumption.  If we don't want a 3-level (and
> 4-level, etc.) programming model, we need to make at least one level
> scalable.  And at that point, why have more than one level?  One
> argument is that exotic/side-channel coordination between hardware
> threads sharing L1 cache would be easier if the address spaces match.
> This might be true and would imply a small t, but this is an advanced
> special-purpose technique that is not widely used.
> 
> 
> [1] Assuming no coordination between processes, as in MPI_Isend/Irecv or
> MPI_Put/Get.  This results in P processes competing for a shared
> resource (the network device) and acquiring it sequentially.  That's a
> length-P critical path and we want log(P).
> 
> [2] Relaxing coherence is one of few ways to reduce latency for on-node
> data dependencies.  I'd hate to see a bunch of people write code that
> would prevent such architectures from appearing.



