[petsc-dev] PETSc and threads

Barry Smith bsmith at mcs.anl.gov
Fri Jan 9 13:06:56 CST 2015


  Whenever one speaks about MPI+threads there is the question of how many threads t per MPI process one is talking about. (An equivalent way of stating this is how many MPI processes one has per node.)

   Is t 2?

   Is t 4?

   Is it the number of hardware threads on a "single memory socket"?

   Is it the number of hardware threads on a "CPU"?

   Is it the number of hardware threads on the entire node?

   Depending on the answer, one faces very different challenges in getting the best performance.

   Are "PETSc" solvers like GAMG supposed to deliver great performance across the whole range?

   Jed seems to be hinting at a relatively small t, possibly small compared to the total number of hardware threads on the node. Is this correct, Jed? Could we assume in PETSc that t is always small (and thus some of the performance challenges are gone)?

   Barry




> On Jan 9, 2015, at 9:44 AM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Mark Adams <mfadams at lbl.gov> writes:
>> No this is me.  They will probably have about 30K (2D linear FE) equations
>> per 40 Tflop node.  10% (4 Tflops) is too much resources for 30K equations
>> as it is.  No need to try utilize the GPU as far as I can see.
> 
> With multiple POWER9 sockets per node, you have to deal with NUMA and
> separate caches.  The rest of the application is not going to do this
> with threads, so you'll have multiple MPI processes anyway.  The entire
> problem will fit readily in L2 cache and you have a latency problem on
> the CPU alone.  Ask them to make neighborhood collectives fast.



