[petsc-dev] PETSc and threads

Barry Smith bsmith at mcs.anl.gov
Fri Jan 9 23:33:14 CST 2015


> On Jan 9, 2015, at 11:18 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Barry Smith <bsmith at mcs.anl.gov> writes:
> 
>>> On Jan 9, 2015, at 8:21 PM, Jed Brown <jed at jedbrown.org> wrote:
>>> 
>>> Barry Smith <bsmith at mcs.anl.gov> writes:
>>> 
>>>>    Just say: "I support the pure MPI model on multicore systems
>>>>    including KNL" if that is the case or say what you do support; 
>>> 
>>> I support that (with neighborhood collectives and perhaps
>>> MPI_Win_allocate_shared) if they provide a decent MPI implementation.  
>>                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 
>>   Damn it, you still cannot just make a statement without putting in caveats!
> 
> Vendors of proprietary networks don't have meaningful software
> competition, therefore apathy for a particular feature can ruin its
> viability no matter how well the hardware could support it.  The only
> reliable way to make something work is for it to be in the acceptance
> tests for a major procurement.  And even then it may not be
> competitively good, just good enough to pass the test.  I'm not aware of
> neighborhood collectives being part of any major procurement, so history
> says it's likely to either not work or be slower than the reference
> implementation in MPICH.  We can try to buck that trend by convincing
> them that it's important, but much like politicians, it appears that
> importance is exclusively measured in dollars.
> 
>>   Back to the question of how we respond to Intel's queries. Could the text basically start as 
>> 
>>   Our plan to improve PETSc SOLVERS performance for KNL systems is to
>>   provide an implementation of VecScatter, PetscSF, VecAssembly
>>   communication and MatAssembly communication using MPI 3
>>   neighborhood collectives. This will also improve PETSc performance
>>   on Intel HSW systems. Thus PETSc application codes and frameworks
>>   such as MOOSE would port over to the KNL systems without major, or
>>   even minor, rewrites by the application developers.
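
   To be concrete about what that communication looks like: the scatter's send/receive pattern gets registered once with MPI_Dist_graph_create_adjacent and then each scatter is a single MPI_Neighbor_alltoallv on the resulting communicator. A rough sketch only (the function name and argument layout below are made up for illustration, not the actual VecScatter/PetscSF interfaces, and real code would cache the graph communicator in the scatter object):

  #include <stdlib.h>
  #include <mpi.h>

  /* Sketch: each rank sends sendcounts[i] doubles to sendranks[i] and
     receives recvcounts[j] doubles from recvranks[j]. */
  static void GhostExchange(MPI_Comm comm,
                            int nsend, const int sendranks[], const int sendcounts[],
                            int nrecv, const int recvranks[], const int recvcounts[],
                            const double *sendbuf, double *recvbuf)
  {
    MPI_Comm  nbr;
    int      *sdispls = malloc(sizeof(int)*(nsend ? nsend : 1));
    int      *rdispls = malloc(sizeof(int)*(nrecv ? nrecv : 1));
    int       i;

    for (i = 0; i < nsend; i++) sdispls[i] = i ? sdispls[i-1] + sendcounts[i-1] : 0;
    for (i = 0; i < nrecv; i++) rdispls[i] = i ? rdispls[i-1] + recvcounts[i-1] : 0;

    /* Register the communication pattern with MPI; with reorder nonzero the
       implementation could even renumber ranks to match the network. */
    MPI_Dist_graph_create_adjacent(comm,
                                   nrecv, recvranks, MPI_UNWEIGHTED,
                                   nsend, sendranks, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nbr);

    /* One call replaces the usual loop of MPI_Isend/MPI_Irecv plus MPI_Waitall. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE, nbr);

    MPI_Comm_free(&nbr);
    free(sdispls); free(rdispls);
  }
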
> 
> Last I heard, MOOSE usually runs with a replicated mesh, so they use
> threads for memory reasons.  

   In other words you are saying that libMesh is not MPI scalable? Well then MOOSE could switch to deal.II for their mesh and finite element library :-) (assuming that it is MPI scalable). Otherwise those damn mesh libraries better become MPI scalable; it's not that freaking hard.


> If the mesh was either distributed or
> placed in MPI_Win_allocate_shared memory, they could use 1 thread per
> process.
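
   For what it is worth, putting the replicated mesh in node shared memory is only a handful of MPI 3 calls. A rough sketch, with placeholder names (none of this exists in libMesh or PETSc today):

  #include <mpi.h>

  /* Sketch: rank 0 on each node allocates the single copy of the mesh storage
     in a shared-memory window; the other ranks on the node just map it. */
  static double *NodeSharedMesh(MPI_Comm comm, MPI_Aint meshbytes, MPI_Win *win)
  {
    MPI_Comm  nodecomm;
    int       noderank, disp;
    MPI_Aint  sz;
    double   *base;

    /* Communicator containing exactly the ranks that share this node's memory */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &noderank);

    /* One mesh copy per node: everyone except node rank 0 allocates 0 bytes */
    MPI_Win_allocate_shared(noderank == 0 ? meshbytes : 0, sizeof(double),
                            MPI_INFO_NULL, nodecomm, &base, win);

    /* The other ranks look up the address of rank 0's segment */
    if (noderank) MPI_Win_shared_query(*win, 0, &sz, &disp, &base);

    return base;  /* every rank on the node reads the same mesh through this */
  }
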
> 
>>  The use of MPI 3 neighborhood collectives will allow PETSc to take
>>  advantage of the local memories available in the KNL and will avoid
>>  all the large overhead of OpenMP parallel region launches .... blah
>>  blah blah with specific data about Intel not doing better than MPI
>>  only and timings for OpenMP regions.....
>> 
>>   Make sense? Or do I still not understand what you are saying?
> 
> It makes sense, but it may not make them happy.

  Do you think I give a flying fuck if it makes them happy?

   They've asked us what PETSc's plans are and how they can help us. Well, you need to articulate your plan and tell them what they need to do to help us. If they don't like your plan or refuse to help with your plan then they need to state that in writing. Look, PETSc totally ignored the shared memory craze of the late 90's (when vendors started putting 2 or more CPUs on the same motherboard with a shared memory card) and many other people wasted their time futzing around with OpenMP (David Keyes and Dinesh for example), generating lots of papers but no great performance; maybe this is a repeat of that, and maybe we already wasted too much time on the topic this time round with threadcomm etc. Don't worry about what DOE program managers want or NERSC managers want, worry about what is right technically.



> 
>> Say they don't do the neighborhood collectives and perhaps
>> MPI_Win_allocate_shared properly; in fact they don't do it at all or
>> they just have terrible performance. What about using the "light
>> weight" MPI model of the RPI guys? Have a single MPI process per node
>> but spawn all the threads at the beginning and have each of these
>> threads run like "light weight" MPI processes from the user
>> perspective, able to make (a subset of) MPI calls.
> 
> If this is with MPI_THREAD_MULTIPLE, then it's a nonstarter in terms of
> scalability.  If it uses MPI "endpoints" (a concept the Forum is
> discussing) then I think it's totally viable.  It would be possible to
> wrap MPI_Comm and implement whatever semantics we want, but wrapping
> MPI_Comm is nasty.
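
   For concreteness, that "light weight" model is essentially the sketch below: one MPI rank per node, the threads created once up front, each thread making its own MPI calls and distinguished only by message tags. It only works with MPI_THREAD_MULTIPLE, which is exactly the scalability objection; with endpoints each thread would instead get its own rank. The ring exchange here is made up just to have something to send:

  #include <pthread.h>
  #include <stdio.h>
  #include <mpi.h>

  #define NTHREADS 4

  typedef struct { int tid, rank, size; } ThreadArg;

  /* Each thread behaves like a "light weight rank": it does its own
     communication with the neighboring nodes, separated only by the tag. */
  static void *worker(void *p)
  {
    ThreadArg *a       = (ThreadArg *)p;
    int        right   = (a->rank + 1) % a->size;
    int        left    = (a->rank + a->size - 1) % a->size;
    double     sendval = a->rank * NTHREADS + a->tid, recvval;

    /* Concurrent MPI calls from every thread: legal under MPI_THREAD_MULTIPLE,
       but the library must lock internally, which is the scalability worry. */
    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, a->tid,
                 &recvval, 1, MPI_DOUBLE, left,  a->tid,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d thread %d received %g\n", a->rank, a->tid, recvval);
    return NULL;
  }

  int main(int argc, char **argv)
  {
    int       provided, rank, size, i;
    pthread_t th[NTHREADS];
    ThreadArg args[NTHREADS];

    /* All threads are spawned once, at the beginning, and all of them use MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < NTHREADS; i++) {
      args[i].tid = i; args[i].rank = rank; args[i].size = size;
      pthread_create(&th[i], NULL, worker, &args[i]);
    }
    for (i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    MPI_Finalize();
    return 0;
  }
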



