[petsc-dev] PETSc and threads

Jed Brown jed at jedbrown.org
Fri Jan 9 23:18:37 CST 2015


Barry Smith <bsmith at mcs.anl.gov> writes:

>> On Jan 9, 2015, at 8:21 PM, Jed Brown <jed at jedbrown.org> wrote:
>> 
>> Barry Smith <bsmith at mcs.anl.gov> writes:
>> 
>>>     Just say: "I support the pure MPI model on multicore systems
>>>     including KNL" if that is the case or say what you do support; 
>> 
>> I support that (with neighborhood collectives and perhaps
>> MPI_Win_allocate_shared) if they provide a decent MPI implementation.  
>                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>    Damn it, you still cannot just make a statement without putting in caveats!

Vendors of proprietary networks don't have meaningful software
competition, so apathy toward a particular feature can ruin its
viability no matter how well the hardware could support it.  The only
reliable way to make something work is for it to be in the acceptance
tests for a major procurement.  And even then it may not be
competitively good, just good enough to pass the test.  I'm not aware of
neighborhood collectives being part of any major procurement, so history
says it's likely to either not work or be slower than the reference
implementation in MPICH.  We can try to buck that trend by convincing
them that it's important, but, much as with politicians, importance
appears to be measured exclusively in dollars.

>    Back to the question of how we respond to Intel's queries. Could the text basically start as 
>
>    Our plan to improve PETSc SOLVERS performance for KNL systems is to
>    provide an implementation of VecScatter, PetscSF, VecAssembly
>    communication and MatAssembly communication using MPI 3
>    neighborhood collectives. This will also improve PETSc performance
>    on Intel HSW systems. Thus PETSc application codes and frameworks
>    such as MOOSE would port over to KNL systems without major, or even
>    minor, rewrites by the application developers.
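
As a rough illustration of what that plan means at the MPI level: the
ghost exchange behind something like VecScatter can sit on top of
MPI_Dist_graph_create_adjacent plus MPI_Neighbor_alltoallv.  The sketch
below is only illustrative; the function name and argument layout are
made up here and are not PETSc's interface.

  #include <mpi.h>

  /* Sketch: exchange ghost values with the MPI-3 neighborhood collective
   * MPI_Neighbor_alltoallv over a graph communicator that encodes this
   * rank's sparse communication pattern. */
  int exchange_ghosts(MPI_Comm comm,
                      int nsources, const int sources[], /* ranks we receive from */
                      int ndests,   const int dests[],   /* ranks we send to */
                      const double sendbuf[], const int sendcounts[], const int sdispls[],
                      double recvbuf[],       const int recvcounts[], const int rdispls[])
  {
    MPI_Comm nbr;

    /* Build a distributed graph communicator from the known pattern;
       reorder=0 keeps the rank numbering, MPI_UNWEIGHTED skips weights. */
    MPI_Dist_graph_create_adjacent(comm,
                                   nsources, sources, MPI_UNWEIGHTED,
                                   ndests,   dests,   MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &nbr);

    /* One call moves all ghost data; the MPI library is free to exploit
       the node topology (e.g. shared memory within a KNL or HSW node). */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE, nbr);

    MPI_Comm_free(&nbr);
    return 0;
  }

In practice the graph communicator would be built once per scatter and
reused, since creating it is itself collective.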

Last I heard, MOOSE usually runs with a replicated mesh, so they use
threads for memory reasons.  If the mesh were either distributed or
placed in MPI_Win_allocate_shared memory, they could use one thread per
process.
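
For reference, a minimal sketch of that second option (the function and
variable names here are illustrative, not libMesh/MOOSE API): split off a
per-node communicator and let one rank per node allocate the replicated
data in an MPI_Win_allocate_shared window that the other ranks access
directly.

  #include <mpi.h>

  /* Sketch: allocate one copy of node-replicated data per node in MPI-3
   * shared memory so every rank on the node reads the same buffer instead
   * of holding a private replica. */
  int node_shared_array(MPI_Comm comm, MPI_Aint nbytes,
                        double **array, MPI_Win *win)
  {
    MPI_Comm nodecomm;
    int      noderank, disp;
    MPI_Aint sz;

    /* Communicator of the ranks that share this node's memory. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &noderank);

    /* Rank 0 on the node allocates the full array; the others allocate
       0 bytes and query rank 0's base pointer. */
    MPI_Win_allocate_shared(noderank == 0 ? nbytes : 0, sizeof(double),
                            MPI_INFO_NULL, nodecomm, array, win);
    MPI_Win_shared_query(*win, 0, &sz, &disp, array);

    /* The shared memory lives until MPI_Win_free; the node communicator
       is no longer needed here. */
    MPI_Comm_free(&nodecomm);
    return 0;
  }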

>   The use of MPI 3 neighborhood collectives will allow PETSc to take
>   advantage of the local memories available in the KNL and will avoid
>   all the large overhead of OpenMP parallel region launches .... blah
>   blah blah with specific data about Intel not doing better than MPI
>   only and timings for OpenMP regions.....
>
>    Make sense? Or do I still not understand what you are saying?

It makes sense, but it may not make them happy.
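
On the OpenMP-region-overhead point above: the kind of timing Barry wants
can come from a microbenchmark as small as the one below (the repetition
count and output format are just illustrative).

  #include <omp.h>
  #include <stdio.h>

  /* Sketch: time the launch of an (almost) empty OpenMP parallel region to
   * estimate the per-region fork/join overhead that a pure-MPI formulation
   * never pays. */
  int main(void)
  {
    const int    nreps = 10000;
    volatile int sink  = 0;
    double       t0, t1;

    t0 = omp_get_wtime();
    for (int i = 0; i < nreps; i++) {
      #pragma omp parallel
      {
        if (omp_get_thread_num() == 0) sink++; /* keep the region from being elided */
      }
    }
    t1 = omp_get_wtime();
    printf("average parallel region launch: %g us with %d threads\n",
           1e6*(t1 - t0)/nreps, omp_get_max_threads());
    return 0;
  }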

> Say they don't implement the neighborhood collectives and perhaps
> MPI_Win_allocate_shared properly; in fact, say they don't implement them
> at all or the performance is terrible. What about using the
> "lightweight" MPI model of the RPI guys? Have a single MPI process per
> node but spawn all the threads at the beginning, and have each of these
> threads run like a "lightweight" MPI process that, from the user's
> perspective, can make (a subset of) MPI calls.

If this is with MPI_THREAD_MULTIPLE, then it's a nonstarter in terms of
scalability.  If it uses MPI "endpoints" (a concept the Forum is
discussing), then I think it's totally viable.  It would be possible to
wrap MPI_Comm and implement whatever semantics we want, but wrapping
MPI_Comm is nasty.
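
The endpoints interface is still only a proposal, so there is nothing
standard to show for it yet, but the thread-as-rank model at minimum
requires MPI_THREAD_MULTIPLE, and it is worth checking up front what the
library actually grants (minimal sketch below; error handling kept short).

  #include <mpi.h>
  #include <stdio.h>

  /* Sketch: the "threads as lightweight MPI processes" model needs at
   * least MPI_THREAD_MULTIPLE; some installations silently provide less
   * than what was requested, so check the returned level. */
  int main(int argc, char **argv)
  {
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
      fprintf(stderr, "MPI provides thread level %d, not MPI_THREAD_MULTIPLE\n",
              provided);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... spawn threads here; each may make MPI calls, at the scalability
       cost noted above.  Endpoints would instead give each thread its own
       rank on a communicator. */
    MPI_Finalize();
    return 0;
  }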