[petsc-dev] PETSc and threads

Dave Nystrom Dave.Nystrom at tachyonlogic.com
Sat Jan 17 18:14:58 CST 2015


Jed Brown writes:
 > "Nystrom, William David" <wdn at lanl.gov> writes:
 > 
 > > Well, I would really like to be able to do the experiment with PETSc -
 > > and I tried to do so back in the summer of 2013.  But I encountered
 > > problems which I documented with the current PETSc threadcomm package
 > > trying a really simple problem with cg and jacobi preconditioning.  And
 > > I don't believe those problems have been fixed.  And I don't believe
 > > there is any intention of fixing them with the current threadcomm
 > > package.  So I can't do any meaningful experiments with PETSc related to
 > > MPI+threads.
 > 
 > Dave, getting objectively good performance with threads is hard.  A lot
 > of people try and fail, including Intel engineers trying to optimize
 > just one code.  The reason is that MPI+OpenMP is a crappy programming
 > model, especially the way it is usually used (which puts absurdly
 > expensive things like "omp parallel" in the critical path).  I don't
 > want to ship finicky crap that runs slower for most users, but the fact
 > is that the community does not know how to make MPI+OpenMP fast for
 > interesting problems.

When you say getting good performance with threads is hard, do you mean for
complicated preconditioners like multigrid and incomplete factorization
methods?  Or do you mean that it is hard to write a good CG solver with
simple preconditioning methods like Jacobi and block Jacobi?

You say the reason is that "MPI+OpenMP is a crappy programming model".
What about MPI+Pthreads?
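
To make sure I understand the "omp parallel" in the critical path concern,
here is a minimal sketch - purely illustrative, not PETSc code - of the same
vector update written two ways: one that opens an OpenMP parallel region
inside every solver iteration, and one that hoists a single parallel region
around the whole iteration loop.

/* Minimal sketch (not PETSc code): the same vector update written two
 * ways.  Pattern 1 opens an OpenMP parallel region inside every solver
 * iteration (the "omp parallel in the critical path" pattern); pattern 2
 * hoists one parallel region around the whole iteration loop, so only
 * the worksharing barrier remains in the critical path. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N   1000000
#define ITS 200

/* Pattern 1: a parallel region is created and torn down every iteration. */
static void axpy_per_iteration(double *y, const double *x, double alpha)
{
  for (int it = 0; it < ITS; it++) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) y[i] += alpha * x[i];
  }
}

/* Pattern 2: one parallel region outlives the loop; each iteration only
 * pays for the implicit barrier of the "omp for". */
static void axpy_hoisted(double *y, const double *x, double alpha)
{
  #pragma omp parallel
  {
    for (int it = 0; it < ITS; it++) {
      #pragma omp for
      for (int i = 0; i < N; i++) y[i] += alpha * x[i];
    }
  }
}

int main(void)
{
  double *x = malloc(N * sizeof(double));
  double *y = malloc(N * sizeof(double));
  for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 0.0; }

  double t0 = omp_get_wtime();
  axpy_per_iteration(y, x, 0.5);
  double t1 = omp_get_wtime();
  axpy_hoisted(y, x, 0.5);
  double t2 = omp_get_wtime();

  printf("per-iteration regions: %g s, hoisted region: %g s\n",
         t1 - t0, t2 - t1);
  free(x); free(y);
  return 0;
}

If the kernels end up written in the first style, I can see why the region
overhead would swamp something as cheap as a Jacobi-preconditioned CG
iteration once the work per thread gets small.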

 > I cite HPGMG-FV as an example because Sam understands hardware well and
 > conceived that code from the ground up for threads, yet it executes
 > faster with MPI on most machines at all problem sizes.
 > 
 > I posit that most examples of threads making a PDE solver faster are due
 > to poor use of MPI, poor choice of algorithm, or contrived
 > configuration.  I want to make the science and engineering that matters
 > faster, not check a box saying that we "do threads".

Well, so do I - to your last sentence.  But is it really possible to run 300+
MPI ranks on a single node as efficiently as running a single rank on a node
plus 300+ threads - where the threads are pthreads or perhaps a special
lightweight thread?  That is an honest question, not a rhetorical one,
because I don't really know how lightweight a vendor could make MPI ranks on
a node versus threads on a node.  And the number of threads or MPI processes
per node is only going to keep growing as the march to exascale continues.  A
lot of people seem to think we will need MPI+X, with MPI used only between
nodes.  I don't know who is right, but it makes me nervous to have to rely on
a good vendor MPI to get good performance on these new machines, just as I
don't feel I can rely on the compiler to optimize my code well enough to get
the best performance on a node.
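
To be concrete about what I mean by a single rank plus pthreads, below is a
minimal sketch - hypothetical, and not PETSc's threadcomm - of a persistent
worker pool: the threads are created once, partition the vector among
themselves, and synchronize with a barrier each iteration instead of being
created and joined inside the solver loop.

/* Minimal sketch (hypothetical, not PETSc's threadcomm): a persistent
 * pthread worker pool.  Each worker owns a slice of the vector, and the
 * only per-iteration synchronization cost is one barrier, standing in
 * for the thread creation/destruction that a naive fork/join scheme
 * would pay inside the solver loop. */
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4
#define ITS      200

static double x[N], y[N];
static pthread_barrier_t bar;

typedef struct { int tid; } worker_arg;

static void *worker(void *argp)
{
  worker_arg *arg = argp;
  int lo = arg->tid * (N / NTHREADS);
  int hi = (arg->tid == NTHREADS - 1) ? N : lo + N / NTHREADS;
  for (int it = 0; it < ITS; it++) {
    for (int i = lo; i < hi; i++) y[i] += 0.5 * x[i];
    /* One barrier per iteration replaces thread create/join. */
    pthread_barrier_wait(&bar);
  }
  return NULL;
}

int main(void)
{
  pthread_t  threads[NTHREADS];
  worker_arg args[NTHREADS];
  for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 0.0; }
  pthread_barrier_init(&bar, NULL, NTHREADS);
  for (int t = 0; t < NTHREADS; t++) {
    args[t].tid = t;
    pthread_create(&threads[t], NULL, worker, &args[t]);
  }
  for (int t = 0; t < NTHREADS; t++) pthread_join(threads[t], NULL);
  pthread_barrier_destroy(&bar);
  printf("y[0] = %g (expect %g)\n", y[0], 0.5 * ITS);
  return 0;
}

Whether a few hundred workers like this on a node can actually match a few
hundred MPI ranks is exactly the experiment I would like to be able to run.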

 > > Regarding HPGMG-FV, I never heard of it 
 > 
 > It is the finite volume version of our multigrid benchmark.
 > 
 >   https://hpgmg.org
 > 
 > > and have no idea whether it could be used in an ASC code to do the
 > > linear solves.  
 > 
 > It's a benchmark, not a library.  But it is representative of multigrid
 > solvers.  If you can't make it run faster using threads, there's no
 > point trying to use threads in PETSc if you're most concerned about
 > solving real problems as fast as possible.
 > 
 > > I have also had some recent experience running a plasma simulation code
 > > called VPIC on Blue Gene Q with flat MPI and MPI+pthreads.  When I run
 > > with MPI+pthreads on Blue Gene Q, VPIC is noticeably faster even though
 > > I can only run with 3 threads per rank but can run in flat MPI mode with
 > > 4 ranks per core.
 > 
 > This is an anecdote.  If you can explain why, we can have a productive
 > conversation.  Otherwise it's Just Run Shit® and not useful to inform a
 > versatile library.

It is a data point that makes me want to explore MPI+Pthreads further.  I
have not worked with VPIC enough to explain why MPI+Pthreads was faster than
MPI alone.  But I believe the MPI implementation for Blue Gene Q is supposed
to be pretty good.

 > As I mentioned before, some apps may have made decisions that make
 > threads more _usable_ to them.  If that's the issue, let's have a
 > conversation about usability, not about solver performance.  If solver
 > performance is the first priority, we need to understand the fundamental
 > limitations of each choice.
 > 
 > > BTW, if you have references that document experiments comparing
 > > performance of flat MPI with MPI+threads, I would be happy to read them.
 > 
 > https://hpgmg.org/lists/archives/hpgmg-forum/2014-August/000091.html
 > 
 > Exploring Shared-memory Optimizations for an Unstructured Mesh CFD
 > Application on Modern Parallel Systems, Dheevatsa Mudigere, Srinivas
 > Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail
 > Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David
 > Keyes, IEEE International Parallel & Distributed Processing Symposium
 > (IPDPS), 2015, accepted for publication
 > 
 > https://www2.cisl.ucar.edu/sites/default/files/maynard_5a.pdf
 > 
 > https://www2.cisl.ucar.edu/sites/default/files/durachta_4.pdf


