[petsc-dev] PETSc and threads

Jed Brown jed at jedbrown.org
Fri Jan 9 19:42:57 CST 2015


"Nystrom, William David" <wdn at lanl.gov> writes:

> Well, I would really like to be able to do the experiment with PETSc - and I tried to do
> so back in the summer of 2013.  But I encountered problems which I documented with
> the current PETSc threadcomm package trying a really simple problem with cg and
> jacobi preconditioning.  And I don't believe those problems have been fixed.  And I
> don't believe there is any intention of fixing them with the current threadcomm package.
> So I can't do any meaningful experiments with PETSc related to MPI+threads.

Dave, getting objectively good performance with threads is hard.  A lot
of people try and fail, including Intel engineers trying to optimize
just one code.  The reason is that MPI+OpenMP is a crappy programming
model, especially the way it is usually used (which puts absurdly
expensive things like "omp parallel" in the critical path).  I don't
want to ship finicky crap that runs slower for most users, but the fact
is that the community does not know how to make MPI+OpenMP fast for
interesting problems.
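
To make the "omp parallel" cost concrete, here is a minimal sketch (not
PETSc code; the vec_axpy/vec_dot names and sizes are made up) of the
pattern I mean: every small kernel opens and closes its own parallel
region, so each Krylov-type iteration pays the fork/join and barrier
cost several times, and at realistic subdomain sizes that overhead is
comparable to the useful work.

  /* Sketch of the common MPI+OpenMP pattern: one "omp parallel" per
   * small kernel.  Compile with e.g. mpicc -fopenmp; the names are
   * illustrative only, not from any particular code. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* y <- y + alpha*x : enters and exits a parallel region every call */
  static void vec_axpy(int n, double alpha, const double *x, double *y)
  {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) y[i] += alpha * x[i];
  }

  /* global dot product: another parallel region plus an MPI reduction */
  static double vec_dot(int n, const double *x, const double *y,
                        MPI_Comm comm)
  {
    double local = 0.0, global;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++) local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
  }

  int main(int argc, char **argv)
  {
    int provided, n = 1 << 20;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Each "iteration" re-enters two parallel regions.  Shrink n to a
     * realistic per-rank subdomain and the fork/join and barrier time
     * dominates, which is why the same work spread over more MPI ranks
     * with one thread each is often faster. */
    double t = MPI_Wtime();
    for (int it = 0; it < 100; it++) {
      double d = vec_dot(n, x, y, MPI_COMM_WORLD);
      vec_axpy(n, 1.0 / d, x, y);
    }
    t = MPI_Wtime() - t;
    printf("100 iterations: %g s with %d threads per rank\n",
           t, omp_get_max_threads());

    free(x); free(y);
    MPI_Finalize();
    return 0;
  }

Comparing, say, 4 ranks x 1 thread against 1 rank x 4 threads on the
same cores gives a rough feel for this overhead, before NUMA or load
balance even enter the picture.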

I cite HPGMG-FV as an example because Sam understands hardware well and
conceived that code from the ground up for threads, yet it runs faster
with flat MPI on most machines at all problem sizes.

I posit that most examples of threads making a PDE solver faster are due
to poor use of MPI, poor choice of algorithm, or contrived
configuration.  I want to make the science and engineering that matters
faster, not check a box saying that we "do threads".

> Regarding HPGMG-FV, I never heard of it 

It is the finite volume version of our multigrid benchmark.

  https://hpgmg.org

> and have no idea whether it could be used in an ASC code to do the
> linear solves.  

It's a benchmark, not a library.  But it is representative of multigrid
solvers.  If you can't make it run faster using threads, there is no
point using threads in PETSc either, at least if your first concern is
solving real problems as fast as possible.

> I have also had some recent experience running a plasma simulation code called VPIC
> on Blue Gene Q with flat MPI and MPI+pthreads.  When I run with MPI+pthreads on
> Blue Gene Q, VPIC is noticeably faster even though I can only run with 3 threads per
> rank but can run in flat MPI mode with 4 ranks per core.  

This is an anecdote.  If you can explain why, we can have a productive
conversation.  Otherwise it's Just Run Shit® and not useful for
informing the design of a versatile library.

As I mentioned before, some apps may have made decisions that make
threads more _usable_ for them.  If that's the issue, let's have a
conversation about usability, not about solver performance.  If solver
performance is the first priority, we need to understand the fundamental
limitations of each choice.

> BTW, if you have references that document experiments comparing performance of
> flat MPI with MPI+threads, I would be happy to read them.

https://hpgmg.org/lists/archives/hpgmg-forum/2014-August/000091.html

Exploring Shared-memory Optimizations for an Unstructured Mesh CFD
Application on Modern Parallel Systems, Dheevatsa Mudigere, Srinivas
Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail
Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David
Keyes, IEEE International Parallel & Distributed Processing Symposium
(IPDPS), 2015, accepted for publication

https://www2.cisl.ucar.edu/sites/default/files/maynard_5a.pdf

https://www2.cisl.ucar.edu/sites/default/files/durachta_4.pdf

  

