[petsc-dev] Hybrid MPI/OpenMP reflections

Matthew Knepley knepley at gmail.com
Thu Aug 8 06:08:28 CDT 2013

On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange
<michael.lange at imperial.ac.uk> wrote:

> Hi,
> We have recently been trying to re-align our OpenMP fork (
> https://bitbucket.org/ggorman/petsc-3.3-omp)
> with petsc/master. Much of our early work has now been superseded by the
> threadcomm implementations. Nevertheless, there are still a few algorithmic
> differences between the two branches:
> 1) Enforcing MPI latency hiding by using task-based spMV:
> If the MPI implementation used does not actually provide truly
> asynchronous communication in hardware, performance can be increased by
> dedicating a single thread to overlapping MPI communication in PETSc.
> However, this is arguably a vendor-specific fix which requires significant
> code changes (i.e., the parallel section needs to be raised up by one level).
> So perhaps the strategy should be to give guilty vendors a hard time rather
> than messing up the current abstraction.
> 2) Nonzero-based thread partitioning:
> Rather than evenly dividing the number of rows among threads, we can
> partition the thread ownership ranges according to the number of non-zeros
> in each row. This balances the work load between threads and thus increases
> strong scalability due to optimised bandwidth utilisation. In general, this
> optimisation should integrate well with threadcomms, since it only changes
> the thread ownership ranges, but it does require some structural changes
> since nnz is currently not passed to PetscLayoutSetUp. Any thoughts on
> whether people regard such a scheme as useful would be greatly appreciated.
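
For reference, the nonzero-based partitioning described in point 2 can be
sketched roughly as below. This is only an illustration, assuming a CSR row
pointer array is available when ownership ranges are computed; the function
name partition_rows_by_nnz and its signature are hypothetical and are not
part of PETSc or of the fork.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PETSc routine): split nrows CSR rows among
 * nthreads so that each thread owns roughly nnz/nthreads nonzeros.
 * row_ptr has nrows+1 entries; row i holds row_ptr[i+1]-row_ptr[i] nonzeros.
 * On return, thread t owns rows [start[t], start[t+1]); start must have
 * room for nthreads+1 entries. */
void partition_rows_by_nnz(size_t nrows, const size_t *row_ptr,
                           int nthreads, size_t *start)
{
    assert(nthreads > 0);
    const size_t nnz = row_ptr[nrows] - row_ptr[0];
    size_t row = 0;

    start[0] = 0;
    for (int t = 1; t < nthreads; ++t) {
        /* Ideal cumulative nonzero count at the start of thread t. */
        const size_t target =
            row_ptr[0] + (nnz * (size_t)t) / (size_t)nthreads;
        /* Advance to the first row boundary at or past the target. */
        while (row < nrows && row_ptr[row] < target) ++row;
        start[t] = row;
    }
    start[nthreads] = nrows;
}
```

With four rows of one nonzero each followed by one row of four nonzeros,
two threads would get rows [0,4) and [4,5), i.e. four nonzeros each, whereas
an even row split would give one thread most of the work.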

I think this should be handled by changing the AIJ data structure. Going
all the way to "2D" partitions would also allow us to handle power-law
matrix graphs. This would keep the thread implementation simple, and at
the same time be more


> 3) MatMult_SeqBAIJ not threaded:
> Is there a reason why MatMult has not been threaded for BAIJ matrices, or
> is somebody already working on this? If not, I would like to prepare a pull
> request for this using the same approach as MatMult_SeqAIJ.
> We would welcome any suggestions/feedback on this, in particular the
> second point. Up-to-date benchmarking results for the first two methods,
> including BlueGene/Q, can be found in:
> http://arxiv.org/abs/1307.4567
> Kind regards,
> Michael Lange
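
As a rough illustration of point 3, a row-threaded block-CSR product could
look like the sketch below. It is not taken from PETSc or from the fork: the
function name bcsr_matmult, the argument layout, and the column-major
storage of each block are assumptions made for the example, and the even
split of block rows via schedule(static) is just the simplest choice.

```c
#include <stddef.h>

/* Hypothetical sketch (not PETSc source): y = A*x for a block-CSR matrix
 * with square bs x bs blocks. brow_ptr and bcol index whole blocks; the
 * bs*bs values of each block are assumed to be stored contiguously in
 * column-major order. */
void bcsr_matmult(size_t nbrows, size_t bs, const size_t *brow_ptr,
                  const size_t *bcol, const double *val,
                  const double *x, double *y)
{
    /* Block rows write to disjoint parts of y, so they can be divided
     * among threads. If OpenMP is not enabled, the pragma is ignored and
     * the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (long ib = 0; ib < (long)nbrows; ++ib) {
        double *yb = y + (size_t)ib * bs;
        for (size_t r = 0; r < bs; ++r) yb[r] = 0.0;

        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; ++k) {
            const double *blk = val + k * bs * bs; /* block k */
            const double *xb = x + bcol[k] * bs;   /* matching slice of x */
            for (size_t c = 0; c < bs; ++c)
                for (size_t r = 0; r < bs; ++r)
                    yb[r] += blk[c * bs + r] * xb[c];
        }
    }
}
```

The per-thread row ranges here are an even split of block rows; a
nonzero-weighted split as in point 2 would replace the static schedule with
explicit per-thread ranges.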

