[petsc-dev] Hybrid MPI/OpenMP reflections

Karl Rupp rupp at mcs.anl.gov
Thu Aug 8 09:14:46 CDT 2013


Hi Michael,

> We have recently been trying to re-align our OpenMP fork
> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of
> our early work has now been superseded by the threadcomm
> implementations. Nevertheless, there are still a few algorithmic
> differences between the two branches:
>
> 1) Enforcing MPI latency hiding by using task-based spMV:
> If the MPI implementation used does not actually provide truly
> asynchronous communication in hardware, performance can be increased by
> dedicating a single thread to overlapping MPI communication in PETSc.
> However, this is arguably a vendor-specific fix which requires
> significant code changes (i.e. the parallel section needs to be raised up
> by one level). So perhaps the strategy should be to give guilty vendors
> a hard time rather than messing up the current abstraction.

When using good preconditioners, spMV is essentially never the
bottleneck, so I don't think a separate communication thread should be
implemented in PETSc. Instead, such a fallback should be part of a good
MPI implementation.
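
For readers who haven't seen this in practice, the scheme boils down to
something like the following rough sketch (plain MPI/OpenMP, not PETSc
code; the CSR arrays and request list are placeholders, and at least two
OpenMP threads are assumed):

#include <mpi.h>
#include <omp.h>

/* Hypothetical sketch: overlap ghost-value communication with the local
 * part of y = A*x by dedicating thread 0 to MPI progress. The diagonal
 * and off-diagonal blocks are plain CSR arrays; the ghost sends/receives
 * in reqs[] are assumed to have been posted already. Needs >= 2 threads. */
static void spmv_overlap(int nrows,
                         const int *dai, const int *daj, const double *daa,
                         const int *oai, const int *oaj, const double *oaa,
                         const double *xloc, const double *xghost, double *y,
                         MPI_Request *reqs, int nreqs)
{
  #pragma omp parallel
  {
    int tid  = omp_get_thread_num();
    int nthr = omp_get_num_threads();

    if (tid == 0) {
      /* dedicated communication thread: drive the MPI progress engine */
      int done = 0;
      while (!done) MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
    } else {
      /* compute threads 1..nthr-1: local (diagonal) block, split by rows */
      int lo = (int)(((long)(tid - 1) * nrows) / (nthr - 1));
      int hi = (int)(((long) tid      * nrows) / (nthr - 1));
      for (int i = lo; i < hi; i++) {
        double sum = 0.0;
        for (int k = dai[i]; k < dai[i+1]; k++) sum += daa[k]*xloc[daj[k]];
        y[i] = sum;
      }
    }

    #pragma omp barrier   /* ghost values have now arrived */

    #pragma omp for       /* all threads: add off-process contributions */
    for (int i = 0; i < nrows; i++) {
      double sum = 0.0;
      for (int k = oai[i]; k < oai[i+1]; k++) sum += oaa[k]*xghost[oaj[k]];
      y[i] += sum;
    }
  }
}

The price is exactly what you describe: the parallel region has to wrap
the whole multiply including the communication, which is another reason
why I'd rather see the progress issue solved inside the MPI library.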


> 2) Nonzero-based thread partitioning:
> Rather than evenly dividing the number of rows among threads, we can
> partition the thread ownership ranges according to the number of
> non-zeros in each row. This balances the workload between threads and
> thus increases strong scalability due to optimised bandwidth
> utilisation. In general, this optimisation should integrate well with
> threadcomms, since it only changes the thread ownership ranges, but it
> does require some structural changes since nnz is currently not passed
> to PetscLayoutSetUp. Any thoughts on whether people regard such a scheme
> as useful would be greatly appreciated.

This is a reasonable optimization; I used a similar strategy for sparse
matrices on the GPU. Others should comment on whether the interface
change to PetscLayoutSetUp is acceptable.
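
For what it's worth, computing such ownership ranges is cheap; something
along the following lines, done once per layout, would suffice (a sketch
only; the array names are placeholders, not the actual threadcomm or
PetscLayout data structures):

#include <stdio.h>

/* Hypothetical sketch: derive per-thread row ownership ranges from the
 * CSR row offsets so that every thread gets roughly the same number of
 * nonzeros instead of the same number of rows. */
static void ranges_by_nnz(int nrows, const int *ai /* length nrows+1 */,
                          int nthreads, int *trstarts /* length nthreads+1 */)
{
  long nnz_total = ai[nrows];
  int  row = 0;
  trstarts[0] = 0;
  for (int t = 1; t < nthreads; t++) {
    long target = (nnz_total * t) / nthreads;  /* ideal nonzero count so far */
    while (row < nrows && ai[row] < target) row++;
    trstarts[t] = row;
  }
  trstarts[nthreads] = nrows;
}

int main(void)
{
  /* 6 rows with nnz = 1, 1, 1, 1, 8, 8 */
  int ai[] = {0, 1, 2, 3, 4, 12, 20};
  int trstarts[3];
  ranges_by_nnz(6, ai, 2, trstarts);
  /* row-based splitting gives 3 vs. 17 nonzeros per thread;
     nnz-based splitting gives rows [0,5) and [5,6), i.e. 12 vs. 8 */
  printf("thread 0: rows [%d,%d), thread 1: rows [%d,%d)\n",
         trstarts[0], trstarts[1], trstarts[1], trstarts[2]);
  return 0;
}

So the remaining question is really the one you raise, namely how the
nnz information reaches PetscLayoutSetUp without disturbing its current
interface.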


> 3) MatMult_SeqBAIJ not threaded:
> Is there a reason why MatMult has not been threaded for BAIJ matrices,
> or is somebody already working on this? If not, I would like to prepare
> a pull request for this using the same approach as MatMult_SeqAIJ.

To my knowledge, it 'simply hasn't been implemented yet'. A pull request
would be nice; I'm happy to review it.
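
For reference, the kind of kernel you presumably have in mind, i.e. a
block-row-threaded multiply in the spirit of MatMult_SeqAIJ, would look
roughly as follows (a simplified sketch for a generic block size with
column-major blocks, not the actual PETSc kernel):

/* Hypothetical sketch: y = A*x for a block-CSR (BAIJ-style) matrix with
 * block size bs, parallelized over block rows with OpenMP. Blocks are
 * stored densely, column-major, one after another in aa[]. */
static void bcsr_matmult_omp(int mbs /* number of block rows */, int bs,
                             const int *ai /* block-row offsets */,
                             const int *aj /* block-column indices */,
                             const double *aa /* bs*bs entries per block */,
                             const double *x, double *y)
{
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < mbs; i++) {
    double *yrow = &y[i*bs];
    for (int r = 0; r < bs; r++) yrow[r] = 0.0;
    for (int k = ai[i]; k < ai[i+1]; k++) {
      const double *blk  = &aa[(long)k*bs*bs];
      const double *xcol = &x[aj[k]*bs];
      for (int c = 0; c < bs; c++)          /* y_i += A_ij * x_j */
        for (int r = 0; r < bs; r++)
          yrow[r] += blk[c*bs + r] * xcol[c];
    }
  }
}

Whether the block rows are split evenly as above or via the nnz-based
ownership ranges from point 2 is orthogonal to the kernel itself.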

Best regards,
Karli
