[petsc-dev] Hybrid MPI/OpenMP reflections
Michael Lange
michael.lange at imperial.ac.uk
Fri Aug 9 08:31:45 CDT 2013
Hi,
I just had a look at the threaded version of MatMult_SeqAIJ and I think
the flop logging might be incorrect, because the nonzerorows aren't
counted in MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would
require a reduction, though, which could impact performance. Is this a
known problem, or is there a better way to compute the flop count that
doesn't require the nonzerorows?
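
For reference, a minimal sketch of the kind of counting I mean (the
helper name and arguments are made up, this is not the actual kernel
code): each thread would count the nonzero rows in its own range, and
the per-thread counts would be reduced before logging, so that the total
matches the 2*a->nz - nonzerorow value that, if I read it correctly, the
non-threaded MatMult_SeqAIJ logs.

  #include <petscsys.h>

  /* Hypothetical helper: count the rows in [rstart,rend) that contain
   * at least one nonzero, using the usual AIJ row pointer array ai.
   * Each thread would call this on its own ownership range; summing the
   * results (one atomic add, or a small per-thread array reduced by the
   * main thread) gives the nonzerorow value needed for flop logging. */
  static PetscInt CountNonzeroRows(const PetscInt *ai, PetscInt rstart, PetscInt rend)
  {
    PetscInt i, nonzerorow = 0;
    for (i = rstart; i < rend; i++) {
      if (ai[i+1] > ai[i]) nonzerorow++;
    }
    return nonzerorow;
  }
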
Alternatively, would it make sense to pre-compute the nonzerorows and
store them in the threadcomm? This might require more of the AIJ data
structure to be exposed to PetscLayoutSetUp /
PetscThreadCommGetOwnershipRanges though.
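
To make the pre-computation idea a bit more concrete, here is a rough
sketch (the function name is made up, and caching the counts next to the
thread ownership ranges is only a suggestion, not existing threadcomm
API). The counts would be filled in once, e.g. after assembly, so the
multiply kernel could log its flops without any runtime reduction:

  #include <petscsys.h>

  /* Hypothetical setup step: given the AIJ row pointer array ai and the
   * thread ownership ranges trstarts[0..nthreads], fill nzrows[t] with
   * the number of nonzero rows owned by thread t.  Running this once
   * after assembly would let the threaded kernel log its flops directly
   * from nzrows[t] instead of re-counting on every MatMult. */
  static void PrecomputeNonzeroRows(const PetscInt *ai, const PetscInt *trstarts,
                                    PetscInt nthreads, PetscInt *nzrows)
  {
    PetscInt t, i;
    for (t = 0; t < nthreads; t++) {
      nzrows[t] = 0;
      for (i = trstarts[t]; i < trstarts[t+1]; i++) {
        if (ai[i+1] > ai[i]) nzrows[t]++;
      }
    }
  }
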
Regards,
Michael
On 08/08/13 12:08, Matthew Knepley wrote:
> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange
> <michael.lange at imperial.ac.uk> wrote:
>
> Hi,
>
> We have recently been trying to re-align our OpenMP fork
> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master.
> Much of our early work has now been superseded by the threadcomm
> implementations. Nevertheless, there are still a few algorithmic
> differences between the two branches:
>
> 1) Enforcing MPI latency hiding by using task-based spMV:
> If the MPI implementation used does not actually provide truly
> asynchronous communication in hardware, performance can be
> increased by dedicating a single thread to overlapping MPI
> communication in PETSc. However, this is arguably a
> vendor-specific fix that requires significant code changes (i.e.
> the parallel section needs to be raised up by one level). So
> perhaps the strategy should be to give guilty vendors a hard time
> rather than messing up the current abstraction.
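
Just to illustrate what raising the parallel section up by one level
ends up looking like, a stripped-down sketch (not the actual code from
our branch; the kernel names and row ranges are placeholders): thread 0
completes the halo exchange while the remaining threads work on the
purely local rows, and everybody rejoins before the halo contribution is
added.

  #include <petscsys.h>
  #include <mpi.h>
  #include <omp.h>

  /* Placeholders for the local and halo parts of the multiply; they
   * stand in for the real kernels and are not PETSc functions. */
  extern void ComputeLocalRows(PetscInt rstart, PetscInt rend);
  extern void ComputeHaloRows(void);

  /* Hypothetical sketch of the task-based overlap: the parallel region
   * spans both the communication and the local computation. */
  void SpMVOverlap(PetscInt nlocalrows, int nreqs, MPI_Request *reqs)
  {
    #pragma omp parallel
    {
      int tid = omp_get_thread_num();
      int nth = omp_get_num_threads();
      if (tid == 0) {
        /* dedicated thread completes the MPI halo exchange */
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        if (nth == 1) ComputeLocalRows(0, nlocalrows); /* no spare threads */
      } else {
        /* split the local rows over the remaining nth-1 threads by hand,
         * since a worksharing construct cannot exclude thread 0 */
        PetscInt chunk  = (nlocalrows + nth - 2) / (nth - 1);
        PetscInt rstart = (PetscInt)(tid - 1) * chunk;
        PetscInt rend   = PetscMin(rstart + chunk, nlocalrows);
        if (rstart < rend) ComputeLocalRows(rstart, rend);
      }
    } /* implicit barrier: communication is complete from here on */
    ComputeHaloRows();
  }
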
>
> 2) Nonzero-based thread partitioning:
> Rather than evenly dividing the number of rows among threads, we
> can partition the thread ownership ranges according to the number
> of non-zeros in each row. This balances the work load between
> threads and thus increases strong scalability due to optimised
> bandwidth utilisation. In general, this optimisation should
> integrate well with threadcomms, since it only changes the thread
> ownership ranges, but it does require some structural changes
> since nnz is currently not passed to PetscLayoutSetUp. Any
> thoughts on whether people regard such a scheme as useful would be
> greatly appreciated.
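
To make the partitioning idea concrete, here is a rough sketch of how
the ownership ranges could be chosen from the row pointer array (the
function name is made up, and this is not what
PetscThreadCommGetOwnershipRanges does today, which is exactly why nnz
would need to reach PetscLayoutSetUp):

  #include <petscsys.h>

  /* Hypothetical sketch: choose the thread ownership ranges so that each
   * thread gets roughly the same number of nonzeros instead of the same
   * number of rows.  ai is the AIJ row pointer array (ai[nrows] is the
   * total nnz); trstarts must have room for nthreads+1 entries. */
  static void OwnershipRangesByNNZ(const PetscInt *ai, PetscInt nrows,
                                   PetscInt nthreads, PetscInt *trstarts)
  {
    PetscInt t, row = 0;
    trstarts[0] = 0;
    for (t = 1; t < nthreads; t++) {
      PetscInt target = (ai[nrows] / nthreads) * t; /* nnz threads 0..t-1 should cover */
      while (row < nrows && ai[row] < target) row++;
      trstarts[t] = row;
    }
    trstarts[nthreads] = nrows; /* last thread ends at the final row */
  }
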
>
>
> I think this should be handled by changing the AIJ data structure.
> Going all the way to "2D" partitions would also allow
> us to handle power-law matrix graphs. This would keep the thread
> implementation simple, and at the same time be more
> flexible.
>
> Matt
>
> 3) MatMult_SeqBAIJ not threaded:
> Is there a reason why MatMult has not been threaded for BAIJ
> matrices, or is somebody already working on this? If not, I would
> like to prepare a pull request for this using the same approach as
> MatMult_SeqAIJ.
>
> We would welcome any suggestions/feedback on this, in particular
> the second point. Up-to-date benchmarking results for the first
> two methods, including runs on BlueGene/Q, can be found in:
> http://arxiv.org/abs/1307.4567
>
> Kind regards,
>
> Michael Lange
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener