[petsc-dev] Hybrid MPI/OpenMP reflections

Fri Aug 9 08:31:45 CDT 2013

Hi,

I just had a look at the threaded version of MatMult_SeqAIJ and I think 
the Flops logging might be incorrect, because the nonzerorows aren't 
counted in MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would 
require a reduction though, which could impact performance. Is this a 
known problem, or is there a better way to compute Flops, which doesn't 
require the nonzerorows?

Alternatively, would it make sense to pre-compute the nonzerorows and 
store them in the threadcomm? This might require more of the AIJ data 
structure to be exposed to PetscLayoutSetUp / 
PetscThreadCommGetOwnershipRanges though.

Regards,
Michael

On 08/08/13 12:08, Matthew Knepley wrote:
> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange 
> <michael.lange at imperial.ac.uk <mailto:michael.lange at imperial.ac.uk>> 
> wrote:
>
>     Hi,
>
>     We have recently been trying to re-align our OpenMP fork
>     (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master.
>     Much of our early work has now been superseded by the threadcomm
>     implementations. Nevertheless, there are still a few algorithmic
>     differences between the two branches:
>
>     1) Enforcing MPI latency hiding by using task-based spMV:
>     If the MPI implementation used does not actually provide truly
>     asynchronous communication in hardware, performance can be
>     increased by dedicating a single thread to overlapping MPI
>     communication in PETSc. However, this is arguably a
>     vendor-specific fix which requires significant code changes (ie
>     the parallel section needs to be raised up by one level). So
>     perhaps the strategy should be to give guilty vendors a hard time
>     rather than messing up the current abstraction.
>
>     2) Nonzero-based thread partitioning:
>     Rather than evenly dividing the number of rows among threads, we
>     can partition the thread ownership ranges according to the number
>     of non-zeros in each row. This balances the work load between
>     threads and thus increases strong scalability due to optimised
>     bandwidth utilisation. In general, this optimisation should
>     integrate well with threadcomms, since it only changes the thread
>     ownership ranges, but it does require some structural changes
>     since nnz is currently not passed to PetscLayoutSetUp. Any
>     thoughts on whether people regard such a scheme as useful would be
>     greatly appreciated.
>
>
> I think this should be handled by changing the AIJ data structure. 
> Going all the way to "2D" partitions would also allow
> us to handle power-law matrix graphs. This would keep the thread 
> implementation simple, and at the same time be more
> flexible.
>
>    Matt
>
>     3) MatMult_SeqBAIJ not threaded:
>     Is there a reason why MatMult has not been threaded for BAIJ
>     matrices, or is somebody already working on this? If not, I would
>     like to prepare a pull request for this using the same approach as
>     MatMult_SeqAIJ.
>
>     We would welcome any suggestions/feedback on this, in particular
>     the second point. Up to date benchmarking results for the first
>     two methods, including BlueGene/Q, can be found in:
>     http://arxiv.org/abs/1307.4567
>
>     Kind regards,
>
>     Michael Lange
>
>
>
>
> -- 
> What most experimenters take for granted before they begin their 
> experiments is infinitely more interesting than any results to which 
> their experiments lead.
> -- Norbert Wiener

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130809/6b993cdb/attachment.html>