[petsc-dev] Hybrid MPI/OpenMP reflections

Barry Smith bsmith at mcs.anl.gov
Sat Aug 17 18:51:44 CDT 2013


   Michael,

    Thanks for reporting this. I have refactored the code to compute the nonzero-row count in MatAssemblyEnd_SeqAIJ() and to use it when logging the flops, so the flop logging also works with the threaded code. The same fix is applied for SeqBAIJ matrices. It is undergoing code review at https://bitbucket.org/petsc/petsc/pull-request/77/fix-matmult_seqaij-flop-count-and-simplify
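
For illustration only (this is a sketch of the idea, not the code in the pull request, and the field name nonzerorowcnt is an assumption), the change amounts to counting the nonzero rows once at assembly time and reusing that count whenever the flops are logged:

    #include <petscsys.h>

    /* Sketch: count the rows of a CSR matrix that contain at least one stored
       entry.  The fix computes such a count once in MatAssemblyEnd_SeqAIJ() and
       caches it on the matrix, so MatMult can log roughly
       PetscLogFlops(2.0*nz - nonzerorowcnt) without touching the kernel. */
    static PetscInt CountNonzeroRows(PetscInt m, const PetscInt *ai)
    {
      PetscInt i, cnt = 0;
      for (i = 0; i < m; i++) {
        if (ai[i+1] > ai[i]) cnt++;   /* row i has at least one entry */
      }
      return cnt;
    }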

   Barry


On Aug 9, 2013, at 8:31 AM, Michael Lange <michael.lange at imperial.ac.uk> wrote:

> Hi,
> 
> I just had a look at the threaded version of MatMult_SeqAIJ and I think the flop logging might be incorrect, because the nonzero rows are not counted in MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would require a reduction, though, which could hurt performance. Is this a known problem, or is there a better way to compute the flops that does not require the nonzero-row count?
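
(For comparison, the reduction described above might look roughly as follows if the kernel were written as a plain OpenMP loop; this is a sketch only, the actual threadcomm kernels are structured differently, and the array names simply follow the usual CSR convention.)

    #include <petscsys.h>

    /* Sketch: y = A*x for a CSR matrix, counting the nonzero rows with an OpenMP
       reduction so that 2*nnz - nzrows flops can be logged.  The reduction is the
       extra cost referred to above. */
    static PetscErrorCode CSRMultCountingNonzeroRows(PetscInt m, const PetscInt *ai,
                                                     const PetscInt *aj, const PetscScalar *aa,
                                                     const PetscScalar *x, PetscScalar *y)
    {
      PetscInt       i, j, nzrows = 0;
      PetscErrorCode ierr;

      PetscFunctionBegin;
    #pragma omp parallel for private(j) reduction(+:nzrows) schedule(static)
      for (i = 0; i < m; i++) {
        PetscScalar sum = 0.0;
        for (j = ai[i]; j < ai[i+1]; j++) sum += aa[j]*x[aj[j]];
        y[i] = sum;
        if (ai[i+1] > ai[i]) nzrows++;
      }
      /* one multiply and one add per stored entry, minus one add per nonzero row */
      ierr = PetscLogFlops(2.0*ai[m] - nzrows);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }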
> 
> Alternatively, would it make sense to pre-compute the nonzerorows and store them in the threadcomm? This might require more of the AIJ data structure to be exposed to PetscLayoutSetUp / PetscThreadCommGetOwnershipRanges though. 
> 
> Regards,
> Michael 
> 
> On 08/08/13 12:08, Matthew Knepley wrote:
>> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange <michael.lange at imperial.ac.uk> wrote:
>> Hi,
>> 
>> We have recently been trying to re-align our OpenMP fork (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of our early work has now been superseded by the threadcomm implementations. Nevertheless, there are still a few algorithmic differences between the two branches:
>> 
>> 1) Enforcing MPI latency hiding by using task-based spMV:
>> If the MPI implementation in use does not provide truly asynchronous communication in hardware, performance can be improved by dedicating a single thread to progressing MPI communication inside PETSc. However, this is arguably a vendor-specific fix that requires significant code changes (i.e. the parallel section needs to be raised up by one level), so perhaps the better strategy is to give guilty vendors a hard time rather than disturb the current abstraction.
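
(As a rough illustration of the dedicated communication thread, and not the petsc-3.3-omp code: the sketch below makes up its own function and arguments, and assumes MPI_THREAD_FUNNELED plus at least two OpenMP threads.)

    #include <mpi.h>
    #include <omp.h>

    /* Sketch: thread 0 finishes the halo exchange while the remaining threads
       multiply the purely local block, overlapping communication in software
       when the MPI library does not progress it asynchronously. */
    void spmv_local_with_overlap(int mloc, const int *ai, const int *aj,
                                 const double *aa, const double *x, double *y,
                                 MPI_Request *reqs, int nreqs)
    {
    #pragma omp parallel
      {
        const int tid = omp_get_thread_num();
        const int nth = omp_get_num_threads();
        if (tid == 0) {
          /* dedicated communication thread: drive the outstanding halo receives */
          MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        } else {
          /* compute threads: static row partition of the local (diagonal) block */
          const int lo = (mloc*(tid-1))/(nth-1);
          const int hi = (mloc*tid)/(nth-1);
          for (int i = lo; i < hi; i++) {
            double sum = 0.0;
            for (int j = ai[i]; j < ai[i+1]; j++) sum += aa[j]*x[aj[j]];
            y[i] = sum;
          }
        }
    #pragma omp barrier
        /* once the halo values have arrived, all threads can help apply the
           off-diagonal block (omitted here) */
      }
    }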
>> 
>> 2) Nonzero-based thread partitioning:
>> Rather than dividing the rows evenly among threads, we can partition the thread ownership ranges according to the number of non-zeros in each row. This balances the workload between threads and thus improves strong scalability through better bandwidth utilisation. In general, this optimisation should integrate well with threadcomms, since it only changes the thread ownership ranges, but it does require some structural changes because nnz is currently not passed to PetscLayoutSetUp. Any thoughts on whether people regard such a scheme as useful would be greatly appreciated.
>> 
>> I think this should be handled by changing the AIJ data structure. Going all the way to "2D" partitions would also allow
>> us to handle power-law matrix graphs. This would keep the thread implementation simple, and at the same time be more
>> flexible.
>> 
>>    Matt
>>  
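
(To make the nonzero-balanced split from point 2 concrete, here is a minimal sketch; it is not the petsc-3.3-omp or threadcomm code, and the array trstarts of nthreads+1 row offsets merely stands in for the thread ownership ranges.)

    #include <petscsys.h>

    /* Sketch: choose thread row ranges so that each thread owns roughly
       nnz/nthreads stored entries rather than m/nthreads rows.  'ai' is the CSR
       row pointer, so ai[m] is the total number of nonzeros. */
    static void SplitRowsByNonzeros(PetscInt m, const PetscInt *ai,
                                    PetscInt nthreads, PetscInt *trstarts)
    {
      PetscInt t, row = 0;

      trstarts[0] = 0;
      for (t = 1; t < nthreads; t++) {
        PetscInt target = (ai[m]*t)/nthreads;       /* prefix-sum target for range t-1 */
        while (row < m && ai[row] < target) row++;  /* advance to the balancing row */
        trstarts[t] = row;
      }
      trstarts[nthreads] = m;
    }
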
>> 3) MatMult_SeqBAIJ not threaded:
>> Is there a reason why MatMult has not been threaded for BAIJ matrices, or is somebody already working on this? If not, I would like to prepare a pull request for it, using the same approach as in MatMult_SeqAIJ.
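
(For what it is worth, a threaded BAIJ multiply along those lines might look roughly like the sketch below, hard-wired to block size 2 for brevity; ai/aj index block rows and columns, and aa stores the 2x2 blocks column-major as in the BAIJ format.)

    #include <petscsys.h>

    /* Sketch: y = A*x for a block-CSR matrix with 2x2 blocks, threading the loop
       over block rows just as the AIJ kernel threads the loop over rows. */
    static void BlockCSRMult_bs2(PetscInt mbs, const PetscInt *ai, const PetscInt *aj,
                                 const PetscScalar *aa, const PetscScalar *x, PetscScalar *y)
    {
      PetscInt i, j;

    #pragma omp parallel for private(j) schedule(static)
      for (i = 0; i < mbs; i++) {                    /* loop over block rows */
        PetscScalar s0 = 0.0, s1 = 0.0;
        for (j = ai[i]; j < ai[i+1]; j++) {
          const PetscScalar *v  = aa + 4*j;          /* one 2x2 block, column-major */
          const PetscScalar *xb = x + 2*aj[j];
          s0 += v[0]*xb[0] + v[2]*xb[1];
          s1 += v[1]*xb[0] + v[3]*xb[1];
        }
        y[2*i]   = s0;
        y[2*i+1] = s1;
      }
    }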
>> 
>> We would welcome any suggestions or feedback on this, in particular on the second point. Up-to-date benchmarking results for the first two methods, including runs on BlueGene/Q, can be found in:
>> http://arxiv.org/abs/1307.4567
>> 
>> Kind regards,
>> 
>> Michael Lange
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
> 



