[petsc-dev] Supporting OpenCL matrix assembly

Karl Rupp rupp at mcs.anl.gov
Tue Sep 24 03:31:06 CDT 2013


>> This can obviously be done incrementally, so storing a batch of
>> element matrices to global memory is not a problem.
>
> If you store element matrices to global memory, you're using a ton of
> bandwidth (about 20x the size of the matrix if using P1 tets).
>
> What if you do the sort/reduce thing within thread blocks, and only
> write the reduced version to global storage?

My primary metric for GPU kernels is memory transfers from global memory
('flops are free'), hence what I suggest for the assembly stage is to go
with something CSR-like rather than COO. Pure CSR may be too expensive
in terms of element lookup if several fields are involved (particularly
in 3D), so one could instead push (column-index, value) pairs for each
row, which makes the merge-by-key much cheaper than for arbitrary COO
matrices.
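
To make this concrete, here is a minimal host-side C sketch of the
per-row merge, under the assumption that the element kernels have pushed
(column-index, value) pairs into a fixed number of slots per row; each
row is then sorted by column and duplicates are summed. The names
(merge_row_pairs, Pair, the fixed pairs_per_row layout) are hypothetical
and not PETSc API; on the GPU, one work-item or work-group per row would
do the same inside an OpenCL kernel.

/* Sketch: per-row sort of (col, val) pairs followed by a reduce-by-key,
 * producing CSR-like row data.  Hypothetical names and layout. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int col; double val; } Pair;

static int cmp_pair(const void *a, const void *b) {
  return ((const Pair*)a)->col - ((const Pair*)b)->col;
}

/* pairs: pairs_per_row entries per row (unsorted, duplicate columns allowed);
 * csr_cols/csr_vals: output buffers with the same row stride;
 * csr_len[row]: number of distinct column entries produced for that row. */
static void merge_row_pairs(int nrows, int pairs_per_row, Pair *pairs,
                            int *csr_cols, double *csr_vals, int *csr_len) {
  for (int row = 0; row < nrows; ++row) {
    Pair *p = pairs + (size_t)row * pairs_per_row;
    qsort(p, pairs_per_row, sizeof(Pair), cmp_pair);   /* sort by column */
    int n = 0;
    for (int i = 0; i < pairs_per_row; ++i) {          /* reduce by key  */
      size_t base = (size_t)row * pairs_per_row;
      if (n > 0 && csr_cols[base + n - 1] == p[i].col) {
        csr_vals[base + n - 1] += p[i].val;
      } else {
        csr_cols[base + n] = p[i].col;
        csr_vals[base + n] = p[i].val;
        ++n;
      }
    }
    csr_len[row] = n;
  }
}

int main(void) {
  /* two rows, three pushed pairs each; columns repeat within a row */
  Pair pairs[] = { {2,1.0}, {0,2.0}, {2,3.0},    /* row 0 */
                   {1,4.0}, {1,5.0}, {3,6.0} };  /* row 1 */
  int cols[6]; double vals[6]; int len[2];
  merge_row_pairs(2, 3, pairs, cols, vals, len);
  for (int r = 0; r < 2; ++r)
    for (int i = 0; i < len[r]; ++i)
      printf("row %d: (%d, %g)\n", r, cols[r*3+i], vals[r*3+i]);
  return 0;
}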

This, of course, requires knowledge of the nonzero pattern and of the
couplings among elements, yet this is reasonably cheap to extract for a
large class of problems (for example, (non)linear PDEs without
adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
uses coloring to avoid expensive atomic writes to global memory.
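
As an illustration of the coloring idea, here is a small greedy coloring
sketch in plain C (names and data layout are hypothetical, not PETSc
API): elements are colored so that no two elements of the same color
share a mesh node, i.e. a matrix row, so all elements of one color can
be assembled concurrently without atomics, with one kernel launch per
color.

/* Sketch: greedy element coloring over a node-sharing conflict graph. */
#include <stdio.h>

enum { NODES_PER_ELEM = 3 };   /* P1 triangles for illustration */

/* Returns the number of colors used; color[e] receives the color of element e. */
static int color_elements(int nelem, int nnodes,
                          const int elem_nodes[][NODES_PER_ELEM], int *color) {
  unsigned node_mask[nnodes];  /* colors already used at each node (bitmask) */
  for (int i = 0; i < nnodes; ++i) node_mask[i] = 0;
  int ncolors = 0;
  for (int e = 0; e < nelem; ++e) {
    unsigned forbidden = 0;
    for (int k = 0; k < NODES_PER_ELEM; ++k)
      forbidden |= node_mask[elem_nodes[e][k]];
    int c = 0;
    while (forbidden & (1u << c)) ++c;   /* smallest color not in conflict */
    color[e] = c;
    if (c + 1 > ncolors) ncolors = c + 1;
    for (int k = 0; k < NODES_PER_ELEM; ++k)
      node_mask[elem_nodes[e][k]] |= 1u << c;
  }
  return ncolors;
}

int main(void) {
  /* four triangles sharing nodes of a small strip mesh */
  const int elem_nodes[4][NODES_PER_ELEM] = { {0,1,2}, {1,2,3}, {2,3,4}, {3,4,5} };
  int color[4];
  int ncolors = color_elements(4, 6, elem_nodes, color);
  for (int c = 0; c < ncolors; ++c) {
    /* in a real assembly, enqueue one OpenCL kernel over the elements of color c */
    printf("color %d:", c);
    for (int e = 0; e < 4; ++e) if (color[e] == c) printf(" %d", e);
    printf("\n");
  }
  return 0;
}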

Best regards,
Karli
