[petsc-dev] Supporting OpenCL matrix assembly
Karl Rupp
rupp at mcs.anl.gov
Tue Sep 24 03:31:06 CDT 2013
>> This can obviously be done incrementally, so storing a batch of
>> element matrices to global memory is not a problem.
>
> If you store element matrices to global memory, you're using a ton of
> bandwidth (about 20x the size of the matrix if using P1 tets).
>
> What if you do the sort/reduce thing within thread blocks, and only
> write the reduced version to global storage?
My primary metric for GPU kernels is memory transfers from global memory
('flops are free'), hence what I suggest for the assembly stage is to go
with something CSR-like rather than COO. Pure CSR may be too expensive
in terms of entry lookup if several fields are involved (particularly in
3D), so one could instead push (column-index, value) pairs for each row,
which makes the merge-by-key much cheaper than for arbitrary COO
matrices.
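To illustrate, here is a minimal sketch of such a per-row merge, assuming
the (column-index, value) contributions have already been pushed into
per-row contiguous lists; names like 'contrib_offsets' and 'contrib_cols'
are made up for the example and not existing PETSc or ViennaCL API:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* One work-item per row: merge the row's pushed (column, value) pairs
 * into a preallocated CSR matrix. Contributions for row i live in
 * [contrib_offsets[i], contrib_offsets[i+1]). */
__kernel void merge_row_contributions(__global const int    *contrib_offsets,
                                      __global const int    *contrib_cols,
                                      __global const double *contrib_vals,
                                      __global const int    *csr_row_ptr,
                                      __global const int    *csr_col_ind,
                                      __global double       *csr_values,
                                      int                    num_rows)
{
  for (int row = get_global_id(0); row < num_rows; row += get_global_size(0)) {
    int row_begin = csr_row_ptr[row];
    int row_end   = csr_row_ptr[row + 1];
    for (int k = contrib_offsets[row]; k < contrib_offsets[row + 1]; ++k) {
      int col = contrib_cols[k];
      /* linear search is cheap for short FEM rows; binary search also works */
      for (int j = row_begin; j < row_end; ++j) {
        if (csr_col_ind[j] == col) {
          csr_values[j] += contrib_vals[k];  /* no atomics: one work-item owns the row */
          break;
        }
      }
    }
  }
}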
This, of course, requires knowledge of the nonzero pattern and of the
couplings among elements, yet this is reasonably cheap to extract for a
large number of problems (for example, (non)linear PDEs without
adaptivity). Also, the nonzero pattern comes rather cheaply if one
already uses element coloring to avoid expensive atomic writes to global
memory.
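As a rough sketch of the coloring route (again only illustrative: the
host-built arrays 'elem_of_color' and 'elem_dofs' are assumptions, and P1
triangles with 3 DOFs per element are used for brevity), one kernel launch
per color can scatter element matrices directly into CSR without atomics,
since elements of one color share no degrees of freedom:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void assemble_color_batch(__global const int    *elem_of_color,  /* element ids of this color */
                                   __global const int    *elem_dofs,      /* 3 global DOFs per element */
                                   __global const double *elem_mats,      /* 3x3 element matrices */
                                   __global const int    *csr_row_ptr,
                                   __global const int    *csr_col_ind,
                                   __global double       *csr_values,
                                   int                    num_elems_in_color)
{
  for (int e = get_global_id(0); e < num_elems_in_color; e += get_global_size(0)) {
    int elem = elem_of_color[e];
    for (int a = 0; a < 3; ++a) {
      int row = elem_dofs[3 * elem + a];
      for (int b = 0; b < 3; ++b) {
        int col = elem_dofs[3 * elem + b];
        for (int j = csr_row_ptr[row]; j < csr_row_ptr[row + 1]; ++j) {
          if (csr_col_ind[j] == col) {
            csr_values[j] += elem_mats[9 * elem + 3 * a + b];  /* no conflicts within a color */
            break;
          }
        }
      }
    }
  }
}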
Best regards,
Karli