[petsc-dev] Supporting OpenCL matrix assembly

Matthew Knepley knepley at gmail.com
Tue Sep 24 05:36:30 CDT 2013


On Tue, Sep 24, 2013 at 2:45 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> Karl Rupp <rupp at mcs.anl.gov> writes:
>
> >>> This can obviously be done incrementally, so storing a batch of
> >>> element matrices to global memory is not a problem.
> >>
> >> If you store element matrices to global memory, you're using a ton of
> >> bandwidth (about 20x the size of the matrix if using P1 tets).
> >>
> >> What if you do the sort/reduce thing within thread blocks, and only
> >> write the reduced version to global storage?
> >
> > My primary metric for GPU kernels is memory transfers from global memory
> > ('flops are free'), hence what I suggest for the assembly stage is to go
> > with something CSR-like rather than COO. Pure CSR may be too expensive
> > in terms of element lookup if there are several fields involved
> > (particularly 3d), so one could push (column-index, value) pairs for
> > each row, making the merge-by-key much cheaper than for arbitrary COO
> > matrices.
>
> I think CSR vs. COO is a second-order optimization to be considered
> after the 20x redundancy has been eliminated and a synchronization
> strategy has been chosen (e.g., coloring vs redundant storage and later
> compression).


Yes. I do not understand Karl's suggestion about CSR/COO. My take-away from
Owens' talk at Brown was that synchronization is too expensive/complex and
that we should always do redundant storage+compression.
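
To make "redundant storage+compression" concrete, here is a minimal serial
sketch in plain Python. It is not PETSc code and not the scheme from the talk;
the names element_triples and compress and the toy 1D mesh are hypothetical.
Every element writes all of its (row, col, value) triples without any
synchronization, and a sort followed by a reduce-by-key merges the duplicates.

```python
from itertools import groupby
from operator import itemgetter

def element_triples(elements, element_matrix):
    """Redundant storage: one (row, col, value) triple per element-matrix entry."""
    triples = []
    for elem in elements:
        ke = element_matrix(elem)          # dense local matrix for this element
        for a, i in enumerate(elem):       # a = local row, i = global row
            for b, j in enumerate(elem):   # b = local col, j = global col
                triples.append((i, j, ke[a][b]))
    return triples

def compress(triples):
    """Compression: sort by (row, col), then sum entries that share a key."""
    key = itemgetter(0, 1)
    ordered = sorted(triples, key=key)
    return [(i, j, sum(v for _, _, v in group))
            for (i, j), group in groupby(ordered, key=key)]

# Hypothetical toy mesh: two 1D linear elements sharing node 1.
elements = [(0, 1), (1, 2)]
stiffness = lambda elem: [[1.0, -1.0], [-1.0, 1.0]]

raw = element_triples(elements, stiffness)   # 8 triples, (1, 1) appears twice
coo = compress(raw)                          # 7 unique entries
```

On a GPU the sort and the reduction would be parallel primitives; the point of
the sketch is only that no element ever has to coordinate with another while
writing.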

Please please please let's have an example where this takes > 5% of
simulation time. I do not really believe it is alright to work on something
that takes < 50%.

   Matt


> > This, of course, requires the knowledge of the nonzero pattern and
> > couplings among elements, yet this is reasonably cheap to extract for a
> > large number of problems (for example, (non)linear PDEs without
> > adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
> > uses coloring for avoiding expensive atomic writes to global memory.
>
> At this point, I don't mind having the nonzero pattern set ahead of time
> using CPU code.  It's reassembly in time-dependent problems with no
> adaptivity or occasional adaptivity that I'm more concerned with.
>
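
For comparison, the coloring alternative mentioned in the quoted text could be
sketched as below. This is a serial illustration in plain Python under stated
assumptions: color_elements and assemble_by_color are hypothetical names, the
coloring is a simple greedy one, and the matrix is stored densely only to keep
the example short. Elements that share a node get different colors, so all
elements of one color update disjoint matrix entries and could be processed
concurrently without atomic writes.

```python
def color_elements(elements):
    """Greedy coloring: two elements that share a node never get the same color."""
    node_colors = {}   # node -> colors already used by elements touching it
    colors = []
    for elem in elements:
        used = set().union(*(node_colors.get(n, set()) for n in elem))
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        for n in elem:
            node_colors.setdefault(n, set()).add(c)
    return colors

def assemble_by_color(elements, colors, element_matrix, n):
    """Add element matrices one color at a time into a dense n x n matrix."""
    A = [[0.0] * n for _ in range(n)]
    for c in range(max(colors) + 1):
        # Elements of color c touch disjoint nodes, so this inner loop is the
        # part that could run in parallel without write conflicts.
        for elem, ec in zip(elements, colors):
            if ec != c:
                continue
            ke = element_matrix(elem)
            for a, i in enumerate(elem):
                for b, j in enumerate(elem):
                    A[i][j] += ke[a][b]
    return A

# Hypothetical toy mesh: three 1D linear elements on four nodes.
elements = [(0, 1), (1, 2), (2, 3)]
stiffness = lambda elem: [[1.0, -1.0], [-1.0, 1.0]]

colors = color_elements(elements)
A = assemble_by_color(elements, colors, stiffness, 4)
```

The trade-off against the redundant-storage sketch is one synchronization
point per color instead of a sort and reduction over all element entries.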

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener