[petsc-dev] Supporting OpenCL matrix assembly

Mon Sep 23 16:46:03 CDT 2013

Matthew Knepley <knepley at gmail.com> writes:

> Okay, here is how I understand GPU matrix assembly. The only way it
> makes sense to me is in COO format which you may later convert. In
> mpiaijAssemble.cu I have code that
>
>   - Produces COO rows
>   - Segregates them into on and off-process rows

These users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES.

>   - Sorts and reduces by key

... then insert into diagonal and off-diagonal parts of owned matrices.
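
To make the steps above concrete, here is a minimal CPU-side sketch in Python, not the code in mpiaijAssemble.cu. The function names and the column range [col_start, col_end) are hypothetical, and it assumes every COO triple already belongs to a locally owned row.

```python
from itertools import groupby


def sort_reduce_coo(entries):
    """Sort (row, col, value) triples by (row, col) and sum duplicates."""
    def key(entry):
        return (entry[0], entry[1])

    reduced = []
    for (row, col), group in groupby(sorted(entries, key=key), key=key):
        reduced.append((row, col, sum(value for _, _, value in group)))
    return reduced


def split_diag_offdiag(reduced, col_start, col_end):
    """Split owned rows by column: columns in [col_start, col_end) go to the
    diagonal block, all other columns go to the off-diagonal block."""
    diag = [e for e in reduced if col_start <= e[1] < col_end]
    offdiag = [e for e in reduced if not (col_start <= e[1] < col_end)]
    return diag, offdiag


# Two 2x2 element matrices sharing global vertex 1, plus one coupling to a
# column (5) outside the locally owned column range [0, 3).
coo = [
    (0, 0, 1.0), (0, 1, -1.0), (1, 0, -1.0), (1, 1, 1.0),
    (1, 1, 1.0), (1, 2, -1.0), (2, 1, -1.0), (2, 2, 1.0),
    (2, 5, 0.5),
]
reduced = sort_reduce_coo(coo)
diag, offdiag = split_diag_offdiag(reduced, 0, 3)
```

The two contributions to entry (1, 1) collapse into a single value, and only the (2, 5) entry lands in the off-diagonal part.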

> This can obviously be done incrementally, so storing a batch of
> element matrices to global memory is not a problem.

If you store element matrices to global memory, you're using a ton of
bandwidth (about 20x the size of the matrix if using P1 tets).
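
A rough count shows how a factor of this size can arise. This is only a sketch under assumptions that are not stated above: a Kuhn (Freudenthal) triangulation of a structured grid (6 tets per vertex, 15 nonzeros per interior row), 16-byte COO triples (two 4-byte indices plus an 8-byte value), 12 bytes per stored CSR nonzero, and each raw entry written once and read back once.

```python
def traffic_ratios(tets_per_vertex=6, nnz_per_row=15,
                   coo_bytes=16, csr_bytes=12):
    """Ratios of raw element-matrix traffic to assembled-matrix size.

    All default values are assumptions for a structured tet mesh,
    not measurements.
    """
    entries_per_element = 4 * 4  # a P1 tet has 4 vertices
    raw_entries = tets_per_vertex * entries_per_element  # per vertex
    entry_ratio = raw_entries / nnz_per_row
    byte_ratio = raw_entries * coo_bytes / (nnz_per_row * csr_bytes)
    round_trip_ratio = 2 * byte_ratio  # written once, then read back
    return entry_ratio, byte_ratio, round_trip_ratio


entry_ratio, byte_ratio, round_trip_ratio = traffic_ratios()
print(f"entries: {entry_ratio:.1f}x, bytes: {byte_ratio:.1f}x, "
      f"write+read: {round_trip_ratio:.1f}x")
```

Under these assumptions the raw entries alone are about 6x the assembled nonzeros, and a write plus read-back of COO triples comes to roughly 17x the matrix size in bytes, before counting any extra passes made by the sort.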

What if you do the sort/reduce thing within thread blocks, and only
write the reduced version to global storage?
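
A small Python sketch of that idea, with a batch of elements standing in for a thread block and a dictionary standing in for a shared-memory sort/reduce. The names and the 1D test mesh are hypothetical; this only counts how many entries would reach global storage.

```python
from collections import defaultdict


def reduce_by_key(entries):
    """Sum (row, col, value) triples that share the same (row, col)."""
    acc = defaultdict(float)
    for row, col, value in entries:
        acc[(row, col)] += value
    return [(row, col, value) for (row, col), value in sorted(acc.items())]


def blockwise_assemble(element_entries, elements_per_block):
    """Reduce within each batch first, then reduce the batch results.

    element_entries is a list with one list of triples per element.
    Returns the final reduced triples and the number of triples that the
    batches wrote out (the stand-in for global-memory traffic).
    """
    written = []
    for start in range(0, len(element_entries), elements_per_block):
        batch = element_entries[start:start + elements_per_block]
        flat = [entry for element in batch for entry in element]
        written.extend(reduce_by_key(flat))  # only this leaves the "block"
    return reduce_by_key(written), len(written)


# 1D chain of 8 two-node elements; element k couples nodes k and k + 1.
elements = [
    [(k, k, 1.0), (k, k + 1, -1.0), (k + 1, k, -1.0), (k + 1, k + 1, 1.0)]
    for k in range(8)
]
raw_count = sum(len(element) for element in elements)
final, written_count = blockwise_assemble(elements, elements_per_block=4)
```

In this toy case 32 raw entries shrink to 26 written entries for 25 final nonzeros. The saving grows with the number of elements per block that share vertices, so it depends on how elements are grouped into blocks.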