since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

Sat Dec 5 18:01:29 CST 2009

On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley <knepley at gmail.com> wrote:
> You assign a few threads per element to calculate the FEM
> integral. You could maintain this unassembled if you only need
> actions.

You can also store it with much less memory as just values at quadrature
points.
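
To make the memory argument concrete, here is a minimal sketch, not taken
from any particular library. It assumes a hypothetical 1D problem
-(k u')' with linear elements on a uniform mesh of spacing h and one
quadrature point per element. The names quadrature_data and apply are
made up for illustration. Unassembled storage would keep a 2x2 element
matrix (4 values) per element, while this keeps a single value per
quadrature point and still provides the action y = A x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical setup: k[e] is the coefficient at the single quadrature
// point of element e, h is the uniform mesh spacing.
// Stored value per point: coefficient * weight * (gradient scaling)^2
//                       = k * h * (1/h)^2 = k / h.
std::vector<double> quadrature_data(const std::vector<double>& k, double h) {
    std::vector<double> q(k.size());
    for (std::size_t e = 0; e < k.size(); ++e)
        q[e] = k[e] / h;
    return q;
}

// Action y = A x using only the stored quadrature values; the element
// matrix q[e] * [[1, -1], [-1, 1]] is never formed.
std::vector<double> apply(const std::vector<double>& q,
                          const std::vector<double>& x) {
    assert(x.size() == q.size() + 1);  // n elements have n + 1 nodes
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t e = 0; e < q.size(); ++e) {
        const double flux = q[e] * (x[e + 1] - x[e]);
        y[e]     -= flux;
        y[e + 1] += flux;
    }
    return y;
}
```

The loop over elements still writes to nodes shared between neighbouring
elements, so a threaded version of apply runs into the same write
conflicts discussed for matrix assembly.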

> However, if you want an actual sparse matrix, there are a couple of
> options
>
> 
>   1) Store the unassembled matrix, and run assembly after integration
> is complete. This needs more memory, but should perform well.

Fine, but how is this assembly done?  If it's serial then it would be a
bottleneck, so you still need the concurrent thing below.

>   2) Use atomic operations to update. I have not seen this yet, so I am
> unsure how it will perform.

Atomic operations could be used per-entry but this costs on the order of
100 cycles on CPUs.  I think newer GPUs have atomics, but I don't know
the cost.  Presumably it's at least as much as the latency of a read
from memory.
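
As a rough CPU-side illustration of per-entry atomics, the sketch below
assembles the 1D Laplacian into a dense matrix from several threads. It
is only a sketch under stated assumptions: the dense row-major storage,
the round-robin element distribution, and the names atomic_add and
assemble_atomic are all made up for this example, and a GPU version
would use the device's own atomic add instead of std::atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Atomically add v to target with a compare-and-swap loop (valid C++11).
inline void atomic_add(std::atomic<double>& target, double v) {
    double old = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `old` with the current
    // value, so the loop simply retries with the fresh value.
    while (!target.compare_exchange_weak(old, old + v,
                                         std::memory_order_relaxed)) {
    }
}

// Assemble n linear elements of size h into an (n+1)x(n+1) row-major
// matrix. Elements are dealt round-robin, so neighbouring elements land
// on different threads and the shared diagonal entries are contended.
std::vector<double> assemble_atomic(std::size_t n, double h,
                                    unsigned nthreads) {
    assert(nthreads > 0);
    const std::size_t N = n + 1;
    std::vector<std::atomic<double>> A(N * N);
    for (auto& a : A) a.store(0.0, std::memory_order_relaxed);

    auto work = [&](unsigned t) {
        for (std::size_t e = t; e < n; e += nthreads) {
            const double ke[2][2] = {{1.0 / h, -1.0 / h},
                                     {-1.0 / h, 1.0 / h}};
            for (std::size_t i = 0; i < 2; ++i)
                for (std::size_t j = 0; j < 2; ++j)
                    atomic_add(A[(e + i) * N + (e + j)], ke[i][j]);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(work, t);
    for (auto& th : pool) th.join();

    // Copy out to plain doubles once all threads have finished.
    std::vector<double> out(N * N);
    for (std::size_t i = 0; i < N * N; ++i) out[i] = A[i].load();
    return out;
}
```

Every one of the four entries per element pays for an atomic
read-modify-write here, including entries that no other thread ever
touches.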

When inserting in decent-sized chunks, it would probably be worth taking
per-row or larger locks to amortize the cost of the atomics.
Additionally, you could statically partition the workload and only use
atomics for rows/entries that were shared.
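
A minimal sketch of combining the two ideas, again for a hypothetical 1D
Laplacian with dense storage. Each thread owns a contiguous block of
elements, rows touched by only one thread are updated without any
synchronization, and a per-row lock is taken only for rows on the
boundary between two blocks. The name assemble_partitioned and the use
of std::mutex as the row lock are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::vector<double> assemble_partitioned(std::size_t n, double h,
                                         unsigned nthreads) {
    assert(nthreads > 0 && n >= nthreads);  // every thread gets an element
    const std::size_t N = n + 1;
    std::vector<double> A(N * N, 0.0);
    std::vector<std::mutex> row_lock(N);  // only shared rows are ever locked

    auto work = [&](unsigned t) {
        // Static partition: thread t owns elements [first, last).
        const std::size_t first = t * n / nthreads;
        const std::size_t last = (t + 1) * n / nthreads;
        for (std::size_t e = first; e < last; ++e) {
            const double ke[2][2] = {{1.0 / h, -1.0 / h},
                                     {-1.0 / h, 1.0 / h}};
            for (std::size_t i = 0; i < 2; ++i) {
                const std::size_t row = e + i;
                // A row is shared when its node sits between two blocks.
                const bool shared = (row == first && t > 0) ||
                                    (row == last && t + 1 < nthreads);
                std::unique_lock<std::mutex> guard(row_lock[row],
                                                   std::defer_lock);
                if (shared) guard.lock();  // one lock per row chunk
                for (std::size_t j = 0; j < 2; ++j)
                    A[row * N + (e + j)] += ke[i][j];
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(work, t);
    for (auto& th : pool) th.join();
    return A;
}
```

With n elements and T threads only T - 1 rows ever take a lock, so the
synchronization cost scales with the size of the partition boundary
rather than with the number of matrix entries.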

Jed