On Sat, Dec 5, 2009 at 6:01 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@59a2.org">jed@59a2.org</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley <<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

> You assign a few threads per element to calculate the FEM<br>

> integral. You could maintain this unassembled if you only need<br>

> actions.<br>

<br>

</div>You can also store it with much less memory as just values at quadrature<br>

points.<br><div class="im"></div></blockquote><div><br>Depends on the quadrature, but yes there are sometimes better storage schemes<br>(especially if you have other properties like decay).<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">

> However, if you want an actual sparse matrix, there are a couple of<br>

> options<br>

><br>

><br>

>   1) Store the unassembled matrix, and run assembly after integration<br>

> is complete. This needs more memory, but should perform well.<br>

<br>

</div>Fine, but how is this assembly done?  If it's serial then it would be a<br>

bottleneck, so you still need the concurrent thing below.<br><div class="im"></div></blockquote><div><br>Vec assembly can be done on the CPU, since it's so small, I think.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">

>   2) Use atomic operations to update. I have not seen this yet, so I am<br>

> unsure how it will perform.<br>

<br>

</div>Atomic operations could be used per-entry but this costs on the order of<br>

100 cycles on CPUs.  I think newer GPUs have atomics, but I don't know<br>

the cost.  Presumably it's at least as much as the latency of a read<br>

from memory.<br></blockquote><div><br>Not sure. Needs to be explored. Felipe is working on it.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


When inserting in decent sized chunks, it would probably be worth taking<br>

per-row or larger locks to amortize the cost of the atomics.<br>

Additionally, you could statically partition the workload and only use<br>

atomics for rows/entries that were shared.<font color="#888888"><br></font></blockquote><div><br>You can use partitioning/coloring techniques to increase the lockless concurrency.<br><br>  Matt<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<font color="#888888">

Jed<br>

</font></blockquote></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener<br>
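<br>For reference, the partitioning/coloring idea mentioned in the thread can be sketched roughly as follows. This is only a minimal illustration, not code from PETSc or from anyone's GPU work: the names color_elements and assemble_by_color, the greedy coloring, and the tiny 1D mesh are all assumptions made up for the example. Elements that share a node get different colors, so all elements of one color write to disjoint entries and could be processed concurrently without locks or atomics. The loop over each color is written serially here to keep the sketch short.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring: two elements that share a node get different colors."""
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbouring elements (those sharing a node).
        used = {
            colors[nb]
            for n in nodes
            for nb in node_to_elems[n]
            if colors[nb] >= 0
        }
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def assemble_by_color(elements, elem_vecs, num_nodes):
    """Sum element vectors into a global vector, one color at a time."""
    colors = color_elements(elements)
    result = [0.0] * num_nodes
    for c in range(max(colors) + 1):
        # Elements of one color touch disjoint nodes, so the work inside
        # this color could be split across threads with no atomics.
        for e, nodes in enumerate(elements):
            if colors[e] != c:
                continue
            for local, n in enumerate(nodes):
                result[n] += elem_vecs[e][local]
    return colors, result


# Hypothetical 1D mesh: four two-node elements on five nodes,
# each contributing 1.0 to both of its nodes.
elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
elem_vecs = [(1.0, 1.0)] * 4
colors, assembled = assemble_by_color(elements, elem_vecs, 5)
print(colors)
print(assembled)
```

The trade-off is a synchronization point between colors instead of an atomic per entry. Whether that wins on a given GPU is exactly the open question raised in the thread.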