On Sat, Dec 5, 2009 at 4:25 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@59a2.org">jed@59a2.org</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley <<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

> I need to understand better. You are asking about the case where we have<br>

> many GPUs and one CPU? If it's always one or two GPUs per CPU I do not<br>

> see the problem.<br>

<br>

</div>Barry initially proposed one Python thread per node, then distributing<br>

the kernels over many CPU cores on that node, or to one-or-more GPUs.<br>

With some abuse of terminology, let's call them all worker threads,<br>

perhaps dozens if running on multicore CPUs, or hundreds/thousands when<br>

on a GPU.  The physics, such as FEM integration, has to be done by those<br>

worker threads.  But unless every thread is its own subdomain<br>

(i.e. Block Jacobi/ASM with very small subdomains), we still need to<br>

assemble a small number of matrices per node.  So we would need a<br>

lock-free concurrent MatSetValues, otherwise we'll only scale to a few<br>

worker threads before everything is blocked on MatSetValues.<br><div class="im"></div></blockquote><div><br>I imagined that this kind of assembly would be handled similarly to what we do in the FMM.<br>You assign a few threads per element to calculate the FEM integral. You could keep the<br>

matrix unassembled if you only need its action. However, if you want an actual sparse matrix,<br>there are a few options:<br><br>  1) Store the unassembled matrix, and run assembly after integration is complete. This<br>       needs more memory, but should perform well.<br>

<br>  2) Use atomic operations to update. I have not seen this yet, so I am unsure how it will perform.<br><br>  3) Use some memory scheme (monitor) to update. This will have terrible performance.<br><br>Can you think of other options?<br>
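<br>To make option 1 concrete, here is a rough sketch in plain Python, not PETSc code.<br>It assumes a 1D mesh of linear elements, and all function names are made up for illustration.<br>Each element matrix is written to its own slot, so the integration step needs no locking;<br>the action can be applied from the unassembled storage, and assembly runs as a separate pass.<br>

```python
from collections import defaultdict


def element_stiffness_1d(h):
    """2x2 stiffness matrix of one linear 1D element of length h (illustrative)."""
    k = 1.0 / h
    return [[k, -k], [-k, k]]


def integrate_elements(n_elem, h):
    """Integration phase: every element writes only to its own slot.

    On a GPU each slot would be filled by its own group of threads; here
    a list comprehension stands in for that independent work.
    """
    conn = [(e, e + 1) for e in range(n_elem)]  # element -> global dofs
    elem_mats = [element_stiffness_1d(h) for _ in conn]
    return conn, elem_mats


def apply_unassembled(conn, elem_mats, x):
    """Compute y = A x directly from the element matrices, with no global matrix."""
    y = [0.0] * len(x)
    for dofs, Ke in zip(conn, elem_mats):
        for a, i in enumerate(dofs):
            for b, j in enumerate(dofs):
                y[i] += Ke[a][b] * x[j]
    return y


def assemble(conn, elem_mats):
    """Assembly phase, run after integration: sum overlapping entries.

    The result is a dict mapping (row, col) to value, standing in for a
    real sparse matrix format.
    """
    A = defaultdict(float)
    for dofs, Ke in zip(conn, elem_mats):
        for a, i in enumerate(dofs):
            for b, j in enumerate(dofs):
                A[(i, j)] += Ke[a][b]
    return dict(A)


conn, elem_mats = integrate_elements(n_elem=3, h=0.5)
A = assemble(conn, elem_mats)
y = apply_unassembled(conn, elem_mats, [0.0, 1.0, 2.0, 3.0])
```

<br>The extra memory is the full list of element matrices, which has to be kept until assembly finishes.<br>Note that the scatter into y in the unassembled action has the same write conflict as assembly,<br>only on a vector instead of a matrix, so it would also need coloring or atomics on a GPU.<br>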

<br>   Matt<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">

> Hmm, still not quite getting this problem. We need concurrency on the<br>

> GPU, but why would we need it on the CPU?<br>

<br>

</div>Only if we were doing real work on the many CPU cores per node.<br>

<div class="im"><br>

> On the GPU, triangular solve will be just as crappy as it currently<br>

> is, but will look even worse due to large number of cores.<br>

<br>

</div>It could be worse because a single GPU thread is likely slower than a<br>

CPU core.<br>

<div class="im"><br>

> It is not the only smoother. For instance, polynomial smoothers would<br>

> be more concurrent.<br>

<br>

</div>Yup.<br>

<div class="im"><br>

> > I have trouble finding decent preconditioning algorithms suitable for<br>

> > the fine granularity of GPUs.  Matt thinks we can get rid of all the<br>

> > crappy sparse matrix kernels and precondition everything with FMM.<br>

> ><br>

><br>

> That is definitely my view, or at least my goal. And I would say this,<br>

> if we are just starting out on these things, I think it makes sense to<br>

> do the home runs first. If we just try and reproduce things, people<br>

> might say "That is nice, but I can already do that pretty well".<br>

<br>

</div>Agreed, but it's also important to have something good to offer people<br>

who aren't ready to throw out everything they know and design a new<br>

algorithm based on a radically different approach that may or may not be<br>

any good for their physics.<br>

<font color="#888888"><br>

Jed<br>

</font></blockquote></div><br><br clear="all"><br>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener<br>