since developing object-oriented software is so cumbersome in C and we are all resistant to doing it in C++

Matthew Knepley knepley at
Sat Dec 5 16:50:38 CST 2009

On Sat, Dec 5, 2009 at 4:25 PM, Jed Brown <jed at> wrote:

> On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley <knepley at>
> wrote:
> > I need to understand better. You are asking about the case where we have
> > many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
> > see the problem.
> Barry initially proposed one Python thread per node, then distributing
> the kernels over many CPU cores on that node, or to one-or-more GPUs.
> With some abuse of terminology, let's call them all worker threads,
> perhaps dozens if running on multicore CPUs, or hundreds/thousands when
> on a GPU.  The physics, such as FEM integration, has to be done by those
> worker threads.  But unless every thread is its own subdomain
> (i.e. Block Jacobi/ASM with very small subdomains), we still need to
> assemble a small number of matrices per node.  So we would need a
> lock-free concurrent MatSetValues, otherwise we'll only scale to a few
> worker threads before everything is blocked on MatSetValues.

I imagined that this kind of assembly will be handled similarly to what we
do in the FMM. You assign a few threads per element to calculate the FEM
integral. You could maintain this unassembled if you only need actions.
However, if you want an actual sparse matrix, there are a couple of options:

  1) Store the unassembled matrix, and run assembly after integration is
     complete. This needs more memory, but should perform well.

  2) Use atomic operations to update. I have not seen this yet, so I am
     unsure how it will perform.

  3) Use some memory scheme (monitor) to update. This will have terrible

Can you think of other options?
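To make option 1 and the unassembled action concrete, here is a minimal
serial sketch. It assumes a 1D Laplacian with linear elements on a uniform
mesh, and every name in it (element_matrix, integrate, apply_unassembled,
assemble, apply_assembled) is hypothetical, not PETSc API. The integration
phase writes one independent 2x2 block per element, so no two workers ever
touch the same memory; the conflicts only show up in the later summation.

```python
# Hypothetical sketch, not PETSc API: 1D Laplacian, linear elements,
# uniform mesh. Element e couples unknowns e and e + 1.

def element_matrix(h):
    """2x2 stiffness matrix of one linear element of length h."""
    k = 1.0 / h
    return [[k, -k], [-k, k]]

def integrate(n_elem, h):
    """Integration phase. Each element is independent, so each entry of
    this list could be filled by its own group of worker threads."""
    return [element_matrix(h) for _ in range(n_elem)]

def apply_unassembled(elem_mats, x):
    """Action y = A x straight from the element matrices. Serial here;
    a parallel version has the same scatter-add conflict on y that
    assembly has on the matrix."""
    y = [0.0] * len(x)
    for e, ke in enumerate(elem_mats):
        dofs = (e, e + 1)
        for a in range(2):
            for b in range(2):
                y[dofs[a]] += ke[a][b] * x[dofs[b]]
    return y

def assemble(elem_mats):
    """Assembly phase, run after integration is complete: sum duplicate
    (row, col) contributions into a sparse {(row, col): value} map."""
    A = {}
    for e, ke in enumerate(elem_mats):
        dofs = (e, e + 1)
        for a in range(2):
            for b in range(2):
                key = (dofs[a], dofs[b])
                A[key] = A.get(key, 0.0) + ke[a][b]
    return A

def apply_assembled(A, x):
    """Action y = A x from the assembled sparse map."""
    y = [0.0] * len(x)
    for (i, j), v in A.items():
        y[i] += v * x[j]
    return y

n_elem, h = 4, 0.25
elem_mats = integrate(n_elem, h)
A = assemble(elem_mats)
x = [float(i * i) for i in range(n_elem + 1)]
stored_unassembled = sum(len(row) for ke in elem_mats for row in ke)
```

In this toy case the unassembled form stores 16 values against 13 in the
assembled matrix, which is the memory cost mentioned in option 1; the gap
grows with element order and dimension.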


> > Hmm, still not quite getting this problem. We need concurrency on the
> > GPU, but why would we need it on the CPU?
> Only if we were doing real work on the many CPU cores per node.
> > On the GPU, triangular solve will be just as crappy as it currently
> > is, but will look even worse due to large number of cores.
> It could be worse because a single GPU thread is likely slower than a
> CPU core.
> > It is not the only smoother. For instance, polynomial smoothers would
> > be more concurrent.
> Yup.
> > > I have trouble finding decent preconditioning algorithms suitable for
> > > the fine granularity of GPUs.  Matt thinks we can get rid of all the
> > > crappy sparse matrix kernels and precondition everything with FMM.
> > >
> >
> > That is definitely my view, or at least my goal. And I would say this,
> > if we are just starting out on these things, I think it makes sense to
> > do the home runs first. If we just try and reproduce things, people
> > might say "That is nice, but I can already do that pretty well".
> Agreed, but it's also important to have something good to offer people
> who aren't ready to throw out everything they know and design a new
> algorithm based on a radically different approach that may or may not be
> any good for their physics.
> Jed
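On the polynomial smoother point quoted above, a minimal sketch of why
such smoothers are more concurrent than a triangular solve: each step
needs only a matrix-vector product and pointwise vector updates, and every
entry can be computed independently. The sketch uses damped Jacobi steps,
the simplest fixed-degree polynomial smoother, rather than Chebyshev. The
operator (1D Laplacian stencil [-1, 2, -1]), the damping factor 2/3, and
all function names are assumptions chosen for illustration.

```python
# Hypothetical sketch: damped Jacobi as a polynomial smoother for the
# 1D Laplacian stencil [-1, 2, -1] with homogeneous Dirichlet ends.

def matvec(x):
    """y = A x. Each y[i] depends only on x[i-1], x[i], x[i+1], so all
    entries can be computed concurrently."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y

def jacobi_smooth(b, x, steps=3, omega=2.0 / 3.0):
    """Apply `steps` damped Jacobi updates x <- x + omega * D^{-1} (b - A x).
    No triangular solve appears, only matvecs and pointwise updates."""
    diag = 2.0  # diagonal of A is constant for this stencil
    for _ in range(steps):
        Ax = matvec(x)
        x = [xi + omega * (bi - axi) / diag
             for xi, bi, axi in zip(x, b, Ax)]
    return x

def residual_norm(b, x):
    """Euclidean norm of b - A x."""
    Ax = matvec(x)
    return sum((bi - axi) ** 2 for bi, axi in zip(b, Ax)) ** 0.5

n = 8
b = [1.0] * n
x0 = [0.0] * n
x1 = jacobi_smooth(b, x0)
```

Three steps here correspond to a degree-3 polynomial in D^{-1} A applied to
the residual; the data dependencies per step are purely local, unlike the
sequential sweep of a triangular solve.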

What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener

More information about the petsc-dev mailing list