since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

Barry Smith bsmith at
Sat Dec 5 20:29:22 CST 2009

On Dec 5, 2009, at 4:25 PM, Jed Brown wrote:

> On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley <knepley at> wrote:
>> I need to understand better. You are asking about the case where we have
>> many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
>> see the problem.
> Barry initially proposed one Python thread per node, then distributing
> the kernels over many CPU cores on that node, or to one or more GPUs.
> With some abuse of terminology, let's call them all worker threads,
> perhaps dozens if running on multicore CPUs, or hundreds/thousands when
> on a GPU.  The physics, such as FEM integration, has to be done by those
> worker threads.  But unless every thread is its own subdomain
> (i.e. Block Jacobi/ASM with very small subdomains), we still need to
> assemble a small number of matrices per node.  So we would need a
> lock-free concurrent MatSetValues, otherwise we'll only scale to a few
> worker threads before everything is blocked on MatSetValues.

    Likely we would need a hierarchical matrix storage format where
each "thread" has its own little "thread part" of the "node part" of
the global matrix. For matrix entries generated by one thread but "owned"
by another thread, entry stashing could be done in a way similar to how it
is done between nodes. That is, don't have a single matrix data
structure shared by all threads; each thread has its own matrix data
structure.
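
    A minimal Python sketch of that per-thread layout, just to make the
idea concrete. Everything in it (ThreadPart, add_value, assemble, the
row-range ownership rule) is a hypothetical name or assumption made for
illustration, not PETSc API. Each worker writes only into its own part,
so no lock is needed while values are being set; entries for rows owned
by another thread go into a stash that is handed over in a separate
assembly step once all workers are done.

```python
import threading
from collections import defaultdict


class ThreadPart:
    """One worker's private piece of the node-level matrix (hypothetical layout)."""

    def __init__(self, row_start, row_end):
        # Assumption: ownership is a contiguous row range [row_start, row_end).
        self.row_start, self.row_end = row_start, row_end
        self.local = defaultdict(float)  # (row, col) -> value, rows owned here
        self.stash = defaultdict(float)  # (row, col) -> value, rows owned elsewhere

    def owns(self, row):
        return self.row_start <= row < self.row_end

    def add_value(self, row, col, value):
        # Only the owning worker's thread ever calls this on a given part,
        # so no lock is taken here.
        target = self.local if self.owns(row) else self.stash
        target[(row, col)] += value


def assemble(parts):
    """Hand stashed entries to their owners. Call after all workers have joined."""
    for part in parts:
        for (row, col), value in part.stash.items():
            owner = next(p for p in parts if p.owns(row))
            owner.local[(row, col)] += value
        part.stash.clear()


def worker(part, entries):
    for row, col, value in entries:
        part.add_value(row, col, value)


def run_demo():
    parts = [ThreadPart(0, 2), ThreadPart(2, 4)]
    work = [
        [(0, 0, 1.0), (1, 1, 2.0), (2, 1, 0.5)],               # (2, 1) is off-thread
        [(2, 2, 3.0), (3, 3, 4.0), (1, 2, 0.25), (2, 1, 0.5)],  # (1, 2) is off-thread
    ]
    threads = [threading.Thread(target=worker, args=(p, w))
               for p, w in zip(parts, work)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assemble(parts)
    return parts
```

    The sketch only shows the data layout and the two phases (set values,
then assemble). Python threads will not give real parallel speedup here,
and a real implementation would presumably use a sparse format per thread
rather than a dictionary.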

>> Hmm, still not quite getting this problem. We need concurrency on the
>> GPU, but why would we need it on the CPU?
> Only if we were doing real work on the many CPU cores per node.

    Yes, absolutely one needs concurrency on the CPU cores in the node
as well as with GPUs. It would be nice to think that GPUs will become so
much faster than traditional CPUs that we can forget traditional CPUs,
but I think that is a dangerous assumption. Traditional CPU developers
are not going to sit on their butts; they will be driven to do a much
better job by the better GPUs.  The best thing that ever happened to
active set methods (like simplex) was the development of interior
point methods, which drove a dramatic improvement in the speed of
active set methods.


>> On the GPU, triangular solve will be just as crappy as it currently
>> is, but will look even worse due to the large number of cores.
> It could be worse because a single GPU thread is likely slower than a
> CPU core.
>> It is not the only smoother. For instance, polynomial smoothers would
>> be more concurrent.
> Yup.
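
    To illustrate why a polynomial smoother exposes more concurrency than
a triangular solve, here is a small pure-Python sketch. It uses damped
Jacobi, which is the simplest polynomial smoother (k steps apply a
degree-k polynomial in D^{-1}A to the residual), not Chebyshev. The
function names, the dense list-of-lists matrix and the default damping
factor are assumptions made for the example. In the Jacobi version every
row update within a step depends only on the previous iterate, so all
rows could be handled by independent workers; in the Gauss-Seidel sweep,
row i needs the already-updated values of the earlier rows, which is the
same sequential dependency a triangular solve has.

```python
def matvec(A, x):
    """Dense matrix-vector product; each output row is independent."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi_polynomial_smoother(A, b, x, steps=3, omega=2.0 / 3.0):
    """Apply `steps` damped Jacobi iterations: x <- x + omega * D^{-1} (b - A x).

    Each step is one matvec plus pointwise operations, so every row can be
    updated concurrently.
    """
    diag = [A[i][i] for i in range(len(A))]
    for _ in range(steps):
        Ax = matvec(A, x)
        x = [xi + omega * (bi - axi) / di
             for xi, bi, axi, di in zip(x, b, Ax, diag)]
    return x


def gauss_seidel_sweep(A, b, x):
    """One forward Gauss-Seidel sweep, shown for contrast.

    Row i reads x[j] for j < i after those entries were updated in this same
    sweep, so the rows must be processed in order.
    """
    x = list(x)
    n = len(A)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x
```

    For example, with the 4x4 matrix tridiag(-1, 2, -1) and a right-hand
side built from the all-ones solution, a few calls to
jacobi_polynomial_smoother starting from zero reduce the error without
any row-to-row ordering constraint.
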
>>> I have trouble finding decent preconditioning algorithms suitable for
>>> the fine granularity of GPUs.  Matt thinks we can get rid of all the
>>> crappy sparse matrix kernels and precondition everything with FMM.
>> That is definitely my view, or at least my goal. And I would say this,
>> if we are just starting out on these things, I think it makes sense to
>> do the home runs first. If we just try and reproduce things, people
>> might say "That is nice, but I can already do that pretty well".
> Agreed, but it's also important to have something good to offer people
> who aren't ready to throw out everything they know and design a new
> algorithm based on a radically different approach that may or may not be
> any good for their physics.
> Jed

More information about the petsc-dev mailing list