since developing object-oriented software is so cumbersome in C and we are all resistant to doing it in C++

Jed Brown jed at 59A2.org
Sat Dec 5 15:50:10 CST 2009


On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
> generally we would have one python process per compute node and local
> parallelism would be done via the low-level kernels to the cores
> and/or GPUs.

I think one MPI process per node is fine for MPI performance on good
hardware, because the HCA reads from and writes to registered memory
without involving the CPU, but I'm not sure it's actually a better
programming model.

How do you envision implementing MatSetValues()?  If there is only one
MPI process per node, would there be another full level of domain
decomposition based on threads?  Otherwise you need a concurrent
MatSetValues(), which would make proper preallocation essential and
cache coherence a very sensitive matter.
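
For reference, the assembly idiom in question looks like this; a
minimal C sketch against the plain one-rank-per-core PETSc API (the 1D
Laplacian, sizes, and preallocation counts are mine, for illustration):

   /* Minimal sketch (not PETSc source): preallocated assembly of a 1D
      Laplacian with MatSetValues(). */
   Mat            A;
   PetscInt       i, rstart, rend, n = 100, cols[3];
   PetscScalar    vals[3] = {-1.0, 2.0, -1.0};
   PetscErrorCode ierr;

   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
   ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
   /* Exact preallocation: at most 3 nonzeros per row here.  If several
      threads inserted concurrently, a preallocation miss would trigger
      reallocation of shared row storage at an unpredictable moment. */
   ierr = MatSeqAIJSetPreallocation(A, 3, PETSC_NULL);CHKERRQ(ierr);
   ierr = MatMPIAIJSetPreallocation(A, 3, PETSC_NULL, 1, PETSC_NULL);CHKERRQ(ierr);
   ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
   for (i = rstart; i < rend; i++) {
     if (i == 0 || i == n-1) {        /* boundary rows: diagonal only */
       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
     } else {
       cols[0] = i-1; cols[1] = i; cols[2] = i+1;
       ierr = MatSetValues(A, 1, &i, 3, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
     }
   }
   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

The question is what this loop becomes when the rows of A are shared by
all the threads on a node instead of being owned by separate processes.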

And the huge algorithmic issue: triangular kernels are the backbone of
almost every preconditioner and are inherently sequential.  If only one
process per node does MPI, then all these algorithms would need
three-level implementations (decompose the per-node subdomains into
per-core subdomains and use a different concurrency scheme at this
smaller granularity).  Threads on the many cores per node potentially
offer more performance via lock-free shared data structures with
NUMA-aware work distribution, but separate memory spaces are much more
deterministic and thus easier to work with.
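
To make the sequential nature concrete, here is the forward-substitution
loop at the heart of such a kernel; a schematic C fragment, where the
CSR layout and diagonal-last ordering are assumptions of the sketch:

   /* Schematic sparse lower-triangular solve, x = inv(L)*b, in CSR
      format.  Assumes column indices in each row are sorted so the
      diagonal entry comes last.  Illustrative only. */
   void lsolve(int n, const int *rowptr, const int *col,
               const double *val, const double *b, double *x)
   {
     int i, k;
     for (i = 0; i < n; i++) {
       double sum = b[i];
       /* Each off-diagonal term reads an x[col[k]] with col[k] < i,
          produced by an earlier iteration: the outer loop carries a
          true dependency and cannot simply be split across threads. */
       for (k = rowptr[i]; k < rowptr[i+1]-1; k++)
         sum -= val[k] * x[col[k]];
       x[i] = sum / val[rowptr[i+1]-1];  /* divide by the diagonal */
     }
   }

Level scheduling or coloring can expose some concurrency here, but only
as much as the sparsity pattern permits.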

I have trouble finding decent preconditioning algorithms suitable for
the fine granularity of GPUs.  Matt thinks we can get rid of all the
crappy sparse matrix kernels and precondition everything with FMM.

Note that CPython, the standard Python implementation, has a global
interpreter lock, which could also make a single Python process per
node the bottleneck.

Jed


