since developing object-oriented software is so cumbersome in C and we are all resistant to doing it in C++

Matthew Knepley knepley at gmail.com
Sat Dec 5 16:02:38 CST 2009


On Sat, Dec 5, 2009 at 3:50 PM, Jed Brown <jed at 59a2.org> wrote:

> On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > generally we would have one python process per compute node and local
> > parallelism would be done via the low-level kernels to the cores
> > and/or GPUs.
>
> I think one MPI process per node is fine for MPI performance on good
> hardware because the HCA reads and writes from registered memory without
> involving the CPU, but I'm not sure it's actually a better model.
>
> How do you envision implementing MatSetValues()?  If there is only one
> MPI process per node, would there be another full level of domain
> decomposition based on threads?  Otherwise you need a concurrent
> MatSetValues which would make proper preallocation essential and make
> cache coherence a very sensitive matter.
>

I need to understand this better. You are asking about the case where we have
many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
see the problem.
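
To make the MatSetValues() question concrete, here is a rough sketch of
conventional preallocated assembly. The PETSc calls are the standard ones,
but the 1-D Laplacian test problem and the helper name AssembleLaplace1D are
only placeholders for illustration. With a single MPI rank per node, the row
loop below would have to be shared among many threads, which is exactly where
concurrent insertion and careful preallocation become delicate.

/* Illustrative sketch only: ordinary (single-threaded) preallocated assembly.
 * A concurrent MatSetValues() would need the row loop split among threads,
 * with per-row insertion either partitioned up front or serialized. */
#include <petscmat.h>

static PetscErrorCode AssembleLaplace1D(MPI_Comm comm, PetscInt n, Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       i, rstart, rend;

  PetscFunctionBegin;
  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);
  /* Preallocate 3 nonzeros per row in the diagonal block, 1 off-process */
  ierr = MatMPIAIJSetPreallocation(*A, 3, NULL, 1, NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(*A, 3, NULL);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(*A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {  /* this loop is what threads would share */
    PetscInt    cols[3], ncols = 0;
    PetscScalar vals[3];
    if (i > 0)   { cols[ncols] = i-1; vals[ncols++] = -1.0; }
    cols[ncols] = i; vals[ncols++] = 2.0;
    if (i < n-1) { cols[ncols] = i+1; vals[ncols++] = -1.0; }
    ierr = MatSetValues(*A, 1, &i, ncols, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}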

> And the huge algorithmic issue: triangular kernels are the backbone of
> almost every preconditioner and are inherently sequential.  If only one
> process per node does MPI, then all these algorithms would need
> three-level implementations (decompose the per-node subdomains into
> per-core subdomains and use a different concurrency scheme at this
> smaller granularity).  The use of threads on the many cores per node
> potentially offers more performance through the use of lock-free shared
> data structures with NUMA-aware work distribution.  But separate memory
> space is much more deterministic, thus easier to work with.
>

Hmm, I am still not quite getting this problem. We need concurrency on the GPU,
but why would we need it on the CPU? On the GPU, triangular solve will be
just as crappy as it currently is, but will look even worse due to the large
number of cores. It just has very little concurrency. We need a better option;
it is not the only smoother. For instance, polynomial smoothers would be more
concurrent.
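
To see why a triangular solve resists this kind of concurrency while a
polynomial smoother does not, here is a minimal sketch over plain CSR arrays.
This is illustrative only: the function names are made up, and a single
weighted-Jacobi sweep stands in for the building block of a polynomial (e.g.
Chebyshev) smoother, which is just repeated matrix-vector work.

/* Sketch: x = L^{-1} b for lower-triangular L in CSR (rowptr/col/val, diagonal
 * stored).  Each x[i] depends on all earlier x[j], so the outer loop is a
 * sequential chain. */
#include <stddef.h>

void lower_solve(size_t n, const size_t *rowptr, const size_t *col,
                 const double *val, const double *b, double *x)
{
  for (size_t i = 0; i < n; i++) {
    double s = b[i], diag = 1.0;
    for (size_t k = rowptr[i]; k < rowptr[i+1]; k++) {
      if (col[k] < i)       s -= val[k] * x[col[k]];  /* needs x[j], j < i */
      else if (col[k] == i) diag = val[k];
    }
    x[i] = s / diag;
  }
}

/* One weighted-Jacobi sweep, xnew = x + w D^{-1} (b - A x): every row only
 * reads x, so the rows can be processed in any order or fully in parallel. */
void jacobi_sweep(size_t n, const size_t *rowptr, const size_t *col,
                  const double *val, const double *b, const double *x,
                  double *xnew, double w)
{
  for (size_t i = 0; i < n; i++) {
    double r = b[i], diag = 1.0;
    for (size_t k = rowptr[i]; k < rowptr[i+1]; k++) {
      if (col[k] == i) diag = val[k];
      r -= val[k] * x[col[k]];
    }
    xnew[i] = x[i] + w * r / diag;
  }
}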


> I have trouble finding decent preconditioning algorithms suitable for
> the fine granularity of GPUs.  Matt thinks we can get rid of all the
> crappy sparse matrix kernels and precondition everything with FMM.
>

That is definitely my view, or at least my goal. And I would say this: if we
are just starting out on these things, I think it makes sense to go for the
home runs first. If we just try to reproduce things, people might say, "That
is nice, but I can already do that pretty well."

   Matt


> Note that all Python implementations have a global interpreter lock
> which could also make a single Python process the bottleneck.
>
> Jed
>
-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener