[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Sun Oct 7 07:58:53 CDT 2012
> Okay, the matrix will have to partition itself. What is the advantage of
> having a single CPU process addressing multiple GPUs? Why not use
> different MPI processes? (We can have the MPI processes sharing a node
> create a subcomm so they can decide which process is driving which device.)
>
> Making MPI a prerequisite for multi-GPU usage would be an unnecessary
> restriction, wouldn't it?
>
> Small point: I don't believe this, in fact the opposite. There are many
> equivalent ways of doing these things, and we should use the simplest and
> most structured approach that can accomplish our goal. We have already
> bought into MPI and should never fall into the trap of trying to support
> another paradigm at the expense of simplicity.
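
For concreteness, the node-local subcommunicator idea quoted above could look
roughly like the following sketch. This is not PETSc code and the names are
illustrative; it assumes MPI-3's MPI_Comm_split_type and the CUDA runtime are
available.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm nodecomm;
  int      local_rank, local_size, ndevices = 0, mydevice = -1;

  MPI_Init(&argc, &argv);

  /* Group the ranks sharing a node into one subcommunicator (MPI-3) */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &local_rank);
  MPI_Comm_size(nodecomm, &local_size);

  /* Let each node-local rank drive one of the devices on that node */
  cudaGetDeviceCount(&ndevices);
  if (ndevices > 0) {
    mydevice = local_rank % ndevices;
    cudaSetDevice(mydevice);
  }
  printf("node-local rank %d of %d drives device %d (of %d)\n",
         local_rank, local_size, mydevice, ndevices);

  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}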
From the simplicity point of view, you're absolutely right. However, MPI is
only one implementation of the model used for distributed computing/memory
(including 'faking' distributed memory on a shared-memory system). With the
complex memory hierarchies introduced in recent years, we may have to adapt
our programming approach in order to get reasonable performance, even though
MPI could accomplish the same thing, albeit at higher cost: even on
shared-memory systems, MPI messaging is not a free lunch.
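
Conversely, a single process can drive several GPUs without MPI by switching
the active device with cudaSetDevice() and issuing asynchronous work on a
per-device stream. A rough sketch, again not PETSc code and with illustrative
names only (it assumes the problem size divides evenly across the devices):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
  int ndevices = 0;
  cudaGetDeviceCount(&ndevices);
  if (ndevices < 1) { fprintf(stderr, "no CUDA device found\n"); return 1; }

  /* One stream and one buffer per device; each device gets a slice of the data */
  cudaStream_t *streams = (cudaStream_t *)malloc(ndevices * sizeof(cudaStream_t));
  double      **dbuf    = (double **)malloc(ndevices * sizeof(double *));
  double       *hbuf    = (double *)calloc(N, sizeof(double));
  size_t        chunk   = N / ndevices;

  for (int d = 0; d < ndevices; ++d) {
    cudaSetDevice(d);                    /* subsequent calls target device d */
    cudaStreamCreate(&streams[d]);
    cudaMalloc((void **)&dbuf[d], chunk * sizeof(double));
    cudaMemcpyAsync(dbuf[d], hbuf + d * chunk, chunk * sizeof(double),
                    cudaMemcpyHostToDevice, streams[d]);
    /* kernels for device d would be launched on streams[d] here */
  }

  /* Wait for all devices, then clean up */
  for (int d = 0; d < ndevices; ++d) {
    cudaSetDevice(d);
    cudaStreamSynchronize(streams[d]);
    cudaFree(dbuf[d]);
    cudaStreamDestroy(streams[d]);
  }
  free(streams); free(dbuf); free(hbuf);
  return 0;
}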
Best regards,
Karli