[petsc-dev] programming model for PETSc

Barry Smith bsmith at mcs.anl.gov
Fri Nov 25 09:02:51 CST 2011


   Jed,

   This is a continuation of the thread Matt started, in which he said he wanted to use "vectorization" for the low-level common kernels and I asked him for the syntax (for example CUDA, OpenCL, ...). You are talking about something completely different that is not relevant to that thread.

    Barry
On Nov 25, 2011, at 1:43 AM, Jed Brown wrote:

> On Thu, Nov 24, 2011 at 21:39, Barry Smith <bsmith at mcs.anl.gov> wrote:
> I have no idea what you are talking about. What is "local to global" and what are "these primitives"?
> 
> I wrote an earlier email where I outlined some primitives, namely a "pointwise" broadcast, reduce, fetch-and-add, gather, and scatter. Only the first three are truly "primitives", but all five are directly useful to applications. For these, we identify a "local" and a "global" space. The ownership of the send and receive buffers need not be local and global, but I'm using these names because they are most familiar to us. The specification of the communication graph _always_ resides in the local space, and the local space has the restriction that each point maps to at most one global point. As usual, several local points may map to the same global point.
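
A minimal single-process sketch of the semantics of the first two primitives (the names localtoglobal, BcastLocal, and ReduceGlobal are invented for illustration, not an existing PETSc interface; in parallel the map would carry (owner rank, offset) pairs and the loops become the one-sided MPI operations described below):

    #include <stddef.h>

    /* Each local point i maps to at most one global point localtoglobal[i];
       several local points may map to the same global point. */

    /* Pointwise broadcast: copy global (owned) values into the local space. */
    static void BcastLocal(size_t nlocal, const int localtoglobal[],
                           const double global[], double local[])
    {
      for (size_t i = 0; i < nlocal; i++) local[i] = global[localtoglobal[i]];
    }

    /* Pointwise reduce: sum local contributions back into the global space. */
    static void ReduceGlobal(size_t nlocal, const int localtoglobal[],
                             const double local[], double global[])
    {
      for (size_t i = 0; i < nlocal; i++) global[localtoglobal[i]] += local[i];
    }
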
> 
> All communication is initiated by the local side, although the global side may be involved to prepare buffers so that we expose useful collective semantics.
>  
> 
>      For example, how do I write a triangular solve using "local to global" and "these primitives"?
> 
> You want parallel sparse triangular solve? Which algorithm do you want to use?
>  
> 
>     How do I write the application of a stencil operator using "local to global" and "these primitives"?
> 
>      How do I write a sparse matrix-vector product using "local to global" and "these primitives"?
> 
> These two look the same: broadcast the values from global to local (the natural implementation is that local processes call MPI_Get(), completed when all call MPI_Win_fence()).
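
A minimal sketch of this get-based broadcast, assuming each rank exposes its owned values in an MPI window; the names BcastGlobalToLocal, owner, and offset are illustrative, not an existing PETSc API:

    #include <mpi.h>

    /* Broadcast owned global values into the local (ghosted) space with
       one-sided MPI.  owner[i]/offset[i] record where local point i lives
       in the global space (illustrative layout, not a PETSc structure). */
    static void BcastGlobalToLocal(MPI_Comm comm, int nowned, double *globalvals,
                                   int nlocal, const int owner[],
                                   const MPI_Aint offset[], double *localvals)
    {
      MPI_Win win;
      MPI_Win_create(globalvals, (MPI_Aint)(nowned*sizeof(double)),
                     (int)sizeof(double), MPI_INFO_NULL, comm, &win);
      MPI_Win_fence(0, win);
      for (int i = 0; i < nlocal; i++)              /* local side initiates */
        MPI_Get(&localvals[i], 1, MPI_DOUBLE, owner[i], offset[i],
                1, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);                        /* broadcast complete */
      MPI_Win_free(&win);
    }
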
> 
> For the transpose multiply, we apply the matrix and finish with a reduction onto the destination global space (naturally implemented with MPI_Accumulate() called by the senders, completed when all call MPI_Win_fence()). This is the same communication pattern as symmetric additive Schwarz and Neumann DD methods/storage formats.
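
The reduction direction can be sketched the same way (again with illustrative names), with the senders accumulating their contributions into the owners' values:

    #include <mpi.h>

    /* Sum local contributions onto the owning ranks' global values, the
       pattern described above for the transpose multiply. */
    static void ReduceLocalToGlobal(MPI_Comm comm, int nowned, double *globalvals,
                                    int nlocal, const int owner[],
                                    const MPI_Aint offset[], double *localvals)
    {
      MPI_Win win;
      MPI_Win_create(globalvals, (MPI_Aint)(nowned*sizeof(double)),
                     (int)sizeof(double), MPI_INFO_NULL, comm, &win);
      MPI_Win_fence(0, win);
      for (int i = 0; i < nlocal; i++)              /* senders accumulate */
        MPI_Accumulate(&localvals[i], 1, MPI_DOUBLE, owner[i], offset[i],
                       1, MPI_DOUBLE, MPI_SUM, win);
      MPI_Win_fence(0, win);                        /* reduction complete */
      MPI_Win_free(&win);
    }
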
> 
> These are the simplest building blocks; I think redistributing a mesh after partitioning is a more interesting operation. I explained how that would be done earlier in this thread, but I understand that there is a difference between explaining in an email and showing working code. I'll get busy with the latter in a couple of weeks.



