[petsc-dev] programming model for PETSc

Matthew Knepley knepley at gmail.com
Fri Nov 25 16:48:59 CST 2011


On Thu, Nov 24, 2011 at 7:28 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Thu, Nov 24, 2011 at 6:37 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> On Nov 24, 2011, at 4:41 PM, Matthew Knepley wrote:
>>
>> > On Thu, Nov 24, 2011 at 4:09 PM, Barry Smith <bsmith at mcs.anl.gov>
>> wrote:
>> >
>> >   Jed,
>> >
>> >   Let's stop arguing about whether MPI is or is not a good base for the
>> next generation of HPC software but instead start a new conversation on
>> what API (implemented on top of or not on top of MPI/pthreads etc etc) we
>> want to build PETSc on to scale PETSc up to millions of cores with large
>> NUMA nodes and GPU-like accelerators.
>> >
>> >    What do you want in the API?
>> >
>> > Let's start with the "lowest" level, or at least the smallest. I think
>> the only sane way to program for portable performance here
>> > is using CUDA-type vectorization. This SIMT style is explained well
>> here
>> http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
>> > I think this is much easier and more portable than the intrinsics for
>> Intel, and more performant and less error prone than threads.
>> > I think you can show that it will accomplish anything we want to do.
>> OpenCL seems to have capitulated on this point. Do we agree
>> > here?
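A minimal sketch of the SIMT style in question (the kernel and its arguments
are hypothetical): each thread writes ordinary scalar code, branches included,
and the hardware handles the lane masking that SSE/AVX intrinsics force you to
do by hand.

    __global__ void clip(int n, const double *x, double *y)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* grid ID */
      if (i >= n) return;
      /* per-thread branch: SIMT hardware predicates divergent lanes,
         where SIMD intrinsics would need explicit mask arithmetic */
      y[i] = (x[i] > 0.0) ? x[i] : 0.0;
    }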
>>
>>    What syntax do you suggest for writing the code that is "vectorized"?
>>  What tools exist, could exist, for mapping from that syntax to what is
>> needed by the various compilers/hardware?
>>
>
> Some history. Brook is the basis for CUDA, but like any good foundation,
> almost everything the creator thought was important was thrown away
> to make something usable. Brook is a streaming language, much like Thrust.
> When problems fit this paradigm, it is fantastic. However, CUDA is
> not a streaming language. The programmer decides exactly what memory to
> put where, so what of Brook is left?
>
> Both CUDA and OpenCL inherit the process grid. Of course, MPI already has
> this (rank), so what is different? All threads in a vector have access
> to shared memory. I guess you could get the same thing in MPI if you had
> an idea of a "neighborhood" which had shared memory. This is exactly
> how OpenCL handles it. You specify a "vector length" (our neighborhood
> size) and the compiler tries its ass off to vectorize the code. In fact, you
> could probably manage everything I want with subcommunicators and a
> runtime code generator.
>
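A minimal sketch of that neighborhood idea in MPI terms, using the MPI-3
interfaces MPI_Comm_split_type and MPI_Win_allocate_shared (the function name
and arguments here are hypothetical):

    #include <mpi.h>

    /* Sketch: split a communicator into "neighborhoods" of ranks that share
       physical memory, then allocate a window they can all load/store. */
    int make_neighborhood(MPI_Comm comm, int n, double **base, MPI_Win *win,
                          MPI_Comm *node)
    {
      MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node);
      /* each rank in the node communicator contributes n doubles */
      return MPI_Win_allocate_shared((MPI_Aint)(n*sizeof(double)),
                                     sizeof(double), MPI_INFO_NULL, *node,
                                     base, win);
    }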

Synopsis of what I said before to elicit comment:

1) I think the only thing we can learn from Brook, CUDA, and OpenCL is that you
identify threads by a grid ID.

2) Things like BLAS are so easy that you can move up to the streaming
model, but this does not work for

  - FD and FEM residual evaluation (Jed has an FD example with Aron; SNES
ex52 is my FEM example)

  - FD and FEM Jacobian evaluation

3) If you look at ex52, I do a "thread transposition", meaning threads switch
to working on different areas of memory, which looks like a transpose on a
2D grid. I can do this using shared memory for the vector group; see the
sketch below.
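This is not the ex52 code itself; the textbook shared-memory tile transpose
below just shows the same move, where a block of threads stages data in shared
memory and comes out reading locations other than the ones it wrote:

    #define TILE 16
    __global__ void transpose(int n, const double *in, double *out)
    {
      __shared__ double tile[TILE][TILE + 1]; /* +1 pads away bank conflicts */
      int x = blockIdx.x * TILE + threadIdx.x;
      int y = blockIdx.y * TILE + threadIdx.y;
      if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
      __syncthreads();                        /* tile fully staged */
      x = blockIdx.y * TILE + threadIdx.x;    /* swap block roles... */
      y = blockIdx.x * TILE + threadIdx.y;    /* ...and thread data  */
      if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }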

The API is very simple. Give grid indices to the thread, and it's done in
CUDA and OpenCL essentially the same way.
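A minimal sketch of that API with a hypothetical kernel; the only identity a
thread ever sees is its grid index:

    __global__ void scale(int n, double a, double *x)
    {
      /* grid ID: block coordinate times block size plus thread coordinate;
         OpenCL spells the same index get_global_id(0) in a __kernel function */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
    }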

   Matt


>   For daxpy() the syntax doesn't really matter; anything will do. For
>> other kernels: maxpy, sparse matrix vector product, triangular solves,
>> P*A*Pt, mesh operations, sorts, indirect access .... the choice of syntax
>> likely matters a great deal. We should test the syntax out on a wide range
>> of kernels. For example look at VecMAXPY_kernel vs VecMAXPY_Seq vs
>> VecMAXPY_VecCUSPMAXPY4 and the three delegators VecMAXPY_SeqPThread,
>> VecMAXPY_MPI, and VecMAXPY_SeqCUSP.
>>
>
> For things as easy as BLAS, you can go all the way to the streaming-type
> kernel (Thrust, PyCUDA, etc.). I think this is basically
> solved. I am more interested in kernels like FD residual (ask Jed), FEM
> residual, and FEM Jacobian. I promise to finish this paper
> by mid-December.
>
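For reference, the streaming-type version of daxpy in Thrust; the functor
name is made up, but the pattern follows Thrust's stock examples:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct daxpy_functor {
      const double a;
      daxpy_functor(double a_) : a(a_) {}
      __host__ __device__
      double operator()(const double &x, const double &y) const
      {
        return a * x + y;  /* y <- a*x + y, elementwise */
      }
    };

    void daxpy(double a, thrust::device_vector<double> &x,
               thrust::device_vector<double> &y)
    {
      thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                        daxpy_functor(a));
    }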
>
>>   How does data layout relate to the vectorization you are going to do on
>> that data and vice versa?
>
>
> That is the crux. Vectorization is about execution layout (the CUDA thread
> grid). Somehow we must also lay out memory and match the two up. This is
> all of programming (which is why I initially liked the stuff from Victor).
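A small illustration of why layout and execution have to be matched, with
hypothetical field arrays: the same two-component field read by the same
thread grid is stride-2 in interleaved storage but perfectly coalesced in
split storage.

    __global__ void sum_interleaved(int n, const double *u, double *out)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = u[2*i] + u[2*i+1];  /* stride-2 loads */
    }

    __global__ void sum_split(int n, const double *ux, const double *uy,
                              double *out)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = ux[i] + uy[i];      /* unit-stride, coalesced */
    }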
>
>    Matt
>
>
>>
>>   Barry
>>
>> >
>> >    Matt
>> >
>> >
>> >   Barry
>> >
>>
>>
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener