[petsc-dev] programming model for PETSc

Matthew Knepley knepley at gmail.com
Thu Nov 24 19:28:09 CST 2011


On Thu, Nov 24, 2011 at 6:37 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> On Nov 24, 2011, at 4:41 PM, Matthew Knepley wrote:
>
> > On Thu, Nov 24, 2011 at 4:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Jed,
> >
> >   Let's stop arguing about whether MPI is or is not a good base for the
> > next generation of HPC software, and instead start a new conversation
> > about what API (implemented on top of MPI/pthreads, etc., or not) we want
> > to build PETSc on to scale PETSc up to millions of cores with large NUMA
> > nodes and GPU-like accelerators.
> >
> >    What do you want in the API?
> >
> > Let's start with the "lowest" level, or at least the smallest. I think
> > the only sane way to program for portable performance here is using
> > CUDA-type vectorization. This SIMT style is explained well here:
> > http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
> > I think this is much easier and more portable than the intrinsics for
> > Intel, and more performant and less error prone than threads. I think
> > you can show that it will accomplish anything we want to do. OpenCL
> > seems to have capitulated on this point. Do we agree here?
>
>    What syntax do you suggest for writing the code that is "vectorized"?
>  What tools exist, could exist, for mapping from that syntax to what is
> needed by the various compilers/hardware?
>

Some history. Brook is the basis for CUDA, but like any good foundation,
almost everything its creator thought was important was thrown away to make
something usable. Brook is a streaming language, much like Thrust. When a
problem fits this paradigm, it is fantastic. However, CUDA is not a streaming
language: the programmer decides exactly what memory to put where, so what is
left of the streaming model?
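
For concreteness, here is a minimal sketch of the contrast (the kernel and
names are illustrative, not taken from any real code; assume 256-thread
blocks):

  // Streaming style: the library owns data movement and scheduling, e.g.
  //   thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(a));
  // CUDA style: the programmer decides what lives in which memory and when.
  __global__ void scaled_add(const double *x, double *y, double a, int n)
  {
    __shared__ double tile[256];             /* placed in on-chip memory by hand */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = x[i];     /* stage the tile explicitly */
    __syncthreads();
    if (i < n) y[i] = a*tile[threadIdx.x] + y[i];
  }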

Both CUDA and OpenCL inherit the process grid. Of course, MPI already has
this (rank), so what is different? All threads in a vector have access to
shared memory. I guess you could get the same thing in MPI if you had a
notion of a "neighborhood" with shared memory. This is exactly how OpenCL
handles it: you specify a "vector length" (our neighborhood size) and the
compiler tries its ass off to vectorize the code. In fact, you could probably
manage everything I want with subcommunicators and a runtime code generator.
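
A minimal sketch of that "neighborhood" in CUDA terms, where the thread block
is the neighborhood and __shared__ is its memory (sizes and names are
illustrative, and the block size is assumed to be a power of two):

  // Each block is a "neighborhood" whose threads share fast memory, roughly
  // what an MPI subcommunicator with a shared window would give you.
  __global__ void neighborhood_sum(const double *x, double *blocksum, int n)
  {
    __shared__ double s[128];                /* memory shared inside the neighborhood */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? x[i] : 0.0;
    __syncthreads();
    for (int stride = blockDim.x/2; stride > 0; stride /= 2) {  /* tree reduction */
      if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
      __syncthreads();
    }
    if (threadIdx.x == 0) blocksum[blockIdx.x] = s[0];
  }
  /* launched as neighborhood_sum<<<nblocks, 128>>>(x, blocksum, n);
     128 plays the role of the OpenCL work-group size, the "vector length" above */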


>   For daxpy() the syntax doesn't really matter; anything will do. For other
> kernels (maxpy, sparse matrix-vector product, triangular solves, P*A*Pt,
> mesh operations, sorts, indirect access, ...) the choice of syntax likely
> matters a great deal. We should test the syntax out on a wide range of
> kernels. For example, look at VecMAXPY_kernel vs VecMAXPY_Seq vs
> VecMAXPY_VecCUSPMAXPY4, and the three delegators VecMAXPY_SeqPThread,
> VecMAXPY_MPI, and VecMAXPY_SeqCUSP.
>

For things as simple as BLAS, you can go all the way to a streaming-type
kernel (Thrust, PyCUDA, etc.). I think this is basically solved. I am more
interested in kernels like the FD residual (ask Jed), the FEM residual, and
the FEM Jacobian. I promise to finish this paper by mid-December.
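
For reference, a streaming-type maxpy really is this short; an illustrative
sketch only, not how VecMAXPY_SeqCUSP is actually written (alpha and the
array of vector pointers are assumed to already live in device memory):

  /* y[i] += sum_j alpha[j]*x[j][i], one thread per entry, nv small */
  __global__ void maxpy(double *y, const double *alpha,
                        const double * const *x, int nv, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = y[i];
    for (int j = 0; j < nv; j++) sum += alpha[j]*x[j][i];
    y[i] = sum;
  }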


>   How does data layout relate to the vectorization you are going to do on
> that data, and vice versa?


That is the crux. Vectorization is about execution layout (the CUDA thread
grid). Somehow we must also lay out memory and match the two up. This is all
of programming (which is why I initially liked the stuff from Victor).
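
An illustrative sketch of the tension (the blocked layout and the block size
bs here are hypothetical, just to show the access pattern): the same axpy is
coalesced or strided depending on whether the memory layout follows the
thread layout.

  /* Interleaved layout: consecutive threads touch consecutive words. */
  __global__ void axpy_interleaved(double *y, const double *x, double a, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];          /* thread i -> word i: coalesced */
  }
  /* Blocked layout of block size bs: the same loop strides through memory. */
  __global__ void axpy_blocked(double *v, double a, int n, int bs)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) v[i*bs+1] += a*v[i*bs];        /* thread i -> word i*bs: strided */
  }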

   Matt


>
>   Barry


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener