On Thu, Nov 24, 2011 at 6:37 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im"><br>

On Nov 24, 2011, at 4:41 PM, Matthew Knepley wrote:<br>

<br>

> On Thu, Nov 24, 2011 at 4:09 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>

><br>

>   Jed,<br>

><br>

>   Let's stop arguing about whether MPI is or is not a good base for the next generation of HPC software but instead start a new conversation on what API (implemented on top of or not on top of MPI/pthreads etc etc) we want to build PETSc on to scale PETSc up to millions of cores with large NUMA nodes and GPU like accelerators.<br>


><br>

>    What do you want in the API?<br>

><br>

> Let's start with the "lowest" level, or at least the smallest. I think the only sane way to program for portable performance here<br>

> is using CUDA-type vectorization. This SIMT style is explained well here <a href="http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html" target="_blank">http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html</a><br>


> I think this is much easier and more portable than the intrinsics for Intel, and more performant and less error prone than threads.<br>

> I think you can show that it will accomplish anything we want to do. OpenCL seems to have capitulated on this point. Do we agree<br>

> here?<br>

<br>

</div>   What syntax do you suggest for writing the code that is "vectorized"?  What tools exist, could exist, for mapping from that syntax to what is needed by the various compilers/hardware?<br></blockquote><div>

<br></div><div>Some history. Brook is the basis for CUDA, but like any good foundation, almost everything the creator thought was important was thrown away</div><div>to make something usable. Brook is a streaming language, much like Thrust. When problems fit this paradigm, it is fantastic. However, CUDA is</div>

<div>not a streaming language. The programmer decides exactly what memory to put where, so what does it actually provide?</div><div><br></div><div>Both CUDA and OpenCL inherit the process grid. Of course, MPI already has this (rank), so what is different? All threads in a vector have access</div>

<div>to shared memory. I guess you could get the same thing in MPI if you had an idea of a "neighborhood" which had shared memory. This is exactly</div><div>how OpenCL handles it. You specify a "vector length" (our neighborhood size) and the compiler tries its ass off to vectorize the code. In fact, you</div>

<div>could probably manage everything I want with subcommunicators and a runtime code generator.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

   For daxpy() the syntax doesn't really matter, anything will do. For other kernels: maxpy, sparse matrix vector product, triangular solves, P*A*Pt, mesh operations, sorts, indirect access .... the choice of syntax likely matters a great deal. We should test the syntax out on a wide range of kernels. For example, look at VecMAXPY_kernel vs VecMAXPY_Seq vs VecMAXPY_VecCUSPMAXPY4 and the three delegators VecMAXPY_SeqPThread, VecMAXPY_MPI, and VecMAXPY_SeqCUSP.<br>

</blockquote><div><br></div><div>For things as easy as BLAS, you can go all the way to the streaming-type kernel (Thrust, PyCUDA, etc). I think this is basically</div><div>solved. I am more interested in kernels like FD residual (ask Jed), FEM residual, and FEM Jacobian. I promise to finish this paper</div>

<div>by mid-December.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

   How does data layout relate to the vectorization you are going to do on that data and vice versa?</blockquote><div><br></div><div>That is the crux. Vectorization is about execution layout (the CUDA thread grid). Somehow we must also lay out memory and</div>

<div>match them up. This is all of programming (which is why I initially liked the stuff from Victor).</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<font color="#888888"><br>

   Barry<br>

</font><div><div></div><div class="h5"><br>

><br>

>    Matt<br>

><br>

><br>

>   Barry<br>

><br>

><br>

><br>

><br>

> --<br>

> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

> -- Norbert Wiener<br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>