On Thu, Nov 24, 2011 at 4:09 PM, Barry Smith <span dir="ltr">&lt;<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

   Jed,<br>

<br>

   Let's stop arguing about whether MPI is or is not a good base for the next generation of HPC software, and instead start a new conversation on what API (implemented on top of MPI/pthreads etc. or not) we want to build PETSc on, in order to scale PETSc up to millions of cores with large NUMA nodes and GPU-like accelerators.<br>


<br>

    What do you want in the API?</blockquote><div><br></div><div>Let's start with the "lowest" level, or at least the smallest. I think the only sane way to program for portable performance here is CUDA-type vectorization. This SIMT style is explained well here: <a href="http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html">http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html</a></div>
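<div>To make the SIMT idea concrete, here is a minimal sketch in plain C rather than CUDA itself. The kernel is scalar code written for one logical thread, and a sequential loop stands in for the hardware that would run those threads across vector lanes. The names simt_index, axpy_kernel and launch_axpy are hypothetical, made up for this illustration; in CUDA the index would come from blockIdx.x * blockDim.x + threadIdx.x.</div>
<pre>
```c
/* Hypothetical stand-in for the indices CUDA supplies to each thread
   (blockIdx.x, blockDim.x, threadIdx.x). */
typedef struct {
    unsigned long block_idx;
    unsigned long block_dim;
    unsigned long thread_idx;
} simt_index;

/* Kernel: scalar code for ONE logical thread. There are no intrinsics
   and no explicit vector width anywhere in the body. */
static void axpy_kernel(simt_index id, unsigned long n, double a,
                        const double *x, double *y)
{
    unsigned long i = id.block_idx * id.block_dim + id.thread_idx;
    if (i < n) {            /* guard: the last block may be only partly used */
        y[i] = a * x[i] + y[i];
    }
}

/* Sequential emulation of a kernel launch (an assumption for this sketch).
   Real SIMT hardware would execute the threads of a block in groups
   across its lanes instead of one after another. */
static void launch_axpy(unsigned long block_dim, unsigned long n, double a,
                        const double *x, double *y)
{
    unsigned long num_blocks, b, t;

    if (block_dim == 0) {
        return;             /* nothing sensible to launch */
    }
    num_blocks = (n + block_dim - 1) / block_dim;   /* round up */
    for (b = 0; b < num_blocks; ++b) {
        for (t = 0; t < block_dim; ++t) {
            simt_index id = { b, block_dim, t };
            axpy_kernel(id, n, a, x, y);
        }
    }
}
```
</pre>
<div>Calling launch_axpy(4, 5, 2.0, x, y) covers five entries with two blocks of four logical threads; the guard in the kernel makes the three surplus threads of the second block do nothing. The kernel body never mentions a vector width, which is the portability argument being made here.</div>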

<div>I think this is much easier and more portable than Intel's SIMD intrinsics, and more performant and less error-prone than threads. I think you can show that it will accomplish anything we want to do. OpenCL seems to have capitulated on this point. Do we agree here?</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><font color="#888888"><br>

   Barry<br>

<br>

</font></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>