<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Mar 14, 2017 at 10:52 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> writes:<br>

<br>

> On Mon, Mar 13, 2017 at 8:08 PM, Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>

>><br>

>> Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> writes:<br>

>><br>

>> > OpenMP did not prevent OpenCL,<br>

>><br>

>> This programming model isn't really intended for architectures with<br>

>> persistent caches.<br>

>><br>

><br>

> It's not clear to me how much this should matter in a good implementation.<br>

> The lack of implementation effort for OpenCL on cache-coherent CPU<br>

> architectures appears to be a more significant issue.<br>

<br>

</span>How do you keep data resident in cache between kernel launches?<br>

<span class=""><br>

>> > C11, C++11<br>

>><br>

>> These are basically pthreads, which predates OpenMP.<br>

>><br>

><br>

> I'm not sure why it matters which one came first.  POSIX standardized<br>

> threads in 1995, while OpenMP was first standardized in 1997.  However, the<br>

> first serious Pthreads implementation in Linux was in 2003.<br>

<br>

</span>And the first serious OpenMP on OS X was when?<br>

<span class=""><br>

> OpenMP standardized the best practices identified in Kuck, SGI and<br>

> Cray directives, just like POSIX presumably standardized best<br>

> practices in OS threads from various Unix implementations.<br>

><br>

> C++11 and beyond have concurrency features beyond just threads.  You<br>

> probably hate all of them because they are C++, and in any case I won't<br>

> argue, because I don't see anything that's implemented better<br>

><br>

>><br>

>> > or Fortran 2008<br>

>><br>

>> A different language and doesn't play well with others.<br>

>><br>

><br>

> Sure, but you could use Fortran 2003 features to interoperate between C and<br>

> Fortran if you wanted to leverage Fortran 2008 concurrency features in an<br>

> ISO-compliant way.  I'm not suggesting you want to do this, but I dispute<br>

> the suggestion that Fortran does not play nice with C.<br>

<br>

</span>I think the above qualifies as not playing nicely in this context.<br>

<span class=""><br>

> Fortran coarray images are OS processes in every implementation I know,<br>

> although the standard does not explicitly require this implementation.  The<br>

> situation is identical to that of MPI, although there are actually MPI<br>

> implementations based upon OS threads rather than OS processes (and they<br>

> require compiler or OS magic to deal with non-heap data).<br>

><br>

> Both of the widely available Fortran coarray implementations use MPI-3 RMA<br>

> under the hood and all of the ones I know about define an image to be an OS<br>

> process.<br>

<br>

</span>Are you trying to sell PETSc on MPI?<br>

<span class=""><br>

>> > from introducing parallelism. Not sure if your comment was meant to be<br>

>> > serious,<br>

>><br>

>> Partially.  It was just enough to give the appearance of a solution<br>

>> while not really being a solution.<br>

>><br>

><br>

> It still isn't clear what you actually want.  You appear to reject every<br>

> standard API for enabling explicit vectorization for CPU execution<br>

> (Fortran, OpenMP, OpenCL), which suggests that (1) you do not believe in<br>

> vectorization, (2) you think that autovectorizing compilers are sufficient,<br>

> (3) you think vector code is necessarily a non-portable software construct,<br>

> or (4) you do not think vectorization is relevant to PETSc.<br>

<br>

</span>OpenMP is strictly about vectorization with nothing to do with threads<br>

and MPI is sufficient?  I don't have a problem with that, but will<br>

probably stick to attributes and intrinsics instead of omp simd, at<br></blockquote><div><br></div><div>1) and 4) are clearly wrong. After decades of work, 2) seems practically wrong.</div><div><br></div><div>This looks like a 3) answer to me, and it's hard to argue with. OpenCL could have been a nice</div><div>way to get arch-independent vectorization, but the implementations suck badly.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

least until it matures and demonstrates feature parity.<br>

<br>

Have you tried writing a BLIS microkernel using omp simd?  Is it any<br>

good?<br>

</blockquote></div><br>Yep.</div><div class="gmail_extra"><br></div><div class="gmail_extra">   Matt<br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>