<div dir="ltr">On Mon, Mar 13, 2017 at 8:08 PM, Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>><br>> Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> writes:<br>><br>> > OpenMP did not prevent OpenCL,<br>><br>> This programming model isn't really intended for architectures with<br>> persistent caches.<br>><br><br>It's not clear to me how much this should matter in a good implementation.  The lack of implementation effort for OpenCL on cache-coherent CPU architectures appears to be a more significant issue.<br> <br>><br>> > C11, C++11<br>><br>> These are basically pthreads, which predates OpenMP.<br>><br><br>I'm not sure why it matters which one came first.  POSIX standardized threads in 1995, while OpenMP was first standardized in 1997.  However, the first serious Pthreads implementation in Linux was in 2003.  OpenMP standardized the best practices identified in Kuck, SGI and Cray directives, just like POSIX presumably standardized best practices in OS threads from various Unix implementations.<br><br>C++11 and beyond have concurrency features beyond just threads.  You probably hate all of them because they are C++, and in any case I won't argue, because I don't see anything that's implemented better.<br> <br>><br>> > or Fortran 2008<br>><br>> A different language and doesn't play well with others.<br>><br><br>Sure, but you could use Fortran 2003 features to interoperate between C and Fortran if you wanted to leverage Fortran 2008 concurrency features in an ISO-compliant way.  I'm not suggesting you want to do this, but I dispute the suggestion that Fortran does not play nice with C.<br><br>Fortran coarray images are OS processes in every implementation I know of, although the standard does not explicitly require this.  
The situation is identical to that of MPI, although there are actually MPI implementations based upon OS threads rather than OS processes (and they require compiler or OS magic to deal with non-heap data).<br><br>Both of the widely available Fortran coarray implementations use MPI-3 RMA under the hood and all of the ones I know about define an image to be an OS process.<br> <br>> > from introducing parallelism. Not sure if your comment was meant to be<br>> > serious,<br>><br>> Partially.  It was just enough to give the appearance of a solution<br>> while not really being a solution.<br>><br><br>It still isn't clear what you actually want.  You appear to reject every standard API for enabling explicit vectorization for CPU execution (Fortran, OpenMP, OpenCL), which suggests that (1) you do not believe in vectorization, (2) you think that autovectorizing compilers are sufficient, (3) you think vector code is necessarily a non-portable software construct, or (4) you do not think vectorization is relevant to PETSc.<div><br>Jeff<br> <br>><br>> > but it appears unfounded nonetheless.<br>> ><br>> > Jeff<br>> ><br>> > On Sun, Mar 12, 2017 at 11:16 AM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>> ><br>> >> Implementation-defined, but it's exactly the same as malloc, which also<br>> >> doesn't promise unfaulted pages. This is one reason some of us keep saying<br>> >> that OpenMP sucks. 
It's a shitty standard that obstructs better standards<br>> >> from being created.<br>> >><br>> >><br>> >> On March 12, 2017 11:19:49 AM MDT, Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>><br>> >> wrote:<br>> >><br>> >><br>> >> On Sat, Mar 11, 2017 at 9:00 AM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br>> >><br>> >> Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> writes:<br>> >> > I agree 100% that multithreaded codes that fault pages from the main<br>> >> thread in a NUMA environment are doing something wrong ;-)<br>> >> ><br>> >> > Does calloc *guarantee* pages are not mapped? If I calloc(8), do I get<br>> >> the zero page or part of the arena that's already mapped that is zeroed by<br>> >> the heap manager?<br>> >><br>> >> Is your argument that calloc() should never be used in multi-threaded code?<br>> >><br>> >><br>> >> I never use it for code that I want to behave well in a NUMA environment.<br>> >><br>> >><br>> >> If the allocation is larger than MMAP_THRESHOLD (128 KiB by default for<br>> >> glibc) then it calls mmap.  This obviously leaves an intermediate size<br>> >> that could be poorly mapped (assuming 4 KiB pages), but it's also so<br>> >> small that it easily fits in cache.<br>> >><br>> >><br>> >> Is this behavior standardized or merely implementation-defined? 
I'm not<br>> >> interested in writing code that assumes Linux/glibc.<br>> >><br>> >> Jeff<br>> >><br>> >><br>> >> --<br>> >> Jeff Hammond<br>> >> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>> >> <a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a><br>> >><br>> >> --<br>> > Jeff Hammond<br>> > <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>> > <a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a><br><br><br><br><br>--<br>Jeff Hammond<br><a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a></div></div>