<div class="gmail_quote">On Sun, Jun 19, 2011 at 03:33, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
> No, Cray does not provide any threaded BLAS 1. Generally speaking, it is not worth threading a single nested loop unless the trip count is very high, and generally that does not happen often enough to warrant the special BLAS. In fact, I am not even sure we OMP BLAS 2; I don't think so.

This shouldn't be surprising. It's worth noting that inside multigrid (or a surface-volume split problem), the same vector operation will be called with quite different sizes. Managing the granularity so that threads are used when they will actually be faster, but not when the size is too small to offset the startup cost, is not trivial.
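
As a minimal sketch of what that granularity management might look like (the cutoff value and the function name here are hypothetical, not anything PETSc or a vendor BLAS actually ships), OpenMP's if clause lets a loop fall back to serial execution below a size threshold:

/* Hypothetical cutoff; the right value depends on the thread startup
   cost relative to memory bandwidth and must be tuned per machine. */
#define VEC_THREAD_CUTOFF 10000L

/* y <- alpha*x + y, threaded only when the trip count is large enough
   to amortize the parallel-region startup cost. */
void axpy(long n, double alpha, const double *x, double *y)
{
#pragma omp parallel for if (n > VEC_THREAD_CUTOFF) schedule(static)
  for (long i = 0; i < n; i++)
    y[i] += alpha * x[i];
}

Even with such a cutoff, picking the value is the hard part: in multigrid the same code path sees everything from the finest level down to the coarsest.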

A related matter that I keep harping on is that the memory hierarchy is very non-uniform. In the old days it was reasonably uniform within a socket, but some of the latest hardware has multiple dies within a socket, each with more-or-less independent memory buses.

Of course you can always move MPI down to finer granularity (e.g. one MPI process per die instead of one per socket). I think this is a good solution for many applications, and may perform better than threads for reasonably large subdomains (mostly because memory affinity can be reliably managed), but it is less appealing in the strong-scaling limit.
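
Concretely, that placement is just a launcher option. The spelling below is Open MPI's (MPICH's hydra has analogous -bind-to flags, and ./app is a stand-in for your executable):

mpiexec -n 16 --map-by numa --bind-to numa ./app

Because each rank then allocates out of its own NUMA domain, affinity comes for free from the OS, with no cooperation needed from a threading runtime.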

I have yet to see a threading library/language that offers a good platform for bandwidth-constrained strong scaling. The existing solutions tend to base everything on the absurd assumption that parallel computation is about parallelizing the computation. In reality, certainly for the sorts of problems most PETSc users care about, everything hard about parallelism is communicating what needs to be communicated while retaining good data locality for what doesn't need to be communicated.

It's an easy local optimization to get high performance out of a local kernel. In contrast, optimizing for data locality often involves major data-structure and algorithm changes.

The current systems all seem to be really bad at this: data locality is sometimes provided implicitly, but it is ultimately very fragile. The number of recent threading papers that report less than 30% of peak hardware bandwidth for STREAM is not inspiring. (Rather few authors actually state the hardware peak, but if you look up the specs, it's really rare to see something respectable, e.g. 80%.)
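
For reference, the ratio I mean is just measured triad bandwidth over the spec-sheet number; a minimal sketch (the 25.6 GB/s peak is a made-up placeholder you would replace with your machine's figure):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000L  /* large enough that the arrays cannot live in cache */

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double peak_gbs = 25.6;  /* placeholder: take this from your spec sheet */

  /* First-touch initialization so pages land near the threads that use them. */
#pragma omp parallel for schedule(static)
  for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  double t = omp_get_wtime();
#pragma omp parallel for schedule(static)
  for (long i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* STREAM triad */
  t = omp_get_wtime() - t;

  /* Triad touches three arrays per iteration: two reads and one write. */
  double gbs = 3.0 * N * sizeof(double) / t / 1e9;
  printf("triad: %.1f GB/s = %.0f%% of stated peak\n", gbs, 100.0 * gbs / peak_gbs);
  free(a); free(b); free(c);
  return 0;
}

A paper that reports the percentage, not just the raw GB/s, makes it immediately obvious whether the locality story holds up.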