<div class="gmail_quote">On Sun, Jun 19, 2011 at 21:39, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div id=":15n">Huhh? VecDot() {if n is >> big use 2 threads else use 1} I don't see why that is hard?<br></div></blockquote><div><br></div><div>VecMAXPY() when some vectors were faulted with different affinity. Most any use of VecPlaceArray(). Any bubbling of threads to a higher level (e.g. if all thread dispatch is not strictly done at the finest level of granularity). Client code that uses a different affinity during residual evaluation. Matrix preallocation with variation in row length. Index sets have different sizes than vectors.</div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":15n"><div class="im">
> A related matter that I keep harping on is that the memory hierarchy is very non-uniform. In the old days, it was reasonably uniform within a socket, but some of the latest hardware has multiple dies within a socket, each with more-or-less independent memory buses.<br>
<br>
</div> So what is the numa.h you've been using. If we allocate vector arrays and matrix arrays then does that give you the locality?<br></div></blockquote><div><br></div><div>That lets you specify explicitly at allocation time how you want the memory mapped. This can be achieved, more-or-less, by spawning a suitable number of OpenMP (or other paradigm) threads, making sure the OS/environment was configured so that they will have the affinity you desire, partitioning their work load as you want, and faulting the memory.</div>
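For the explicit-allocation route, a minimal sketch of what that looks like with libnuma (Linux only, link with -lnuma; the node choices here are arbitrary, purely for illustration):

  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    size_t bytes = (size_t)1 << 26;
    int    nodes = numa_max_node() + 1;

    /* Place half the array on node 0 and half on the last node, instead of
       relying on which thread happens to fault each page. */
    double *lo = numa_alloc_onnode(bytes/2, 0);
    double *hi = numa_alloc_onnode(bytes/2, nodes - 1);

    /* ... fill and use lo/hi from threads pinned to the matching nodes ... */

    numa_free(lo, bytes/2);
    numa_free(hi, bytes/2);
    return 0;
  }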
But numa.h also has primitives to move the physical pages backing memory you have already allocated, e.g. numa_move_pages(), and to query where existing memory is mapped. If every platform supported libnuma (it's Linux-only), I think we would be a lot better off: we could build a slightly higher-level abstraction on top of libnuma and get predictable, debuggable mapping of memory.
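A sketch of both uses in one helper (the function name is made up, error handling is minimal):

  #include <numa.h>      /* numa_move_pages() */
  #include <numaif.h>    /* MPOL_MF_MOVE */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>    /* sysconf() */

  /* Ask the kernel to migrate the pages backing buf[0..bytes) to 'node',
     then report any page that did not end up there.  buf is assumed to be
     page-aligned (e.g. from numa_alloc_*() or posix_memalign()).  Passing
     nodes=NULL instead would only query the current placement. */
  static int move_buffer_to_node(void *buf, size_t bytes, int node)
  {
    long   pagesize = sysconf(_SC_PAGESIZE);
    size_t npages   = (bytes + pagesize - 1)/pagesize;
    void **pages    = malloc(npages*sizeof(void*));
    int   *nodes    = malloc(npages*sizeof(int));
    int   *status   = malloc(npages*sizeof(int));
    size_t i;
    long   err;

    for (i = 0; i < npages; i++) {
      pages[i] = (char*)buf + i*pagesize;
      nodes[i] = node;
    }
    err = numa_move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE); /* pid 0 = this process */
    for (i = 0; i < npages && !err; i++) {
      if (status[i] != node) printf("page %zu is on node %d\n", i, status[i]);
    }
    free(pages); free(nodes); free(status);
    return err ? -1 : 0;
  }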
One option is to experiment with building that higher-level abstraction on libnuma, with a default implementation that does something less reliable on platforms without it (i.e. everything that isn't Linux). Primitives like numa_move_pages() simply have no equivalent there, so the fallback would have to do nothing and eat the performance consequences.
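Roughly like this, building on the move_buffer_to_node() sketch above (same hypothetical file; the wrapper name and PETSC_HAVE_NUMA_H are hypothetical, the latter being the ./configure flag discussed below):

  #include <stddef.h>
  #if defined(PETSC_HAVE_NUMA_H)
  #include <numa.h>
  #include <numaif.h>
  #endif

  /* Hypothetical wrapper: request that the pages of buf live on 'node'.
     With libnuma this actually migrates the pages; elsewhere it is a
     no-op and we live with whatever placement first touch produced. */
  int PetscMemoryBindToNode(void *buf, size_t bytes, int node)
  {
  #if defined(PETSC_HAVE_NUMA_H)
    /* pages[]/nodes[] setup and numa_move_pages() call, as sketched above */
    return move_buffer_to_node(buf, bytes, node);
  #else
    (void)buf; (void)bytes; (void)node;   /* nothing reliable we can do here */
    return 0;
  #endif
  }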
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":15n">
BTW: If it doesn't do it yet, ./configure needs to check for numa.h and do PETSC_HAVE_NUMA_H</div></blockquote></div><br><div>It doesn't, but I agree.</div>
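The test itself is tiny; something configure could compile and link with -lnuma, defining PETSC_HAVE_NUMA_H on success (how it gets wired into BuildSystem is a separate question):

  #include <numa.h>

  int main(void)
  {
    return numa_available() < 0;   /* links against libnuma and checks it is usable */
  }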