On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <span dir="ltr">&lt;<a href="mailto:john.fettig@gmail.com" target="_blank">john.fettig@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":510">What I see in your results is about 7x speedup by using 16 threads.  I<br>

think you should get better results by running 8 threads with 2<br>

processes because the memory can be allocated on separate memory<br>

controllers, and the memory will be physically closer to the cores.<br>

I'm surprised that you get worse results.<br></div></blockquote><div><br></div><div>Our intent is for the threads to use an explicit first-touch policy, so that each thread gets local memory even when the threads span multiple NUMA zones.</div>
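<div><br></div><div>A minimal sketch of what first touch means in practice, assuming OpenMP, pinned threads (for example via OMP_PROC_BIND), and the default Linux local-allocation policy. This is an illustration only, not the PETSc implementation, and the names alloc_first_touch and axpy_same_partition are made up for the example. The point is that the loop that first writes the array and the loops that later use it share the same static partition.</div>

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and initialize them in a statically scheduled
 * parallel loop. For a large, fresh allocation, Linux places each page
 * on the NUMA node of the thread that first writes to it (assuming the
 * default local-allocation policy and pinned threads). */
double *alloc_first_touch(size_t n, double value)
{
    double *x = malloc(n * sizeof *x);  /* pages are not placed yet */
    if (!x) return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = value;                   /* first write decides placement */

    return x;
}

/* y <- y + a*x with the same static schedule, so each thread works on
 * the chunk that it initialized and therefore reads local memory. */
void axpy_same_partition(size_t n, double a, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        y[i] += a * x[i];
}
```

<div>Without OpenMP enabled the pragmas are ignored and the code runs serially with the same result. If the array were instead initialized by a single thread, all of its pages would end up on that thread's NUMA node and the other threads would pay for remote accesses.</div>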

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":510">

It doesn't surprise me that an explicit code gets much better speedup.<br></div></blockquote><div><br></div><div>The explicit code is limited much less by memory bandwidth and more by floating-point throughput, so extra threads help it more.</div><div> </div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":510">

<div class="im"><br>

> I also get about the same performance results on the ex2 problem when running it with just<br>

> mpi alone i.e. with 16 mpi processes.<br>

><br>

> So from my perspective, the new pthreads/openmp support is looking pretty good assuming<br>

> the issue with the MKL/external packages interaction can be fixed.<br>

><br>

> I was just using jacobi preconditioning for ex2.  I'm wondering if there are any other preconditioners<br>

> that might be multi-threaded.  Or maybe a polynomial preconditioner could work well for the<br>

> new pthreads/openmp support.<br>

<br>

</div>GAMG with SOR smoothing seems like a prime candidate for threading.  I<br>

wonder if anybody has worked on this yet?<br></div></blockquote></div><br><div>SOR is not a great candidate because it is inherently sequential. Block Jacobi with SOR on each block parallelizes fine, but does not guarantee stability without additional (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well with threads (though not all of the kernels are ready yet).</div>
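<div><br></div><div>To make the contrast concrete, here is a rough sketch of a Jacobi-preconditioned Chebyshev smoother for the 1D Laplacian tridiag(-1, 2, -1), written matrix-free with OpenMP. Every loop is a row-wise operator application or a pointwise vector update, so there is no dependence between rows as there is in an SOR sweep. The function names, the model operator, and the eigenvalue bounds passed in (lmin, lmax for D^{-1}A) are assumptions for illustration; this is not the PETSc kernel.</div>

```c
#include <stddef.h>
#include <stdlib.h>

/* y = A x for A = tridiag(-1, 2, -1). Each row reads only x, never y,
 * so all rows can be computed concurrently. */
void laplace1d_apply(size_t n, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        double left  = (i > 0) ? x[i - 1] : 0.0;
        double right = (i + 1 < (long)n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

/* Sum of squares of v (squared 2-norm), computed with a reduction. */
double sum_squares(size_t n, const double *v)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) schedule(static)
    for (long i = 0; i < (long)n; i++)
        s += v[i] * v[i];
    return s;
}

/* Chebyshev iteration preconditioned by Jacobi (diagonal of A is 2).
 * lmin and lmax bound the part of the spectrum of D^{-1}A to damp.
 * Updates x in place; returns 0 on success, 1 on allocation failure. */
int chebyshev_jacobi(size_t n, const double *b, double *x, int its,
                     double lmin, double lmax)
{
    const double diag  = 2.0;
    const double theta = 0.5 * (lmax + lmin);
    const double delta = 0.5 * (lmax - lmin);
    const double sigma = theta / delta;
    double rho = 1.0 / sigma;

    double *r  = malloc(n * sizeof *r);
    double *d  = malloc(n * sizeof *d);
    double *Ad = malloc(n * sizeof *Ad);
    if (!r || !d || !Ad) { free(r); free(d); free(Ad); return 1; }

    /* r = b - A x,  d = (1/theta) D^{-1} r */
    laplace1d_apply(n, x, Ad);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        r[i] = b[i] - Ad[i];
        d[i] = (r[i] / diag) / theta;
    }

    for (int k = 0; k < its; k++) {
        laplace1d_apply(n, d, Ad);
        const double rho_new = 1.0 / (2.0 * sigma - rho);
        /* Fused pointwise updates: x += d, r -= A d, then the new d. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; i++) {
            x[i] += d[i];
            r[i] -= Ad[i];
            d[i] = rho_new * rho * d[i]
                 + (2.0 * rho_new / delta) * (r[i] / diag);
        }
        rho = rho_new;
    }

    free(r); free(d); free(Ad);
    return 0;
}
```

<div>For example, calling chebyshev_jacobi with b = 0, an oscillatory initial guess, and the assumed bounds lmin = 0.2, lmax = 2.0 should shrink the error after a few iterations. A Gauss-Seidel/SOR sweep would instead update x in place, with row i depending on the already-updated rows before it, which is why it does not thread directly.</div>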

<div><br></div><div>Coarsening and the Galerkin triple product are more difficult to thread.</div>