Thanks Eric and Rob.<br><br>Indeed! Was MKL_DYNAMIC left at its default (true)? Using one thread per core (sequential MKL) looks like the right baseline.<br>For a fixed number of cores, I would expect the hybrid case (#cores = num_mpi_processes * num_mkl_threads) to perform no better than the pure-MPI case (#cores = num_mpi_processes), unless some cache effects come into play. (I'm not sure which; I would expect the MKL installation to weed these issues out.)<br>
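<br>To make the core-count bookkeeping concrete, here is a minimal sketch of the check I have in mind. The function name <code>layout</code> and its return format are made up for illustration; only the environment variable names MKL_NUM_THREADS and MKL_DYNAMIC come from MKL itself, and setting MKL_DYNAMIC to false is just one possible choice for a controlled comparison.<br>

```python
def layout(total_cores, num_mpi_processes, num_mkl_threads):
    """Describe a hybrid MPI + threaded-MKL run on a fixed number of cores.

    Hypothetical helper for illustration only; not part of PETSc or MKL.
    """
    # Each MPI process may spawn up to num_mkl_threads MKL threads.
    threads_requested = num_mpi_processes * num_mkl_threads
    return {
        "threads_requested": threads_requested,
        # More threads than cores means the threads compete for cores.
        "oversubscribed": threads_requested > total_cores,
        # Environment one might export for every MPI process.
        # MKL_DYNAMIC=false asks MKL not to adjust the thread count itself.
        "env": {
            "MKL_NUM_THREADS": str(num_mkl_threads),
            "MKL_DYNAMIC": "false",
        },
    }


if __name__ == "__main__":
    # Pure MPI baseline: one process per core, sequential MKL.
    print(layout(total_cores=12, num_mpi_processes=12, num_mkl_threads=1))
    # Hybrid with the same core budget: 6 processes x 2 threads.
    print(layout(total_cores=12, num_mpi_processes=6, num_mkl_threads=2))
    # One process per core AND 2 MKL threads each: oversubscribed.
    print(layout(total_cores=12, num_mpi_processes=12, num_mkl_threads=2))
```

The third case is the one Jed is asking about: an MPI process on every core plus threaded MKL requests twice as many threads as there are cores.<br>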

<div class="gmail_quote"><br><br>P.S.:<br>Out of curiosity, have you also tested your app on Nehalem? Is there any difference between Nehalem and Westmere at similar bandwidth?<br><br>On Tue, Mar 15, 2011 at 4:35 PM, Jed Brown <span dir="ltr">&lt;<a href="mailto:jed@59a2.org">jed@59a2.org</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im"><div class="gmail_quote">On Tue, Mar 15, 2011 at 22:30, Robert Ellis <span dir="ltr">&lt;<a href="mailto:Robert.Ellis@geosoft.com" target="_blank">Robert.Ellis@geosoft.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Regardless of setting the number of threads for MKL or OMP, the MKL performance was worse than simply using --download-f-blas-lapack=1.</blockquote></div><br></div><div>Interesting. Does this statement include using just one thread, perhaps with a non-threaded MKL? Also, when you used threading, were you putting an MPI process on every core or were you making sure that you had enough cores for num_mpi_processes * num_mkl_threads?</div>


</blockquote></div><br>