Hello Rob,<br>     Thanks for the update; this might be very valuable for other developers out there!<br>I am still a little surprised by the performance of MKL. I use it for a variety of problems and haven&#39;t come across a situation where it penalized performance. Maybe there are more experienced developers here who have seen otherwise. Have you tried ATLAS by any chance?<br>

<br>As far as MPI performance is concerned, I wouldn&#39;t be very surprised if Intel MPI turns out to be faster than MPICH2. All of this is of course fabric dependent, but that has been my experience for a variety of problems. Also, I think Intel MPI is a customized version of MPICH2, so that may not be a big surprise.<br>

<br> There are very experienced people in this forum who might be able to say otherwise and give more accurate answers.<br><br>Cheers,<br><br>C.S.N<br><br><br><div class="gmail_quote">On Wed, Mar 16, 2011 at 4:22 PM, Robert Ellis <span dir="ltr">&lt;<a href="mailto:Robert.Ellis@geosoft.com">Robert.Ellis@geosoft.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div link="blue" vlink="purple" lang="EN-CA">

<div>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Hi All,</span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"> </span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">For those still interested in this thread: timing tests indicate that sequential MKL performs approximately the same as parallel MKL with NUM_THREADS=1, which isn&#39;t too surprising. What is a bit surprising is that MKL always, at least for this application, performs significantly slower than the BLAS/LAPACK compiled directly from source via --download-f-blas-lapack=1. My conclusion is that if your code is written with explicit parallelization, in this case using PETSc, and fully utilizes your hardware, using sophisticated libraries may actually harm performance. Keep it simple!</span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"> </span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Now a question: all my tests used MPICH2. Does anyone think using Intel MPI would significantly improve the performance of MKL with PETSc?</span></p>


<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"> </span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Cheers,</span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Rob</span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"> </span></p>

<div>

<div style="border-style: solid none none; border-color: rgb(181, 196, 223) -moz-use-text-color -moz-use-text-color; border-width: 1pt medium medium; padding: 3pt 0in 0in;">

<p class="MsoNormal"><b><span style="font-size: 10pt;" lang="EN-US">From:</span></b><span style="font-size: 10pt;" lang="EN-US"> <a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank">petsc-users-bounces@mcs.anl.gov</a> [mailto:<a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank">petsc-users-bounces@mcs.anl.gov</a>]

<b>On Behalf Of </b>Rob Ellis<br>

<b>Sent:</b> Tuesday, March 15, 2011 3:33 PM<br>

<b>To:</b> &#39;PETSc users list&#39;<div class="im"><br>

<b>Subject:</b> Re: [petsc-users] Building with MKL 10.3</div></span></p>

</div>

</div>

<p class="MsoNormal"> </p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Yes, MKL_DYNAMIC was set to true. No, I haven&#39;t tested on Nehalem. I&#39;m currently comparing sequential MKL with --download-f-blas-lapack=1.</span></p>


<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">Rob</span></p>

<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"> </span></p>

<div style="border-style: solid none none; border-color: rgb(181, 196, 223) -moz-use-text-color -moz-use-text-color; border-width: 1pt medium medium; padding: 3pt 0in 0in;">

<p class="MsoNormal"><b><span style="font-size: 10pt;" lang="EN-US">From:</span></b><span style="font-size: 10pt;" lang="EN-US"> <a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank">petsc-users-bounces@mcs.anl.gov</a> [mailto:<a href="mailto:petsc-users-bounces@mcs.anl.gov" target="_blank">petsc-users-bounces@mcs.anl.gov</a>]

<b>On Behalf Of </b>Natarajan CS<div class="im"><br>

<b>Sent:</b> Tuesday, March 15, 2011 3:20 PM<br>

<b>To:</b> PETSc users list<br>

<b>Cc:</b> Robert Ellis<br>

</div><div class="im"><b>Subject:</b> Re: [petsc-users] Building with MKL 10.3</div></span></p>

</div>

<p class="MsoNormal"> </p>

<p class="MsoNormal">Thanks Eric and Rob.</p><div><div></div><div class="h5"><br>

<br>

Indeed! Was MKL_DYNAMIC set to its default (true)? It looks like using 1 thread per core (sequential MKL) is the right thing to do as a baseline.<br>

 I would expect the performance with #cores = num_mpi_processes * num_mkl_threads to be &lt;= that of the #cores = num_mpi_processes case (number of cores held constant), unless some cache effects come into play (I am not sure which; I would think the MKL installation should weed these issues out).</div></div><div><div></div><div class="h5">

<div>

<p class="MsoNormal"><br>

<br>

P.S :<br>

Out of curiosity, have you also tested your app on Nehalem? Is there any difference between Nehalem and Westmere for similar bandwidth?<br>

<br>

On Tue, Mar 15, 2011 at 4:35 PM, Jed Brown &lt;<a href="mailto:jed@59a2.org" target="_blank">jed@59a2.org</a>&gt; wrote:</p>

<div>

<div>

<p class="MsoNormal">On Tue, Mar 15, 2011 at 22:30, Robert Ellis &lt;<a href="mailto:Robert.Ellis@geosoft.com" target="_blank">Robert.Ellis@geosoft.com</a>&gt; wrote:</p>

<p class="MsoNormal">Regardless of setting the number of threads for MKL or OMP, the MKL performance was worse than simply using --download-f-blas-lapack=1.</p>

</div>

<p class="MsoNormal"> </p>

</div>

<div>

<p class="MsoNormal">Interesting. Does this statement include using just one thread, perhaps with a non-threaded MKL? Also, when you used threading, were you putting an MPI process on every core or were you making sure that you had enough cores for num_mpi_processes

 * num_mkl_threads?</p>
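<p class="MsoNormal">The core-count question above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names threads_per_rank and is_oversubscribed are made up here and belong to neither PETSc nor MKL, and os.cpu_count() reports logical CPUs, which may exceed the number of physical cores when hyper-threading is enabled.</p>

<pre>
```python
import os


def threads_per_rank(cores, num_mpi_processes):
    """Largest per-rank thread count that fits the ranks on the given cores.

    Hypothetical helper for illustration; not part of PETSc or MKL.
    """
    if not (min(cores, num_mpi_processes) >= 1):
        raise ValueError("cores and num_mpi_processes must be positive")
    # Integer division, but never 0: every MPI rank needs at least one thread.
    return max(1, cores // num_mpi_processes)


def is_oversubscribed(cores, num_mpi_processes, num_mkl_threads):
    """True when num_mpi_processes * num_mkl_threads exceeds the core count."""
    return num_mpi_processes * num_mkl_threads > cores


if __name__ == "__main__":
    # Logical CPU count of this machine; treat it as an upper bound on cores.
    cores = os.cpu_count() or 1
    for ranks in (1, 2, 4, 8):
        threads = threads_per_rank(cores, ranks)
        print(ranks, threads, is_oversubscribed(cores, ranks, threads))
```
</pre>

<p class="MsoNormal">On a hypothetical 12-core node, for instance, 4 MPI processes leave room for 3 MKL threads each, whereas 12 processes with 2 threads each would oversubscribe the cores. Note that threads_per_rank never returns less than 1, so a run with more ranks than cores is still reported as oversubscribed.</p>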

</div>

</div>

<p class="MsoNormal"> </p>

</div></div></div>

</div>


</blockquote></div><br>