[petsc-users] Building with MKL 10.3

Robert Ellis Robert.Ellis at geosoft.com
Wed Mar 16 16:22:11 CDT 2011


Hi All,

For those still interested in this thread, timing tests with MKL indicate that sequential MKL performs approximately the same as parallel MKL with NUM_THREADS=1, which isn't too surprising. What is a bit surprising is that MKL is always, at least for this application, significantly slower than the reference BLAS/LAPACK built via --download-f-blas-lapack=1. My conclusion is that if your code is explicitly parallelized, in this case using PETSc, and already fully utilizes your hardware, adding a sophisticated threaded library may actually harm performance. Keep it simple!
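
For anyone who wants to repeat the comparison, a rough sketch of the two builds (paths and options here are illustrative, not the exact configure lines):

    # Build A: reference BLAS/LAPACK, downloaded and compiled by PETSc
    ./configure --download-f-blas-lapack=1

    # Build B: MKL, pointing PETSc's configure at the MKL install directory
    ./configure --with-blas-lapack-dir=/opt/intel/mkl

    # With the threaded (parallel) MKL, the thread count can be pinned at run time:
    export MKL_NUM_THREADS=1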

Now a question: all my tests used MPICH2. Does anyone think using Intel MPI would significantly improve the performance of MKL with PETSc?

Cheers,
Rob

From: petsc-users-bounces at mcs.anl.gov [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Rob Ellis
Sent: Tuesday, March 15, 2011 3:33 PM
To: 'PETSc users list'
Subject: Re: [petsc-users] Building with MKL 10.3

Yes, MKL_DYNAMIC was set to true. No, I haven't tested on Nehalem. I'm currently comparing sequential MKL with --download-f-blas-lapack=1.
Rob

From: petsc-users-bounces at mcs.anl.gov [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Natarajan CS
Sent: Tuesday, March 15, 2011 3:20 PM
To: PETSc users list
Cc: Robert Ellis
Subject: Re: [petsc-users] Building with MKL 10.3

Thanks Eric and Rob.

Indeed! Was MKL_DYNAMIC set to its default (true)? It looks like using one thread per core (sequential MKL) is the right thing to do as a baseline.
I would think that, at a constant core count, the #cores = num_mpi_processes * num_mkl_threads case might perform no better than the #cores = num_mpi_processes case, unless some cache effects come into play (I'm not sure which; I would think the MKL installation should weed those issues out). A concrete example of the two layouts is sketched below.
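
On a hypothetical 8-core node, the two layouts would be launched roughly like this (numbers are illustrative, and ./app stands in for the application):

    # Pure MPI with sequential BLAS: 8 processes x 1 MKL thread = 8 cores
    export MKL_NUM_THREADS=1
    mpiexec -n 8 ./app

    # Hybrid at the same core count: 4 processes x 2 MKL threads = 8 cores
    export MKL_DYNAMIC=FALSE   # ask MKL not to adjust the thread count on its own
    export MKL_NUM_THREADS=2
    mpiexec -n 4 ./app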


P.S :
Out of curiosity, have you also tested your app on Nehalem? Any difference between Nehalem and Westmere at similar bandwidth?

On Tue, Mar 15, 2011 at 4:35 PM, Jed Brown <jed at 59a2.org> wrote:
On Tue, Mar 15, 2011 at 22:30, Robert Ellis <Robert.Ellis at geosoft.com> wrote:
Regardless of setting the number of threads for MKL or OMP, the MKL performance was worse than simply using --download-f-blas-lapack=1.

Interesting. Does this statement include using just one thread, perhaps with a non-threaded MKL? Also, when you used threading, were you putting an MPI process on every core or were you making sure that you had enough cores for num_mpi_processes * num_mkl_threads?
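
In other words, on an 8-core node the two situations look like this (hypothetical numbers; ./app stands in for the application):

    # Oversubscribed: an MPI process on every core, each also spawning MKL threads
    export MKL_NUM_THREADS=4
    mpiexec -n 8 ./app    # 8 x 4 = 32 threads contending for 8 cores

    # Reserved: enough cores for num_mpi_processes * num_mkl_threads
    export MKL_NUM_THREADS=4
    mpiexec -n 2 ./app    # 2 x 4 = 8 threads on 8 cores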
