[petsc-users] Building with MKL 10.3

Aron Ahmadia aron.ahmadia at kaust.edu.sa
Thu Mar 17 03:24:49 CDT 2011


I am very suspicious of any results where the Fortran reference BLAS is
out-performing MKL.

A

On Thu, Mar 17, 2011 at 3:24 AM, Natarajan CS <csnataraj at gmail.com> wrote:
> Hello Rob,
>      Thanks for the update; this might be very valuable for other developers
> out there!
> I am still a little surprised by the MKL results. I use it for a variety of
> problems and haven't come across a situation where it penalized performance.
> Maybe more experienced developers out here have seen otherwise. Have you
> tried ATLAS, by any chance?
>
> As far as MPI performance is concerned, I wouldn't be very surprised if
> Intel MPI turns out to be faster than MPICH2. All this is of course fabric
> dependent, but that has been my experience for a variety of problems.
> Also, I think Intel MPI is a customized version of MPICH2, so that may not
> be a big surprise.
>
>  There are very experienced people on this forum who may be able to say
> otherwise and give more accurate answers.
>
> Cheers,
>
> C.S.N
>
>
> On Wed, Mar 16, 2011 at 4:22 PM, Robert Ellis <Robert.Ellis at geosoft.com>
> wrote:
>>
>> Hi All,
>>
>>
>>
>> For those still interested in this thread, timing tests with MKL indicate
>> that sequential MKL performs approximately the same as parallel MKL with
>> NUM_THREADS=1, which isn't too surprising. What is a bit surprising is that
>> MKL always, at least for this application, gives significantly slower
>> performance than the Fortran BLAS/LAPACK compiled directly via
>> --download-f-blas-lapack=1. My conclusion is that if your code is written
>> with explicit parallelization, in this case using PETSc, and fully utilizes
>> your hardware, using sophisticated libraries may actually harm performance.
>> Keep it simple!
>>
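>> For reference, the two kinds of builds being compared look roughly like
>> this (a sketch only; the MKL directory is just a placeholder for your own
>> install):
>>
>>   # reference Fortran BLAS/LAPACK, compiled by PETSc itself
>>   ./configure --download-f-blas-lapack=1
>>
>>   # BLAS/LAPACK taken from an existing MKL 10.3 installation
>>   ./configure --with-blas-lapack-dir=/opt/intel/mkl
>>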
>>
>>
>> Now a question: all my tests used MPICH2. Does anyone think using Intel
>> MPI would significantly improve the performance of MKL with PETSc?
>>
>>
>>
>> Cheers,
>>
>> Rob
>>
>>
>>
>> From: petsc-users-bounces at mcs.anl.gov
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Rob Ellis
>> Sent: Tuesday, March 15, 2011 3:33 PM
>> To: 'PETSc users list'
>>
>> Subject: Re: [petsc-users] Building with MKL 10.3
>>
>>
>>
>> Yes, MKL_DYNAMIC was set to true. No, I haven't tested on Nehalem. I'm
>> currently comparing sequential MKL with --download-f-blas-lapack=1.
>>
>> Rob
>>
>>
>>
>> From: petsc-users-bounces at mcs.anl.gov
>> [mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Natarajan CS
>>
>> Sent: Tuesday, March 15, 2011 3:20 PM
>> To: PETSc users list
>> Cc: Robert Ellis
>> Subject: Re: [petsc-users] Building with MKL 10.3
>>
>>
>>
>> Thanks Eric and Rob.
>>
>> Indeed! Was MKL_DYNAMIC left at its default (true)? It looks like using 1
>> thread per core (sequential MKL) is the right thing to do as a baseline.
>>  With the total core count held constant, I would think that the
>> num_mpi_processes * num_mkl_threads = #cores case might perform no better
>> than the num_mpi_processes = #cores case, unless some cache effects come
>> into play (not sure which; I would think the MKL installation should weed
>> those issues out).
>>
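>> To illustrate the baseline I have in mind (a sketch only, assuming a bash
>> shell, an 8-core node, and an executable called ./app):
>>
>>   # baseline: one MPI process per core, MKL kept to a single thread
>>   export MKL_DYNAMIC=FALSE
>>   export MKL_NUM_THREADS=1
>>   mpiexec -n 8 ./app
>>
>>   # threaded MKL: keep num_mpi_processes * num_mkl_threads = #cores
>>   export MKL_NUM_THREADS=4
>>   mpiexec -n 2 ./app
>>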
>> P.S.:
>> Out of curiosity, have you also tested your app on Nehalem? Any difference
>> between Nehalem and Westmere for similar memory bandwidth?
>>
>> On Tue, Mar 15, 2011 at 4:35 PM, Jed Brown <jed at 59a2.org> wrote:
>>
>> On Tue, Mar 15, 2011 at 22:30, Robert Ellis <Robert.Ellis at geosoft.com>
>> wrote:
>>
>> Regardless of setting the number of threads for MKL or OMP, the MKL
>> performance was worse than simply using --download-f-blas-lapack=1.
>>
>>
>>
>> Interesting. Does this statement include using just one thread, perhaps
>> with a non-threaded MKL? Also, when you used threading, were you putting an
>> MPI process on every core or were you making sure that you had enough cores
>> for num_mpi_processes * num_mkl_threads?
>>
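>> (For context, "non-threaded MKL" here means linking the sequential MKL
>> layer instead of the OpenMP-threaded one. Typical MKL 10.3 link lines on
>> Linux/Intel 64 look roughly like the following; exact paths and library
>> groupings vary by system:)
>>
>>   # threaded MKL (spawns OpenMP threads inside BLAS/LAPACK calls)
>>   -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
>>
>>   # sequential MKL (no threading inside BLAS/LAPACK)
>>   -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
>>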
>>
>

