Dual core performance estimate
balay at mcs.anl.gov
Sun Nov 18 20:03:27 CST 2007
On Sun, 18 Nov 2007, Gideon Simpson wrote:
> I asked the original question, and I have a follow up. Like it or not,
> multi-core CPUs have been thrust upon us by the manufacturers and many of us
> are more likely to have access to a shared memory, multi core/multi processor
> machine, than a properly built cluster with MPI in mind.
Sure they are here to stay.
> 1. How feasible would it be to implement OpenMP in PETSc so that
> multi core CPUs could be properly used?
> 2. Even if we are building a cluster, it looks like AMD/Intel are thrusting
> multi core upon us. To that end, what is the feasibility of merging MPI and
> OpenMP so that between nodes, we use MPI, but within each node, OpenMP is used
> to take advantage of the multiple cores.
You are missing the point of the previous e-mails on this topic. The
point was: when trying to understand the performance one gets on
single vs. dual core, one should investigate memory bandwidth
behavior.
With sparse matrix operations, memory bandwidth is the primary
determining factor. So if you split the same amount of memory
bandwidth between 2 processors, you split the performance between
them as well.
Memory bandwidth affects both OpenMP & MPI. It's not as if memory
bandwidth is an MPI-only issue [and OpenMP somehow avoids this
problem]. So the inference "MPI is not suitable for multi-core, but
OpenMP is suitable" is incorrect [if performance is limited by memory
bandwidth, it is limited under either programming model].
So our suggestion is: be aware of this issue when analysing the
performance you get. One way to look at it is: performance per
dollar. Since the second core is practically free, even a 5%
improvement [in a 1-core vs 2-core run] is a good investment. [There
could be other parts of the application that are not memory-bandwidth
limited and that benefit from the extra core.]
Note-1: when folks compare MPI performance vs OpenMP, or when
referring to mixed OpenMP/MPI code, they are sometimes mixing 2 things:
- implementation difference [OpenMP communication could be implemented
  better than MPI communication on some machines]
- algorithmic difference [for e.g.: on a 4-way SMP, the MPI
  implementation might use bjacobi with num_blocks=4, while the OpenMP
  one just unrolls a direct-solver Fortran subroutine]
We feel that the first one is an implementation issue, and MPI should
do the right thing. Wrt the second one, mixed OpenMP/MPI mode is more
of an algorithmic issue [generally a 2-level algorithm]. The same
2-level algorithm implemented with MPI/MPI should have similar
performance. PETSc currently has some support for this with
"-pc_type openmp".
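For illustration, both variants of the algorithmic choice above can be selected from the command line at run time. The executable name (./ex2) and process counts here are placeholders; -pc_type bjacobi and -pc_bjacobi_blocks are standard PETSc options, and -pc_type openmp is the experimental preconditioner mentioned above, available only in PETSc versions of this era:

```shell
# MPI-only on a 4-way SMP node: 4 ranks, block Jacobi, one block per rank.
mpiexec -n 4 ./ex2 -ksp_type gmres -pc_type bjacobi -pc_bjacobi_blocks 4

# Mixed-mode analogue: one MPI rank on the node, with the node-local
# solve handled by the OpenMP preconditioner.
mpiexec -n 1 ./ex2 -ksp_type gmres -pc_type openmp
```

Either way, the same 2-level structure is expressed; the memory-bandwidth constraint on the node applies equally to both runs.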
Note-2: So multi-core hardware is the future; how does one fully
exploit it? I guess one has to look at alternative algorithms that are
not memory-bandwidth limited, or that can somehow reduce the memory
bandwidth requirement by doing extra computation. [Perhaps new
research work? Sorry, I don't know more on this topic.]