Dual core performance estimate
balay at mcs.anl.gov
Sun Nov 18 20:03:27 CST 2007
On Sun, 18 Nov 2007, Gideon Simpson wrote:
> I asked the original question, and I have a follow up. Like it or not,
> multi-core CPUs have been thrust upon us by the manufacturers and many of us
> are more likely to have access to a shared memory, multi core/multi processor
> machine, than a properly built cluster with MPI in mind.
Sure they are here to stay.
> 1. How feasible would it be to implement OpenMP in PETSc so that
> multi core CPUs could be properly used?
> 2. Even if we are building a cluster, it looks like AMD/Intel are thrusting
> multi core upon us. To that end, what is the feasibility of merging MPI and
> OpenMP so that between nodes, we use MPI, but within each node, OpenMP is used
> to take advantage of the multiple cores.
You are missing the point of the previous e-mails on this topic. The
point was: when trying to understand the performance one gets on
single vs. dual core, one should investigate memory bandwidth
behavior.
With sparse matrix operations, memory bandwidth is the primary
determining factor. So if you split the same amount of memory
bandwidth between 2 processors, you split the performance between
them as well.
Memory bandwidth affects both OpenMP & MPI. It's not as if memory
bandwidth is an MPI-only issue [and OpenMP somehow avoids this
problem]. So the inference "MPI is not suitable for multi-core, but
OpenMP is suitable" is incorrect [if performance is limited by memory
bandwidth, it is limited under either programming model].
So our suggestion is: be aware of this issue when analysing the
performance you get. One way to look at it is: performance per
dollar. Since the second core is practically free, even a 5%
improvement [in a 1-core vs 2-core run] is a good investment. [There
could be other parts of the application that are not memory-bandwidth
limited and that benefit from the extra core.]
Note-1: when folks compare MPI performance vs OpenMP, or when
referring to mixed OpenMP/MPI code, they are sometimes mixing 2 things:
- implementation difference [OpenMP communication could be implemented
  better than MPI communication on some machines]
- algorithmic difference [for e.g.: on a 4-way SMP, the MPI
  implementation might use bjacobi with num_blocks=4, while the OpenMP
  one just unrolls a direct-solver Fortran subroutine]
We feel that the first one is an implementation issue, and MPI should
do the right thing. Wrt the second one, mixed OpenMP/MPI mode is more
of an algorithmic issue [generally a 2-level algorithm]. The same
2-level algorithm implemented with MPI/MPI should have similar
performance. PETSc currently has some support for this with
"-pc_type openmp".
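For illustration, both variants of the algorithmic choice above can be selected from the command line at run time. The executable name (./ex2) and process counts here are placeholders; -pc_type bjacobi and -pc_bjacobi_blocks are standard PETSc options, and -pc_type openmp is the experimental preconditioner mentioned above, available only in PETSc versions of this era:

```shell
# MPI-only on a 4-way SMP node: 4 ranks, block Jacobi, one block per rank.
mpiexec -n 4 ./ex2 -ksp_type gmres -pc_type bjacobi -pc_bjacobi_blocks 4

# Mixed-mode analogue: one MPI rank on the node, with the node-local
# solve handled by the OpenMP preconditioner.
mpiexec -n 1 ./ex2 -ksp_type gmres -pc_type openmp
```

Either way, the same 2-level structure is expressed; the memory-bandwidth constraint on the node applies equally to both runs.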
Note-2: So multi-core hardware is the future; how does one fully
exploit it? I guess one has to look at alternative algorithms that are
not memory-bandwidth limited, or that can somehow reduce the memory
bandwidth requirement by doing extra computation. [Perhaps new
research work? Sorry, I don't know more on this topic.]