[mpich-discuss] Why is my quad core slower than cluster

Brian Dobbins bdobbins at gmail.com
Mon Jul 14 22:42:30 CDT 2008


Hi everyone,

  I'll echo the sentiments expressed by Tan and a few others that the
culprit here, at least for Gaetano's code, is probably memory
bandwidth.  The FDTD applications I've seen tend to be bandwidth-hungry, and
the current Intel quad cores are not very good at this, especially those
with the 1333 MHz FSB such as the E5345.  Some of the newer models support
1600 MHz FSB speeds and tend to deliver better results... for example,
using the SPEC fp_rate benchmarks (and specifically that of GemsFDTD) as a
really rough approximation to parallel performance, an E5450 at 3.0 GHz
scores 29.0 vs. 36.7 for the E5472 processor.  Same chip, but faster memory,
resulting in 26.5% better performance overall.  [Note, I used the Supermicro
results for this, but you can find similar results for any vendor, I
imagine.]
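
  If you want to see the effect yourself, a quick-and-dirty way is to run a
bandwidth-bound kernel on 1, 2 and 4 cores of the box and watch the per-core
numbers drop as the FSB saturates.  Something along the lines of the sketch
below would do -- it's not STREAM or GemsFDTD, just a rough triad loop, and
the array size is only a guess at something big enough to fall out of cache:

/* triad.c -- a rough, STREAM-style triad to gauge memory bandwidth
 * per MPI rank.  Not an official benchmark; N is just a guess at
 * something large enough to spill out of cache.                   */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N    (20 * 1000 * 1000)
#define REPS 20

int main(int argc, char **argv)
{
    int rank, size, r;
    long i;
    double *a, *b, *c, t0, t1, mbs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    a = malloc(N * sizeof(double));
    b = malloc(N * sizeof(double));
    c = malloc(N * sizeof(double));
    for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);           /* start everyone together */
    t0 = MPI_Wtime();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* triad: 2 loads, 1 store */
    t1 = MPI_Wtime();

    /* roughly 3 doubles of memory traffic per iteration */
    mbs = (double)REPS * N * 3 * sizeof(double) / (t1 - t0) / 1.0e6;
    printf("rank %d of %d: %.0f MB/s (check %.1f)\n",
           rank, size, mbs, a[0] + a[N - 1]);

    MPI_Finalize();
    return 0;
}

  Build it with mpicc and run it with 'mpiexec -n 1 ./triad', then '-n 2'
and '-n 4' on the same node; on a bandwidth-starved box the per-rank MB/s
will fall well short of staying constant.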

  This isn't limited to FDTD codes, either... CFD-like applications are
notoriously bandwidth-hungry, and using SPEC scores again, WRF gives us 51.9
vs. 70.2 for the same configurations above.  Again, these are the same
processors, same compilers, and same tests, done by people who definitely
like to eke the utmost performance out of their rigs, and it shows that
simply adding faster memory improves things measurably... 35% in this case.
Since fp_rate isn't really the same as parallel performance, though, let's
switch to some first-hand measurements - I was recently running some WRF
models on a system here using E5440 (2.83 GHz) processors, and here are the
results:

  Running on 128 cores as 16 nodes x 8 cores per node:  22.8 seconds / step
             128 cores as 32 nodes x 4 cores per node:  11.9 seconds / step
             128 cores as 64 nodes x 2 cores per node:  10.4 seconds / step

  ... As you can see, the 'best' performance comes from using only 2 cores
per node!  In fact, I was able to run on 16 nodes and 4 cores per node, for
a total of 64 cores, and it was only 2% slower than running on 128 cores (as
16 x 8).  I didn't fiddle with task placement since MPI and the OS are
*generally* pretty smart about such things, and the data points towards
memory bandwidth being a key issue anyway.  Hopefully having some of these
'hard numbers' can ease your burden so you don't go crazy trying to find a
reason in MPICH, the OS, etc.  ;)
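
  (If you want to experiment with the same sort of spreading-out on your own
cluster, the details depend on your launcher and batch system, so take this
as a rough sketch: with MPICH2's mpd-based mpiexec, one way is a machinefile
that lists each node once per rank you want placed on it.  The node names
below are just placeholders for whatever your nodes are called.)

# hosts -- hypothetical node names; two ranks per node
node01
node01
node02
node02
...

mpiexec -machinefile hosts -n 128 ./wrf.exe

  Your scheduler may well have its own way of expressing the same thing, so
check its docs rather than trusting my syntax.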

  Put another way, if you're concerned that your quad-core box isn't working
properly via MPI, you can write (or download) a small code that does
something that is *not* limited by memory bandwidth - generating random
numbers, for example - and run it; you *should* see a fully linear speedup
(within the constraints of Amdahl's Law).  There's a sketch of such a test
after the list below.  So the only recommendations I'd make are:

  1) Use an up-to-date MPI implementation (such as MPICH2 1.0.7) and OS
since they'll probably be 'smarter' about task placement than older versions
  2) Try using the Intel compilers if you haven't done so already, since
they tend to be superior to gfortran (and often to gcc as well)
  3) If you're buying hardware soon, look at the (more expensive) 1600 MHz
FSB boards / chips from Intel.
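
  And here's the kind of compute-bound test I had in mind above -- each rank
just churns through random numbers with essentially no memory traffic, so
the per-rank time should stay roughly flat whether you run 1 or 4 ranks on
the quad core.  Take it as a sketch rather than a rigorous benchmark:

/* rngtest.c -- a compute-bound sanity check: each rank generates
 * pseudo-random numbers (almost no memory traffic), so the time per
 * rank should stay roughly flat from 1 to 4 ranks on a quad core.  */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define COUNT 50000000L   /* random numbers per rank */

int main(int argc, char **argv)
{
    int rank, size;
    long i;
    unsigned int seed;
    double sum = 0.0, total, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    seed = 1234u + rank;                  /* different stream per rank */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < COUNT; i++)
        sum += (double)rand_r(&seed) / RAND_MAX;
    t1 = MPI_Wtime();

    /* Combine the partial sums so the work can't be optimized away. */
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    printf("rank %d of %d: %.2f seconds\n", rank, size, t1 - t0);
    if (rank == 0)
        printf("sum = %.1f (ignore; just keeps the loop honest)\n", total);

    MPI_Finalize();
    return 0;
}

  Build it with mpicc and compare 'mpiexec -n 1 ./rngtest' against
'mpiexec -n 4 ./rngtest'; if the per-rank times balloon for a test like
this, *then* it's worth suspecting the MPI setup or the OS rather than
memory bandwidth.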

  Hope that's useful!

  Cheers,
  - Brian


Brian Dobbins
Yale Engineering HPC