[mpich-discuss] Scalability of Intel quad core (Harpertown) cluster

Brian Dobbins bdobbins at gmail.com
Fri Mar 28 07:55:59 CDT 2008


Hi,

  I don't use the Cactus code myself, but from what little I *do* know of
it, this might not be unexpected.  For starters, what do you mean by 'bad
scalability'?  I believe most (all?) benchmark cases demonstrate what is
called 'weak scaling' - that is, the problem size increases along with the
number of processors.  So, running on 1 processor gives you a wall-clock
time of *n* seconds, and running on 2 processors will probably give you a
wall-clock time of *n* + <some small number> seconds.  That small number is
the communication time of your code.  Thus, running on 80 cores *will* be
slightly slower than running on 1, but it'll let you run a much larger system.

  (To clarify: unless you're specifically configuring a constant problem
size, you won't reduce your time to solution just by adding processors.)
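
  To make the weak-scaling idea concrete, here's a toy sketch of my own in
C with MPI - not Cactus code: every rank does the same fixed amount of
local work, so the global problem grows with the rank count, and any growth
in wall-clock time as you add ranks is essentially communication overhead.

    #include <mpi.h>
    #include <stdio.h>

    /* Fixed amount of work per rank: the global problem size scales
     * with the number of ranks, as in a weak-scaling benchmark. */
    #define LOCAL_N 1000000L

    int main(int argc, char **argv)
    {
        int rank, size;
        long i;
        double local = 0.0, global, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        t0 = MPI_Wtime();

        /* Stand-in for the per-rank computation. */
        for (i = 0; i < LOCAL_N; i++)
            local += (double)i * 1e-9;

        /* Stand-in for the communication step (e.g. a reduction). */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        t1 = MPI_Wtime();
        if (rank == 0)
            printf("%d ranks, global work %ld, time %.3f s\n",
                   size, (long)size * LOCAL_N, t1 - t0);

        MPI_Finalize();
        return 0;
    }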

  The next thing to consider is which benchmark you're using, as some of
them are more scalable than others.  You're likely to get different results
when looking at the 'Whisky_Carpet' benchmark vs. the 'BSSN_PUGH' one.  You
might wish to take a look at the benchmark database at the Cactus website,
and there are some PDF files with more information, too, including a
master's thesis on benchmark performance.

  Finally, a few other slightly more technical things to consider are:

(1) What kind of Harpertowns are you using?  Looking at the 5450 vs. the
5472 (both 3.0 GHz chips), the latter has more memory bandwidth, and may
scale better since the code does appear to make use of it.  Using the
CPU2006 fp_rate CactusADM benchmarks as a first approximation to parallel
performance, the SPEC website shows that a 5450 goes from a score of 101 to
73.1 when going from 1 -> 8 cores (and larger is better here - this is
throughput, not wall-clock time), while the 5472 goes from 112 -> 84.8.  Why
does this matter?  Well, you'll probably get different results running an
8-core job as 8x1, 4x2, or 1x8 (cores x nodes), and this will impact your
benchmark results somewhat.  (A quick way to see where your ranks actually
land is the placement check sketched after this list.)

(2) I think the code supports running with both MPI and OpenMP.  I don't
know whether there will be any performance difference between running 1 MPI
process per node with 8 OpenMP threads and simply using 8 MPI processes, but
it might be worth looking into (see the hybrid skeleton after this list).

(3) Again, I have no first-hand knowledge of the code's performance under
different interconnects, but it *does* seem likely to make a difference.
Chances are that if you ask on the Cactus forums, someone with first-hand
experience could give you more specific information.  (For a rough feel for
your own interconnect, a simple ping-pong test like the one sketched after
this list can help.)
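
  For point (1), here's a quick placement check - a minimal sketch of my
own, not part of Cactus - that just prints which host each MPI rank landed
on, so you can confirm whether an 8-core run is actually laid out as 8x1,
4x2, or 1x8:

    #include <mpi.h>
    #include <stdio.h>

    /* Prints the host each rank runs on, to verify process placement. */
    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        printf("rank %d on %s\n", rank, host);
        MPI_Finalize();
        return 0;
    }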
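
  For point (2), a minimal hybrid MPI+OpenMP skeleton (again, my own toy
example; how Cactus itself does this may differ) looks something like the
following.  MPI_THREAD_FUNNELED is enough if only the master thread makes
MPI calls:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, provided;

        /* Ask for funneled threading: only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* e.g. run 1 rank per node with OMP_NUM_THREADS=8 */
        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Compile with something like 'mpicc -fopenmp' and set OMP_NUM_THREADS to the
number of cores per node; the exact flags depend on your compiler.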
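
  And for point (3), one rough way to compare interconnects yourself is a
simple ping-pong microbenchmark between two ranks - run it once with both
ranks on the same node and once across two nodes (this is just a sketch;
the message size and iteration count are arbitrary choices of mine):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define NITER     1000
    #define MSG_BYTES (64 * 1024)

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf[MSG_BYTES];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }
        memset(buf, 0, sizeof(buf));

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* Average round-trip time for one message of MSG_BYTES. */
        if (rank == 0)
            printf("%d-byte round trip: %.2f us average\n",
                   MSG_BYTES, (t1 - t0) / NITER * 1e6);

        MPI_Finalize();
        return 0;
    }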

  Hope that helps, and if any of it isn't clear, I'll be happy to try to
clarify.  Good luck!

  Cheers,
  - Brian