[mpich-discuss] Scalability of Intel quad core (Harpertown) cluster

Hee Il Kim heeilkim at gmail.com
Fri Mar 28 12:21:52 CDT 2008


Thanks all,

The Cactus code has a good scalability, especially with the latest version
of Carpet it shows a good scalabitly over 5000 cpu cores(?). I tested both
BSSN+PUGH and Whisky+Carpet benchmarks.  Not an expert, I'm depending on the
timing info shown by Cactus rather than use profiling tools introduced by
Pavan. The profiling info says that most of communication time was taken to
enforce boundary conditions. The total wall clock time (including
communication time) increases from ~ 700sec (1CPU) to ~1500 sec (64CPU)
whereas computation time only increases ~600 to ~800 sec. Here the problem
sizes were take to be proportional to the number of cpus.  So now I'm
looking for the ways to reduce the communication time.
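
For reference, this is roughly how I understand the timing split: a minimal
sketch using MPI_Wtime(), not Cactus's own timer infrastructure.
evolve_interior() and exchange_boundaries() below are placeholder stubs I made
up, not Carpet routines.

#include <mpi.h>
#include <stdio.h>

static void evolve_interior(void)
{
    /* local grid update would go here */
}

static void exchange_boundaries(void)
{
    /* stand-in only: a real code would exchange ghost zones here */
    MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int rank, step;
    double t_comp = 0.0, t_comm = 0.0, t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 0; step < 100; ++step) {
        t0 = MPI_Wtime();
        evolve_interior();              /* pure computation            */
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        exchange_boundaries();          /* communication / sync only   */
        t_comm += MPI_Wtime() - t0;
    }

    printf("rank %d: computation %.3f s, communication %.3f s\n",
           rank, t_comp, t_comm);
    MPI_Finalize();
    return 0;
}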

I'm using Harpertown 5420s (2.5 GHz). What disappoints me even more is that
this newest Xeon cluster is not much better than my old Pentium D 930 cluster
(3.0 GHz), which has 4 nodes (8 cores). I tested various (node# x cpu#)
combinations, and the results depend somewhat on the combination.
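
For anyone who wants to reproduce the layouts, a small standalone MPI program
(nothing Cactus-specific) can report which host each rank lands on, so the
(node# x cpu#) placement of a run can be verified:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    /* one line per rank: which node did it end up on? */
    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}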

A hybrid run using the "-openmp" option of the Intel compilers made things
worse and broke the load balancing. The optimization options (even -O2) also
made the runs slower, but did not break the load balancing.
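
For what it's worth, a minimal hybrid test like the one below (again, not
Cactus code, just my own sketch) can at least show how many OpenMP threads
each MPI rank actually gets; the result depends on OMP_NUM_THREADS and the
process layout, which are assumptions on my side. Compile with the compiler's
OpenMP flag (e.g. "-openmp" for the Intel compilers).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided, nthreads = 1;

    /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("rank %d: thread support level %d, %d OpenMP threads\n",
           rank, provided, nthreads);

    MPI_Finalize();
    return 0;
}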

I checked the bandwidth behavior mentioned by Elvedin. Is there a way to
change the message size and frequency at runtime, or through any other means?
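
To see where the interconnect saturates, I also looked at a generic two-rank
ping-pong like the sketch below (not tied to Cactus or to Elvedin's
measurement), which reports bandwidth as a function of message size. Run it
with exactly two ranks, ideally placed on two different nodes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, reps = 100;
    long size;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* sweep message sizes from 1 KB to 4 MB */
    for (size = 1024; size <= 4L * 1024 * 1024; size *= 2) {
        buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)   /* 2*size bytes move per round trip */
            printf("%8ld bytes: %.1f MB/s\n", size,
                   2.0 * size * reps / (t1 - t0) / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}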

I have no idea how to improve the scalability, or how serious the problem is.
Either way, it's a bit unsatisfactory at the moment, and I hope I can find a
better way from here. I appreciate all your kind comments and suggestions.

Regards

Kim, Hee Il



2008/3/28, Brian Dobbins <bdobbins at gmail.com>:
>
> Hi,
>
>   I don't use the Cactus code myself, but from what little I *do* know of
> it, this might not be unexpected.  For starters, what do you mean by 'bad
> scalability'?  I believe most (all?) benchmark cases demonstrate what is
> called 'weak scaling' - that is, the problem size increases along with the
> number of processors.  So, running on 1 processor gives you a wall-clock
> time of *n* seconds, and running on 2 processors will probably give you a
> wall-clock time of roughly *n*+<some small number>.  That small number is the
> communication time of your code.  Thus, running on 80 cores *will* be
> slower than running on 1, but it'll let you run a much larger system.
>
>   (To clarify, unless you're specifically configuring a constant problem
> size, you won't reduce your time to solution by increasing your processors.)
>
>   The next thing to consider is which benchmark you're using, as some of
> them are more scalable than others.  You're likely to get different results
> when looking at the 'Whisky_Carpet' benchmark vs. the 'BSSN_PUGH' one.  You
> might wish to take a look at the benchmark database at the Cactus website,
> and there are some PDF files with more information, too, including a
> master's thesis on benchmark performance.
>
>   Finally, a few other slightly more technical things to consider are:
>
> (1) What kind of Harpertowns are you using?  Looking at the 5450 vs. the
> 5472 (both 3.0 Ghz chips), the latter has more memory bandwidth, and may
> scale better since the code does appear to make use of it.  Using the
> CPU2006fp_rate CactusADM benchmarks as a first approximation to parallel
> performance, the SPEC website shows that a 5450 gets a score of 101 and
> 73.1 when going from 1 -> 8 cores (and larger is better here - this is
> throughput, not wall-clock time), and the 5472 goes from 112 -> 84.8.  Why
> does this matter?  Well, you'll probably get different results running an
> 8-core job when running that as 8x1, 4x2, or 1x8 (cores x nodes).   This
> will impact your benchmark results somewhat.
>
>  (2) I think the code supports running with MPI and OpenMP... I don't know
> if there will be any difference in performance if you choose to run 1 MPI
> process per node with 8 OpenMP threads vs. simply using 8 MPI processes, but
> it might be worth looking into.
>
>  (3) Again, I have no first-hand knowledge of the code's performance under
> different interconnects, but it *does* seem likely to make a difference...
> chances are if you asked on the Cactus forums, there might be someone with
> first-hand experience with this who could give you some more specific
> information.
>
>   Hope that helps, and if any of it isn't clear, I'll be happy to try to
> clarify.  Good luck!
>
>   Cheers,
>   - Brian
>
>