[mpich-discuss] Scalability of Intel quad core (Harpertown) cluster

Pavan Balaji balaji at mcs.anl.gov
Fri Mar 28 13:18:09 CDT 2008


Hee,

Can you send us this code? I'm interested in seeing what is causing the 
communication time to go up so much.

  -- Pavan

On 03/28/2008 12:21 PM, Hee Il Kim wrote:
> Thanks all,
> 
> The Cactus code has good scalability; with the latest version of 
> Carpet it reportedly scales well beyond 5000 CPU cores. I tested both 
> the BSSN_PUGH and Whisky_Carpet benchmarks. Not being an expert, I'm 
> relying on the timing info reported by Cactus rather than the 
> profiling tools Pavan suggested. That timing info says most of the 
> communication time is spent enforcing boundary conditions. The total 
> wall-clock time (including communication) increases from ~700 sec 
> (1 CPU) to ~1500 sec (64 CPUs), whereas the computation time only 
> increases from ~600 to ~800 sec. Here the problem sizes were taken to 
> be proportional to the number of CPUs. So now I'm looking for ways to 
> reduce the communication time.
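> 
> (For reference, the split above comes from Cactus's own timers.  A 
> generic way to get the same kind of computation/communication split is 
> to bracket the local update and the ghost-zone exchange with 
> MPI_Wtime(), as in the stand-alone sketch below -- this is not Cactus 
> code, just an illustration of the idea:)
> 
>   #include <mpi.h>
>   #include <stdio.h>
> 
>   #define N 100000          /* points per rank (weak scaling) */
> 
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);
>       int rank, size;
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>       static double u[N + 2];                  /* interior + 2 ghost cells */
>       int lo = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
>       int hi = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
> 
>       double t_comp = 0.0, t_comm = 0.0;
>       for (int step = 0; step < 100; step++) {
>           double t0 = MPI_Wtime();
>           for (int i = 1; i <= N; i++)         /* stand-in "computation" */
>               u[i] = 0.5 * (u[i - 1] + u[i + 1]);
>           t_comp += MPI_Wtime() - t0;
> 
>           t0 = MPI_Wtime();
>           /* exchange ghost cells with left and right neighbours */
>           MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, lo, 0,
>                        &u[N + 1], 1, MPI_DOUBLE, hi, 0,
>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>           MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, hi, 1,
>                        &u[0], 1, MPI_DOUBLE, lo, 1,
>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>           t_comm += MPI_Wtime() - t0;
>       }
>       printf("rank %d: computation %.3f s, communication %.3f s\n",
>              rank, t_comp, t_comm);
>       MPI_Finalize();
>       return 0;
>   }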
> 
> I'm using Harpertown E5420s (2.5 GHz). What disappoints me more is that 
> this newest Xeon cluster is not that much better than my old Pentium D 
> 930 cluster (3.0 GHz), which has 4 nodes (8 cores). I tested various 
> combinations of (number of nodes x cores per node), and the results 
> depend somewhat on the combination.
> 
> A hybrid run using the "-openmp" option of the Intel compilers made 
> things worse and broke the load balancing. The optimization options 
> (even -O2) also made runs slower, but did not break the load balancing.
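> 
> (For reference, the usual hybrid setup, as I understand it, is one MPI 
> rank per node with OMP_NUM_THREADS=8 and MPI_THREAD_FUNNELED, so that 
> only the main thread calls MPI, roughly as in the stand-alone sketch 
> below; whether the Cactus "-openmp" build actually works this way is 
> only my assumption:)
> 
>   #include <mpi.h>
>   #include <omp.h>
>   #include <stdio.h>
> 
>   int main(int argc, char **argv)
>   {
>       int provided, rank;
>       /* FUNNELED: only the thread that called MPI_Init_thread makes
>          MPI calls; OpenMP is used purely for the node-local loops   */
>       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>       #pragma omp parallel
>       {
>           #pragma omp single
>           printf("rank %d: %d OpenMP threads, thread level %d\n",
>                  rank, omp_get_num_threads(), provided);
>       }
> 
>       MPI_Finalize();
>       return 0;
>   }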
> 
> I checked the bandwidth behavior mentioned by Elvedin. Can I change the 
> message size and frequency at runtime, or is there some other way to 
> control them?
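> 
> (To make the question concrete: at the application level, one common 
> way to change the message size and frequency is to pack several small 
> fields into one buffer and send a single larger message, as in the 
> rough stand-alone sketch below; whether something like this is 
> practical inside Cactus/Carpet I don't know:)
> 
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <string.h>
> 
>   #define N 1000   /* size of each small field */
> 
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);
>       int rank;
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>       double a[N], b[N], c[N], buf[3 * N];
>       for (int i = 0; i < N; i++)
>           a[i] = b[i] = c[i] = (double)rank;
> 
>       if (rank == 0) {
>           /* pack three small fields into one buffer: one larger
>              message instead of three small ones                   */
>           memcpy(buf,         a, sizeof a);
>           memcpy(buf + N,     b, sizeof b);
>           memcpy(buf + 2 * N, c, sizeof c);
>           MPI_Send(buf, 3 * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
>       } else if (rank == 1) {
>           MPI_Recv(buf, 3 * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
>                    MPI_STATUS_IGNORE);
>           memcpy(a, buf,         sizeof a);
>           memcpy(b, buf + N,     sizeof b);
>           memcpy(c, buf + 2 * N, sizeof c);
>           printf("rank 1 got all three fields in one message\n");
>       }
> 
>       MPI_Finalize();
>       return 0;
>   }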
> 
> I have no idea how to improve the scalability, or how serious the 
> problem is. In any case it's a bit unsatisfactory at the moment, and I 
> hope I can find a better way from here. I appreciate all your kind 
> comments and suggestions.
> 
> Regards
> 
> Kim, Hee Il
> 
> 
> 
> 2008/3/28, Brian Dobbins <bdobbins at gmail.com>:
> 
>     Hi,
> 
>       I don't use the Cactus code myself, but from what little I /do/
>     know of it, this might not be unexpected.  For starters, what do you
>     mean by 'bad scalability'?  I believe most (all?) benchmark cases
>     demonstrate what is called 'weak scaling' - that is, the problem
>     size increases along with the number of processors.  So, running on
>     1 processor gives you a wall-clock time of /n/ seconds, and running
>     on 2 processors will probably give you a wall-clock time of /n/ +
>     <some small number>.  That small number is the communication time
>     of your code.  Thus, running on 80 cores /will/ be slower than
>     running on 1,
>     but it'll let you run a much larger system.
> 
>       (To clarify, unless you're specifically configuring a constant
>     problem size, you won't reduce your time to solution by increasing
>     your processors.)
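> 
>       (As a concrete illustration of the arithmetic: if the 1-processor
>     run takes, say, 1000 seconds and the 64-processor run of a problem
>     64 times as large takes 1400 seconds, the weak-scaling efficiency
>     is 1000 / 1400, or roughly 71%, and the extra ~400 seconds is the
>     parallel - mostly communication - overhead.)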
> 
>       The next thing to consider is which benchmark you're using, as
>     some of them are more scalable than others.  You're likely to get
>     different results when looking at the 'Whisky_Carpet' benchmark vs.
>     the 'BSSN_PUGH' one.  You might wish to take a look at the benchmark
>     database at the Cactus website, and there are some PDF files with
>     more information, too, including a master's thesis on benchmark
>     performance.
> 
>       Finally, a few other slightly more technical things to consider are:
> 
>     (1) What kind of Harpertowns are you using?  Looking at the 5450 vs.
>     the 5472 (both 3.0 GHz chips), the latter has more memory bandwidth
>     and may scale better, since the code does appear to make use of it.
>     Using the CPU2006fp_rate CactusADM benchmarks as a first
>     approximation to parallel performance, the SPEC website shows that a
>     5450 gets a score of 101 and 73.1 when going from 1 -> 8 cores (and
>     larger is better here - this is throughput, not wall-clock time),
>     and the 5472 goes from 112 -> 84.8.  Why does this matter?  Well,
>     you'll probably get different results running an 8-core job as 8x1,
>     4x2, or 1x8 (cores x nodes), and this will affect your benchmark
>     results somewhat.  (A quick way to check how your ranks are actually
>     distributed across nodes is sketched after these points.)
> 
>      (2) I think the code supports running with MPI and OpenMP... I
>     don't know if there will be any difference in performance if you
>     choose to run 1 MPI process per node with 8 OpenMP threads vs.
>     simply using 8 MPI processes, but it might be worth looking into.
> 
>      (3) Again, I have no first-hand knowledge of the code's performance
>     under different interconnects, but it /does/ seem likely to make a
>     difference... chances are if you asked on the Cactus forums, there
>     might be someone with first-hand experience with this who could give
>     you some more specific information. 
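> 
>       (Regarding point (1): a trivial, stand-alone way to see how your
>     ranks are actually distributed over the nodes is to print the host
>     name from every rank and count them per node - launch it exactly
>     the way you launch Cactus.  This is only an illustrative sketch:)
> 
>       #include <mpi.h>
>       #include <stdio.h>
> 
>       int main(int argc, char **argv)
>       {
>           MPI_Init(&argc, &argv);
>           int rank, len;
>           char host[MPI_MAX_PROCESSOR_NAME];
>           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>           MPI_Get_processor_name(host, &len);
>           /* one line per rank: count how many ranks land on each node */
>           printf("rank %d runs on %s\n", rank, host);
>           MPI_Finalize();
>           return 0;
>       }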
> 
>       Hope that helps, and if any of it isn't clear, I'll be happy to
>     try to clarify.  Good luck!
> 
>       Cheers,
>       - Brian
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji



