<div>I have been watching this and the other scalability thread.&nbsp; Here are my observations</div>  <div>on dual/quad-core boxes after spending the last few years playing with them:</div>  <div>-&nbsp; Your SW is the main factor in scalability.</div>  <div>-&nbsp; Scalability should be measured per physical CPU, not per core.</div>  <div>-&nbsp; Keeping all cores busy is likely to bring down the performance per core.&nbsp; </div>  <div>-&nbsp; The latest quad-core CPUs are notoriously bad in throughput if you use all or most of the cores.</div>  <div>&nbsp;&nbsp; &nbsp;-- shared cache thrashing ???</div>  <div>-&nbsp; Hyperthreading can reduce throughput.</div>  <div>-&nbsp; Memory bandwidth is likely the limiting factor on multi-core boxes, even on Sun's Niagara.</div>  <div>-&nbsp; You need to tune your algorithm to the HW you have.&nbsp; You can't rely solely on MPICH to deliver the scalability.</div>  <div>-&nbsp; If you really want performance, use the simplest MPICH

 routines.&nbsp; I have reduced my MPICH calls to fixed point-to-point comms, except at entry, where I have my only Barrier call.&nbsp; I don't even use Isend/Irecv (they perform extremely badly for me, for reasons I can't confirm).&nbsp; </div>  <div>&nbsp;</div>  <div>tan</div>  <div><BR><B><I>Pavan Balaji &lt;balaji@mcs.anl.gov&gt;</I></B> wrote:</div>  <BLOCKQUOTE class=replbq style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #1010ff 2px solid">Hee,<BR><BR>Can you send us this code? I'm interested in seeing what is causing the <BR>communication time to go up so much.<BR><BR>-- Pavan<BR><BR>On 03/28/2008 12:21 PM, Hee Il Kim wrote:<BR>&gt; Thanks all,<BR>&gt; <BR>&gt; The Cactus code has good scalability; especially with the latest <BR>&gt; version of Carpet it shows good scalability over 5000 cpu cores(?). I <BR>&gt; tested both BSSN+PUGH and Whisky+Carpet benchmarks. Not being an expert, I'm <BR>&gt; depending on the timing info shown by Cactus rather than using the profiling

 <BR>&gt; tools introduced by Pavan. The profiling info says that most of the <BR>&gt; communication time was spent enforcing boundary conditions. The total <BR>&gt; wall clock time (including communication time) increases from ~700 sec <BR>&gt; (1 CPU) to ~1500 sec (64 CPUs), whereas computation time only increases from ~600 <BR>&gt; to ~800 sec. Here the problem sizes were taken to be proportional to the <BR>&gt; number of cpus. So now I'm looking for ways to reduce the <BR>&gt; communication time.<BR>&gt; <BR>&gt; I'm using Harpertown 5420 (2.5GHz). What disappoints me more is that <BR>&gt; this newest Xeon cpu cluster is not that superior to my old Pentium D <BR>&gt; 930 cluster (3.0GHz), which has 4 nodes (8 cores). I tested various <BR>&gt; combinations of (node# x cpu#) and the results depend somewhat on the <BR>&gt; combinations.<BR>&gt; <BR>&gt; A hybrid run using the "-openmp" option of the Intel compilers made things worse <BR>&gt; and broke load balancing. Also the

 optimization options (even -O2) <BR>&gt; made runs slower but did not break the load balancing.<BR>&gt; <BR>&gt; I checked the bandwidth behavior mentioned by Elvedin. Could I change or <BR>&gt; set up the message size and frequency at run time, or by any other steps?<BR>&gt; <BR>&gt; I have no idea how to improve the scalability or how serious it is. <BR>&gt; Anyway, it's a bit unsatisfactory at the moment <BR>&gt; and I hope I can find a better way from here. I appreciate all your <BR>&gt; kind comments and suggestions.<BR>&gt; <BR>&gt; Regards<BR>&gt; <BR>&gt; Kim, Hee Il<BR>&gt; <BR>&gt; <BR>&gt; <BR>&gt; 2008/3/28, Brian Dobbins &lt;bdobbins@gmail.com&gt;:<BR>&gt; <BR>&gt; Hi,<BR>&gt; <BR>&gt; I don't use the Cactus code myself, but from what little I /do/<BR>&gt; know of it, this might not be unexpected. For starters, what do you<BR>&gt; mean by 'bad scalability'? I believe most (all?) benchmark cases<BR>&gt;

 demonstrate what is called 'weak scaling' - that is, the problem<BR>&gt; size increases along with the number of processors. So, running on<BR>&gt; 1 processor gives you a wall-clock time of /n/ seconds, and running<BR>&gt; on 2 processors will probably give you a wall-clock time of /n/+&lt;some<BR>&gt; small number&gt;. That small number is the communication time of your<BR>&gt; code. Thus, running on 80 cores /will/ be slower than running on 1,<BR>&gt; but it'll let you run a much larger system.<BR>&gt; <BR>&gt; (To clarify, unless you're specifically configuring a constant<BR>&gt; problem size, you won't reduce your time to solution by increasing<BR>&gt; your processors.)<BR>&gt; <BR>&gt; The next thing to consider is which benchmark you're using, as<BR>&gt; some of them are more scalable than others. You're likely to get<BR>&gt; different results when looking at the 'Whisky Carpet' benchmark vs.<BR>&gt; the 'BSSN_PUGH' one. You might wish to take a look at the

 benchmark<BR>&gt; database at the Cactus website, and there are some PDF files with<BR>&gt; more information, too, including a master's thesis on benchmark<BR>&gt; performance.<BR>&gt; <BR>&gt; Finally, a few other slightly more technical things to consider are:<BR>&gt; <BR>&gt; (1) What kind of Harpertowns are you using? Looking at the 5450 vs.<BR>&gt; the 5472 (both 3.0 GHz chips), the latter has more memory bandwidth,<BR>&gt; and may scale better since the code does appear to make use of it. <BR>&gt; Using the CPU2006fp_rate CactusADM benchmarks as a first<BR>&gt; approximation to parallel performance, the SPEC website shows that a<BR>&gt; 5450 gets a score of 101 and 73.1 when going from 1 -&gt; 8 cores (and<BR>&gt; larger is better here - this is throughput, not wall-clock time),<BR>&gt; and the 5472 goes from 112 -&gt; 84.8. Why does this matter? Well,<BR>&gt; you'll probably get different results running an 8-core job when<BR>&gt; running that as 8x1, 4x2, or 1x8

 (cores x nodes). This will impact<BR>&gt; your benchmark results somewhat.<BR>&gt; <BR>&gt; (2) I think the code supports running with MPI and OpenMP... I<BR>&gt; don't know if there will be any difference in performance if you<BR>&gt; choose to run 1 MPI process per node with 8 OpenMP threads vs.<BR>&gt; simply using 8 MPI processes, but it might be worth looking into.<BR>&gt; <BR>&gt; (3) Again, I have no first-hand knowledge of the code's performance<BR>&gt; under different interconnects, but it /does/ seem likely to make a<BR>&gt; difference... chances are if you asked on the Cactus forums, there<BR>&gt; might be someone with first-hand experience with this who could give<BR>&gt; you some more specific information. <BR>&gt; <BR>&gt; Hope that helps, and if any of it isn't clear, I'll be happy to<BR>&gt; try to clarify. Good luck!<BR>&gt; <BR>&gt; Cheers,<BR>&gt; - Brian<BR>&gt; <BR>&gt; <BR><BR>-- <BR>Pavan

 Balaji<BR>http://www.mcs.anl.gov/~balaji<BR><BR></BLOCKQUOTE><BR><p>&#32;
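<div>A rough sketch of the weak-scaling arithmetic behind the timings Hee Il quotes in the thread (~700 sec to ~1500 sec wall clock, ~600 sec to ~800 sec computation, 1 to 64 CPUs). It assumes that everything outside computation time counts as communication/overhead, which the thread does not confirm; the function and variable names are hypothetical, chosen only for illustration.</div>

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Weak-scaling efficiency: base wall-clock time divided by the
    wall-clock time of the scaled run (1.0 would be perfect scaling)."""
    return t_base / t_scaled


# Approximate figures quoted in the thread, in seconds.
wall_1, wall_64 = 700.0, 1500.0   # total wall-clock time on 1 and 64 CPUs
comp_1, comp_64 = 600.0, 800.0    # computation time on 1 and 64 CPUs

# Assumption: non-computation time is treated as communication/overhead.
comm_1 = wall_1 - comp_1
comm_64 = wall_64 - comp_64

overall_eff = weak_scaling_efficiency(wall_1, wall_64)
compute_eff = weak_scaling_efficiency(comp_1, comp_64)
comm_share_64 = comm_64 / wall_64

print(f"overall weak-scaling efficiency: {overall_eff:.2f}")
print(f"compute-only efficiency:         {compute_eff:.2f}")
print(f"communication share at 64 CPUs:  {comm_share_64:.2f}")
```

<div>Under that assumption, the non-computation time grows from about 100 sec to about 700 sec, so it accounts for most of the increase in wall-clock time, while the computation time itself also degrades somewhat.</div>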
