Hello all, especially Dr. Matt,<br><br>

<div><span class="gmail_quote">On 4/16/08, <b class="gmail_sendername">Matthew Knepley</b> &lt;<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>&gt; wrote:</span>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie &lt;<a href="mailto:rlmackie862@gmail.com">rlmackie862@gmail.com</a>&gt; wrote:<br>

&gt; I&#39;m running my PETSc code on a cluster of quad-core Xeons connected<br>&gt;&nbsp;&nbsp;by Infiniband. I hadn&#39;t much worried about the performance, because<br>&gt;&nbsp;&nbsp;everything seemed to be working quite well, but today I was actually<br>

&gt;&nbsp;&nbsp;comparing performance (wall clock time) for the same problem, but on<br>&gt;&nbsp;&nbsp;different combinations of CPUS.<br>&gt;<br>&gt;&nbsp;&nbsp;I find that my PETSc code is quite scalable until I start to use<br>&gt;&nbsp;&nbsp;multiple cores/cpu.<br>

&gt;<br>&gt;&nbsp;&nbsp;For example, the run time doesn&#39;t improve by going from 1 core/cpu<br>&gt;&nbsp;&nbsp;to 4 cores/cpu, and I find this to be very strange, especially since<br>&gt;&nbsp;&nbsp;looking at top or Ganglia, all 4 cpus on each node are running at 100%<br>

&gt; almost<br>&gt;&nbsp;&nbsp;all of the time. I would have thought if the cpus were going all out,<br>&gt;&nbsp;&nbsp;that I would still be getting much more scalable results.<br><br>Those are really coarse measures. There is absolutely no way that all cores<br>

are going 100%. It&#39;s easy to show by hand. Take the peak flop rate and<br>this gives you the bandwidth needed to sustain that computation (if<br>everything is perfect, like axpy). You will find that the chip bandwidth<br>is far below this. A nice analysis is in<br>

<br><a href="http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf">http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf</a><br><br>&gt;&nbsp;&nbsp;We are using mvapich-0.9.9 with infiniband. So, I don&#39;t know if<br>&gt;&nbsp;&nbsp;this is a cluster/Xeon issue, or something else.<br>

<br>This is actually mathematics! How satisfying. The only way to improve<br>this is to change the data structure (e.g. use blocks) or change the<br>algorithm (e.g. use spectral elements and unassembled structures)</blockquote>
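<div>To check that I follow the bandwidth argument quoted above, here is a rough sketch of the by-hand estimate for axpy (y = a*x + y). The peak flop rate and memory bandwidth below are made-up round numbers, not measurements from Randy&#39;s cluster, and the function names are my own. I assume 8-byte doubles, 2 flops per element, and 3 memory transfers per element (load x, load y, store y), ignoring any cache reuse.</div>

```python
def axpy_bandwidth_needed(peak_flops_per_s, bytes_per_value=8):
    """Memory traffic (bytes/s) needed to run y = a*x + y at the given flop rate.

    Each element costs 2 flops (one multiply, one add) and moves 3 values:
    load x[i], load y[i], store y[i].
    """
    flops_per_element = 2
    bytes_per_element = 3 * bytes_per_value
    return peak_flops_per_s / flops_per_element * bytes_per_element


def bandwidth_limited_flops(mem_bandwidth_bytes_per_s, bytes_per_value=8):
    """Best axpy flop rate the memory system can feed, whatever the core count."""
    return mem_bandwidth_bytes_per_s / (3 * bytes_per_value) * 2


# Hypothetical node, NOT measured on the cluster discussed in this thread.
PEAK_FLOPS_PER_CORE = 10e9   # assumed 10 Gflop/s per core
CORES = 4
MEM_BANDWIDTH = 10e9         # assumed 10 GB/s shared by all cores

needed = axpy_bandwidth_needed(PEAK_FLOPS_PER_CORE * CORES)
achievable = bandwidth_limited_flops(MEM_BANDWIDTH)
print(f"bandwidth needed at peak: {needed / 1e9:.0f} GB/s")
print(f"axpy rate the memory can feed: {achievable / 1e9:.2f} Gflop/s")
```

<div>With these assumed numbers the cores would need far more bandwidth than the memory system supplies, so adding cores that share the same memory bus would not speed up axpy, even though top shows them as busy.</div>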


<div>&nbsp;</div>

<div>Would you please explain a bit about &quot;unassembled structures&quot;? </div>

<div>Does the Discontinuous Galerkin method fall into this category?</div>

<div>&nbsp;</div>

<div>Thanks and Regards,</div>

<div>Amjad Ali.</div><br>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">Matt<br><br>&gt;&nbsp;&nbsp;Anybody with experience on this?<br>&gt;<br>&gt;&nbsp;&nbsp;Thanks, Randy M.<br>&gt;<br>&gt;<br><br>

<br><br>--<br>What most experimenters take for granted before they begin their<br>experiments is infinitely more interesting than any results to which<br>their experiments lead.<br>-- Norbert Wiener<br><br></blockquote></div>

<br>