I am resending the first figure because the y-label on the right plot was wrong (the right plot is the "transpose" of the left one). As expected, computing the RHS and matrix contributions scales linearly. Also, the problem size was kept fixed (2.5M) while varying the number of nodes and cores.<div>


<br></div><div>Rodrigo<br><div><br><div class="gmail_quote">On Fri, Nov 12, 2010 at 10:51 AM, Rodrigo R. Paz <span dir="ltr"><<a href="mailto:rodrigop@intec.unl.edu.ar">rodrigop@intec.unl.edu.ar</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><font color="#000000"><div>Hi, Aron,</div><div>IMHO, the main problem here is the low memory bandwidth (FSB) in Xeon E5420 nodes.</div>


<div>In a sparse matrix-vector product, the ratio between floating-point operations and memory accesses is low. Thus, the overall performance of this stage is mainly controlled by memory bandwidth.</div>
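<div>As a rough illustration of that ratio (a sketch, not from the original message): the estimate below counts flops and bytes for one mat-vec with the matrix stored in CSR format. The function names, the 7 nonzeros per row, and the 10 GB/s bandwidth are hypothetical values chosen only for the example, and the vector x is optimistically assumed to be read from memory once.</div>
<pre>
```python
# Rough roofline-style estimate for y = A*x with a square matrix A in CSR format.
# All numbers used here are illustrative assumptions, not measurements from this thread.


def spmv_csr_intensity(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Return (flops, bytes_moved, flops_per_byte) for one CSR mat-vec.

    Best case: x is read from memory only once and y is written once.
    """
    flops = 2 * nnz  # one multiply and one add per stored nonzero
    matrix_bytes = nnz * (value_bytes + index_bytes)  # values + column indices
    rowptr_bytes = (n_rows + 1) * index_bytes         # row pointer array
    vector_bytes = 2 * n_rows * value_bytes           # read x once, write y once
    bytes_moved = matrix_bytes + rowptr_bytes + vector_bytes
    return flops, bytes_moved, flops / bytes_moved


def bandwidth_bound_gflops(flops_per_byte, bandwidth_gb_per_s):
    """Upper bound on GFlop/s when the kernel is limited by memory bandwidth."""
    return flops_per_byte * bandwidth_gb_per_s


# Hypothetical example: 2.5M rows, 7 nonzeros per row, 10 GB/s sustained bandwidth.
n = 2_500_000
flops, nbytes, intensity = spmv_csr_intensity(n, 7 * n)
print(f"{intensity:.3f} flop/byte")
print(f"{bandwidth_bound_gflops(intensity, 10.0):.2f} GFlop/s bound")
```
</pre>
<div>Under these assumptions the bound depends only on the bandwidth and the storage format, not on the number of cores, which is consistent with a speedup that saturates once the memory bus is full.</div>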

</font><div>We have also tested on an i7 node (QPI memory controller), and the results did not improve appreciably (see the right plot of the attached figure). </div><div class="im"><div><br></div><div>Rodrigo</div><div>

<br></div><div>--</div>Rodrigo Paz<br>National Council for Scientific Research CONICET<br>CIMEC-INTEC-CONICET-UNL.<br>Güemes 3450. 3000, Santa Fe, Argentina.<br>Tel/Fax: +54-342-4511594, Fax: +54-342-4511169<br>

<br><br></div><div><div></div><div class="h5"><div class="gmail_quote">On Fri, Nov 12, 2010 at 10:32 AM, Aron Ahmadia <span dir="ltr"><<a href="mailto:aron.ahmadia@kaust.edu.sa" target="_blank">aron.ahmadia@kaust.edu.sa</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Rodrigo,<br>

<br>

These are interesting results.  It looks like you were bound by a<br>

speedup of about 2, which suggests you might have been seeing cache<br>

capacity/conflict problems.  Did you do any further analysis on why<br>

you weren't able to get better performance?<br>

<font color="#888888"><br>

A<br>

</font><div><div></div><div><br>

On Fri, Nov 12, 2010 at 8:26 AM, Rodrigo R. Paz<br>

<<a href="mailto:rodrigop@intec.unl.edu.ar" target="_blank">rodrigop@intec.unl.edu.ar</a>> wrote:<br>

> Hi all,<br>

> please find attached a plot with some speedup results that we obtained some<br>

> time ago with some hacks we introduced into PETSc so that it can be used on<br>

> hybrid architectures with OpenMP.<br>

> The tests were run on a set of 6 Xeon nodes with 8 cores each. Results are<br>

> for the MatMult operation in KSP, in the context of solving<br>

> advection-diffusion-reaction equations by means of SUPG-stabilized FEM.<br>

><br>

> Rodrigo<br>

><br>

> --<br>

> Rodrigo Paz<br>

> National Council for Scientific Research CONICET<br>

> CIMEC-INTEC-CONICET-UNL.<br>

> Güemes 3450. 3000, Santa Fe, Argentina.<br>

> Tel/Fax: +54-342-4511594, Fax: +54-342-4511169<br>

><br>

><br>

> On Thu, Nov 11, 2010 at 10:34 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

>><br>

>> On Nov 11, 2010, at 7:22 PM, Jed Brown wrote:<br>

>><br>

>> > On Fri, Nov 12, 2010 at 02:18, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

>> > How do you get adaptive load balancing (across the cores inside a<br>

>> > process) if you have the OpenMP compiler decide the partitioning/parallelism?<br>

>> > This was Bill's point about why not to use OpenMP. For example, if you give each<br>

>> > core the same amount of work up front, they will not all finish at the same<br>

>> > time, so you have wasted cycles.<br>

>> ><br>

>> > Hmm, I think this issue is largely subordinate to the memory locality<br>

>> > (for the sort of work we usually care about), but OpenMP could be more<br>

>> > dynamic about distributing work.  I.e., this could be an OpenMP<br>

>> > implementation or tuning issue, but I don't see it as a fundamental<br>

>> > disadvantage of that programming model.  I could be wrong.<br>

>><br>

>>   You are probably right; your previous explanation was better.  Here is<br>

>> something related that Bill and I discussed: static load balancing has lower<br>

>> overhead, while dynamic has more. Static load balancing, however, will<br>

>> end up with some imbalance. Thus one could do an up-front static load<br>

>> balancing of most of the data; then, when the first cores run out of their<br>

>> static work, they do the rest of the work with dynamic balancing.<br>

>><br>

>>   Barry<br>

>><br>

>> ><br>

>> > Jed<br>

>><br>

><br>

><br>

</div></div></blockquote></div><br>

</div></div></blockquote></div><br></div></div>
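<div>The static-then-dynamic idea in Barry's quoted message can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models task costs as plain numbers rather than running real threads, and the function names, the <code>static_fraction</code> parameter, and the per-task <code>overhead</code> charged to dynamic scheduling are all hypothetical, not part of PETSc or OpenMP.</div>
<pre>
```python
import heapq


def static_makespan(costs, n_workers):
    """Split tasks into contiguous equal-count chunks; return the slowest worker's time."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def hybrid_makespan(costs, n_workers, static_fraction=0.8, overhead=0.0):
    """Static chunks for the first part; idle workers then pull leftover tasks one by one.

    `overhead` is an assumed extra cost charged for every dynamically scheduled task.
    """
    n_static = int(len(costs) * static_fraction)
    chunk = n_static // n_workers
    # Time at which each worker finishes its statically assigned chunk.
    finish = [sum(costs[w * chunk:(w + 1) * chunk]) for w in range(n_workers)]
    heapq.heapify(finish)
    # Leftover tasks form a shared queue; the earliest-free worker takes the next one.
    for cost in costs[n_workers * chunk:]:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost + overhead)
    return max(finish)


# Hypothetical workload: 12 cheap tasks followed by 4 expensive ones, on 4 workers.
costs = [1] * 12 + [3] * 4
print("static only:", static_makespan(costs, 4))
print("hybrid:     ", hybrid_makespan(costs, 4, static_fraction=0.5))
print("hybrid + overhead:", hybrid_makespan(costs, 4, static_fraction=0.5, overhead=0.5))
```
</pre>
<div>In this toy model the purely static split leaves one worker with all the expensive tasks, while the hybrid split evens out the finish times at the price of the assumed per-task overhead.</div>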