That load imbalance often comes from whatever came *before* the reduction.
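A quick way to separate the two effects is to time the bandwidth-bound kernels in isolation, with no assembly or matrix work in front of them. Something along these lines (my own throwaway sketch, arbitrary local size, old-style error checking) prints one VecAXPY timing per rank; if the max/min spread is already ~3x here, the problem is memory placement / NUMA rather than your matrix mapping:

/* timing_sketch.c -- hedged sketch, not the poster's code.
 * Times a bandwidth-bound kernel in isolation, one line per rank,
 * to see whether the max/min imbalance shows up without any
 * application work in front of it. Local size is an arbitrary choice. */
#include <petscvec.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscInt       n = 1000000;          /* assumed local length per rank */
  PetscInt       i;
  PetscReal      nrm;
  PetscMPIInt    rank;
  double         t0, t1;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

  ierr = VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DECIDE, &x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* touch the memory once before timing */
  ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);

  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
  t0 = MPI_Wtime();
  for (i = 0; i < 100; i++) { ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr); }
  t1 = MPI_Wtime();
  printf("[rank %d] 100 x VecAXPY: %g s\n", (int)rank, t1 - t0);

  ierr = VecNorm(y, NORM_2, &nrm);CHKERRQ(ierr);  /* one collective for comparison */

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If the isolated kernels are balanced, the imbalance you see on VecNorm/VecMDot is just ranks arriving at the reduction at different times. Either way it is worth pinning processes to cores (with your MPI launcher's binding option or numactl; the exact flag depends on the MPI version), and, if I remember correctly, PETSc ships a streams-style benchmark that shows how quickly one node's memory bus saturates as you add cores.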
<div class="gmail_quote">On May 26, 2012 5:25 PM, "Mark F. Adams" <<a href="mailto:mark.adams@columbia.edu">mark.adams@columbia.edu</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word">Just a few comments on the perf data. I see a lot (~5x) load imbalance on reduction stuff like VecMDot and VecNorm (Time Ratio, max/min) but even a lot of imbalance on simple non-collective vector ops like VecAXPY (~3x). So if the work is load balanced well then I would suspect its a NUMA issue.<div>
<br></div><div>Mark<br><div><br></div><div><div><div>On May 26, 2012, at 6:13 PM, Aron Roland wrote:</div><br><blockquote type="cite">
>> Dear All,
>>
>> I have a question about a recent implementation of PETSc for solving a
>> large linear system arising from a 4D problem on hybrid unstructured
>> meshes.
>>
>> We have implemented all the mappings and the solution is fine, as is the
>> number of iterations. The results are robust with respect to the number
>> of CPUs used, but we have a scaling issue.
>>
>> The system is a latest-generation Intel cluster with InfiniBand.
>>
>> We have attached the summary ... with hopefully a lot of information.
>>
>> Any comments, suggestions, or ideas are very welcome.
>>
>> We have been reading the threads dealing with multi-core machines and the
>> bus-limitation issue, so we are aware of this.
>>
>> I am now thinking about a hybrid OpenMP/MPI approach, but I am not quite
>> happy with the bus-limitation explanation, since most systems are
>> multicore these days.
>>
>> I hope the limitation is not the sparse matrix mapping that we are
>> using ...
>>
>> Thanks in advance ...
>>
>> Cheers
>>
>> Aron
>>
>> <benchmark.txt>