<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Sep 28, 2016 at 1:40 PM, Barry Smith <span dir="ltr">&lt;<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

   Mr Hong Zhang has found that removing the manual unrolling from MatMult_SeqAIJ_Inode() (at least with inode size 2) results in a good bump in performance on KNL, and he pointed me to the Intel gospel <a href="https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling" rel="noreferrer" target="_blank">https://software.intel.com/en-<wbr>us/articles/avoid-manual-loop-<wbr>unrolling</a>, which we've always ignored in the past. It would be good to try the unrolled and non-unrolled versions on Xeon as well.<br>

<br>

   We've never done a good job of managing our unrolling: where, how, and when we do it, and the macros we use for it, such as PetscSparseDensePlusDot. Intel would say to just throw it all away.<br></blockquote><div><br></div><div>From talking to some colleagues at Intel, I gather that the only time they do manual unrolling is for cases with nice unit-stride accesses and in which they are using Intel intrinsic instructions.  Otherwise, it is best to rely on the compiler to do this.  If you have a really good reason to use a particular unrolling factor, you can suggest it to the compiler with "#pragma unroll (n)".<br><br></div><div>My guess is that, with the Intel compiler at least, we are better off letting it do the unrolling.  I'm not sure about other compilers out there.<br><br></div><div>--Richard<br></div></div></div></div>
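<div>A minimal sketch of the two options discussed above, assuming a plain dot-product kernel rather than the actual MatMult_SeqAIJ_Inode() code. The function names dot_manual_unroll4 and dot_pragma_unroll are hypothetical, and the factor 4 is only an illustration. The "#pragma unroll" spelling is the Intel compiler's (Clang also accepts it); other compilers may ignore it or use a different spelling, such as "#pragma GCC unroll 4" in GCC 8 and later.</div>

```c
#include <stddef.h>

/* Hypothetical helper (not a PETSc routine): dot product with the loop
   manually unrolled by 4, using four independent partial sums. */
double dot_manual_unroll4(const double *a, const double *x, size_t n)
{
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  size_t i = 0;

  /* Main unrolled body: handles 4 entries per iteration. */
  for (; i + 4 <= n; i += 4) {
    s0 += a[i]     * x[i];
    s1 += a[i + 1] * x[i + 1];
    s2 += a[i + 2] * x[i + 2];
    s3 += a[i + 3] * x[i + 3];
  }
  /* Remainder loop: handles the last n % 4 entries. */
  for (; i < n; i++) s0 += a[i] * x[i];

  return (s0 + s1) + (s2 + s3);
}

/* Hypothetical helper: the same dot product written as a plain loop,
   with the unrolling factor only suggested to the compiler. */
double dot_pragma_unroll(const double *a, const double *x, size_t n)
{
  double sum = 0.0;
  size_t i;

  /* A hint, not a guarantee; compilers that do not know this pragma
     ignore it (possibly with a warning) and the loop stays correct. */
#pragma unroll(4)
  for (i = 0; i < n; i++) sum += a[i] * x[i];

  return sum;
}
```

<div>The two versions can differ in the last few bits of the result because the manual version changes the order of the floating-point additions. Timing both on the target machine is the only way to tell which one the compiler handles better.</div>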