On Tue, Nov 14, 2017 at 12:13 PM, Zhang, Hong <hongzhang@anl.gov> wrote:

>> On Nov 13, 2017, at 10:49 PM, Xiangdong <epscodes@gmail.com> wrote:
>>
>> 1) How about the vectorization of the BAIJ format?
>
> BAIJ kernels are optimized with manual unrolling, but not with AVX
> intrinsics, so vectorization relies on the compiler's ability.
> They may or may not get vectorized, depending on the compiler's
> optimization decisions, but vectorization is not essential for the
> performance of most BAIJ kernels.

I know that this has come up in previous discussions, but I'm guessing
that the manual unrolling actually impedes the ability of many modern
compilers to optimize the BAIJ calculations. I suppose we ought to have a
switch to enable or disable the use of the unrolled versions? (And,
further down the road, some sort of performance model to tell us what the
setting for the switch should be...) Roughly, I mean the difference
between the two styles sketched below.
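A minimal sketch of the two styles for bs = 4 (illustrative C only, not
the actual PETSc source; the function names are made up, and I assume the
block is stored column by column):

    /* Unrolled style: every term written out by hand. */
    static void blockmult4_unrolled(const double *A, const double *x, double *y)
    {
      y[0] += A[0]*x[0] + A[4]*x[1] + A[8]*x[2]  + A[12]*x[3];
      y[1] += A[1]*x[0] + A[5]*x[1] + A[9]*x[2]  + A[13]*x[3];
      y[2] += A[2]*x[0] + A[6]*x[1] + A[10]*x[2] + A[14]*x[3];
      y[3] += A[3]*x[0] + A[7]*x[1] + A[11]*x[2] + A[15]*x[3];
    }

    /* Plain-loop style: the same computation, in a form that a modern
       compiler can often auto-vectorize on its own. */
    static void blockmult4_loops(const double *A, const double *x, double *y)
    {
      for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++) y[i] += A[i+4*j]*x[j];
    }

--Richard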

>> If the block size is 2 or 4, would it be ideal for AVX? Do I need to do
>> anything special (more than the AVX flag) for the compiler to vectorize
>> it?
>
> In double precision, a block size of 4 would be good for AVX/AVX2, and 8
> would be ideal for AVX512. Other block sizes would make vectorization
> less profitable because of the remainders.
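>
> To see why: in double precision, one column of a 4x4 block exactly fills
> a 256-bit AVX register. A rough sketch of the idea (illustrative only,
> not the actual PETSc kernel; the function name is made up, and the block
> is assumed to be stored column by column):
>
>     #include <immintrin.h>
>
>     /* y += A*x for one 4x4 block in double precision, using AVX */
>     static void blockmult4_avx(const double *A, const double *x, double *y)
>     {
>       __m256d yv = _mm256_loadu_pd(y);
>       for (int j = 0; j < 4; j++) {
>         __m256d col = _mm256_loadu_pd(A + 4*j); /* one column per register */
>         __m256d xj  = _mm256_set1_pd(x[j]);     /* broadcast x[j] */
>         yv = _mm256_add_pd(yv, _mm256_mul_pd(col, xj));
>       }
>       _mm256_storeu_pd(y, yv);
>     }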
>
>> 2) Could you please update the linear solver table to label the
>> preconditioners/solvers that are compatible with the ELL format?
>> http://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html
>
> This is still a work in progress. The easiest thing to do would be to
> use ELL for the Jacobian matrix and other formats (e.g. AIJ) for the
> preconditioners; then you would not need to worry about which
> preconditioners are compatible. An example can be found in
> ts/examples/tutorials/advection-diffusion-reaction/ex5adj.c. For
> preconditioners such as block Jacobi and mg (with bjacobi or with sor),
> you can use ELL for both the preconditioner and the Jacobian, and expect
> a considerable gain since MatMult is the dominant operation.
>
> The makefile for ex5adj includes a few use cases that demonstrate how
> ELL plays with various preconditioners; a couple of representative
> command lines follow.
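>
> For example, something along these lines (paraphrased from memory; see
> the makefile for the exact options, and note that the type name will
> change from "ell" to "sell" with the rename):
>
>     mpiexec -n 4 ./ex5adj -dm_mat_type ell -pc_type bjacobi
>     mpiexec -n 4 ./ex5adj -dm_mat_type ell -pc_type mg -mg_levels_pc_type sor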
>
> Hong (Mr.)
>
>> Thank you.
>>
>> Xiangdong
>>
>> On Mon, Nov 13, 2017 at 11:32 AM, Zhang, Hong <hongzhang@anl.gov> wrote:
>>
>>> Most operations in PETSc would not benefit much from vectorization,
>>> since they are memory-bound. But this should not discourage you from
>>> compiling PETSc with AVX2/AVX512. We have added a new matrix format
>>> (currently named ELL, but it will be renamed SELL shortly) that can
>>> make MatMult ~2X faster than the AIJ format. The MatMult kernel is
>>> hand-optimized with AVX intrinsics. It works on any Intel processor
>>> that supports AVX, AVX2, or AVX512, e.g. Haswell, Broadwell, Xeon Phi,
>>> or Skylake. We have also been optimizing the AIJ MatMult kernel for
>>> these architectures. Note that one has to use AVX compiler flags in
>>> order to take advantage of the optimized kernels and the new matrix
>>> format; an example configure line follows.
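>>>
>>> For instance, on an AVX512 machine one might configure with something
>>> like this (an illustration only; substitute the right flags for your
>>> compiler, e.g. -march=native with GCC):
>>>
>>>     ./configure --with-debugging=0 COPTFLAGS="-O3 -xCORE-AVX512"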
>>>
>>> Hong (Mr.)
>>>
>>>> On Nov 12, 2017, at 10:35 PM, Xiangdong <epscodes@gmail.com> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> Can someone comment on the vectorization of PETSc? For example, for
>>>> the MatMult function, will it perform better or run faster if it is
>>>> compiled with avx2 or avx512?
>>>>
>>>> Thank you.
>>>>
>>>> Best,
>>>> Xiangdong