<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Feb 13, 2018 at 8:48 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">Richard Tran Mills <<a href="mailto:rtmills@anl.gov">rtmills@anl.gov</a>> writes:<br>

<br>

> I haven't experimented very thoroughly with it (hmm... should probably do<br>

> such experiments), but I believe that, once matrix rows become sufficiently<br>

> long, then SELL doesn't provide an advantage over AIJ.<br>

<br>

</span>What is the performance model that explains why SELL doesn't benefit<br>

from long rows?  It's clear for large blocks where BAIJ can be compiled<br>

to vectorized loads and stores, but much less clear when AIJ is<br>

producing basically scalar code.<br></blockquote><div><br></div><div>Hmm. I may be mistaken in my statements about long rows and AIJ -- I think it might be the case that every matrix with long rows for which I've seen decent performance with AIJ was one that uses and benefits from the Inode versions of the routines.<br><br></div><div>Why do you say that AIJ is producing basically scalar code? With the Intel 18.0 compiler and "-O3 -xMIC-AVX512" flags, the compiler report indicates that the main loop in MatMult_SeqAIJ() is vectorized, with an estimated speedup of 3.410 -- not great, but not terrible, either. I unfortunately never got very good at interpreting the assembly files generated by the compiler -- I generally used VTune to look at the assembly for profiled loops, and I haven't tried that this time -- but my (possibly confused) look through the assembler output for that loop shows vector instructions like vfmadd231pd. Am I missing something? (Entirely possible.)<br><br></div><div>--Richard<br></div></div><br></div></div>