<div dir="ltr"><div dir="auto" class="gmail_msg">@jed: Your assembly is what I would have expected. Let me simplify my code and see if I can provide a useful test example. (Also: I assume your assembly is for Xeon, so I should definitely use AVX-512.)</div><div dir="auto" class="gmail_msg"><br></div><div dir="auto" class="gmail_msg">Let me get back to you in a few days (work permitting) with something you can use.<br><div dir="auto" class="gmail_msg"><br class="gmail_msg"></div><div dir="auto" class="gmail_msg">From your example, I wouldn't expect any benefit from my code compared to just calling PETSc (for those simple kernels).</div><div dir="auto" class="gmail_msg"><br class="gmail_msg"></div><div dir="auto" class="gmail_msg">A big plus I hadn't thought of would be that the compiler is really forced to vectorise (which matters in a case like mine, where I might have messed up some config parameter).</div><div dir="auto" class="gmail_msg"><br class="gmail_msg"></div><div dir="auto" class="gmail_msg">@barry: I'm definitely too young to comment here (i.e. it's me that changed, not the world). This is definitely not new stuff; for instance, Armadillo/Boost/Eigen have been production-ready for many years now. I somehow have the impression that, now that C++11 is more mainstream, it is much easier to write readable/maintainable code (still ugly as hell, though). I think we can now take a C++11 compiler for granted on any "supercomputer", and even C++14 and soon C++17... and this makes development and interfaces much nicer.</div><div dir="auto" class="gmail_msg"><br></div><div class="gmail_msg">What I would like to see is something like PETSc (where I have nice, hidden MPI calls, for instance), combined with the niceness of those libraries (where many operations can be written in a, if I might say so, more natural way). 
(My plan is: you did all the hard work, C++ can put a ribbon on it and see what comes out.)</div></div><div class="gmail_extra gmail_msg"><br class="gmail_msg"><div class="gmail_quote gmail_msg">On 5 Apr 2017 5:39 am, "Jed Brown" <<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a>> wrote:<br type="attribution" class="gmail_msg"><blockquote class="gmail_quote gmail_msg" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Matthew Knepley <<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a>> writes:<br class="gmail_msg">

<br class="gmail_msg">

> On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a>> wrote:<br class="gmail_msg">

><br class="gmail_msg">

>> Matthew Knepley <<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a>> writes:<br class="gmail_msg">

>><br class="gmail_msg">

>> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi <<a href="mailto:filippo.leon@gmail.com" class="gmail_msg" target="_blank">filippo.leon@gmail.com</a><br class="gmail_msg">

>> ><br class="gmail_msg">

>> > wrote:<br class="gmail_msg">

>> ><br class="gmail_msg">

>> >> I had weird issues where gcc (that I am using for my tests right now)<br class="gmail_msg">

>> >> wasn't vectorising properly (even enabling all flags, from tree-vectorize,<br class="gmail_msg">

>> >> to mavx). According to my tests, I know the Intel compiler was a bit better<br class="gmail_msg">

>> >> at that.<br class="gmail_msg">

>> >><br class="gmail_msg">

>> ><br class="gmail_msg">

>> > We are definitely at the mercy of the compiler for this. Maybe Jed has an<br class="gmail_msg">

>> > idea why it's not vectorizing.<br class="gmail_msg">

>><br class="gmail_msg">

>> Is this so bad?<br class="gmail_msg">

>><br class="gmail_msg">

>> 000000000024080e <VecMAXPY_Seq+0x2fe> mov    rax,QWORD PTR [rbp-0xb0]<br class="gmail_msg">

>> 0000000000240815 <VecMAXPY_Seq+0x305> add    ebx,0x1<br class="gmail_msg">

>> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR [rax+r9*1]<br class="gmail_msg">

>> 000000000024081e <VecMAXPY_Seq+0x30e> mov    rax,QWORD PTR [rbp-0xa8]<br class="gmail_msg">

>> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd ymm0,ymm8,YMMWORD PTR [rax+r9*1]<br class="gmail_msg">

>> 000000000024082b <VecMAXPY_Seq+0x31b> mov    rax,QWORD PTR [rbp-0xb8]<br class="gmail_msg">

>> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd ymm0,ymm6,YMMWORD PTR [rax+r9*1]<br class="gmail_msg">

>> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd ymm0,ymm5,YMMWORD PTR [r10+r9*1]<br class="gmail_msg">

>> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR [r11+r9*1]<br class="gmail_msg">

>> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR [r11+r9*1],ymm0<br class="gmail_msg">

>> 000000000024084a <VecMAXPY_Seq+0x33a> add    r9,0x20<br class="gmail_msg">

>> 000000000024084e <VecMAXPY_Seq+0x33e> cmp    DWORD PTR [rbp-0xa0],ebx<br class="gmail_msg">

>> 0000000000240854 <VecMAXPY_Seq+0x344> ja     000000000024080e <VecMAXPY_Seq+0x2fe><br class="gmail_msg">

>><br class="gmail_msg">

><br class="gmail_msg">

> I agree that is what we should see. It cannot be what Filippo has if he is<br class="gmail_msg">

> getting ~4x with the template stuff.<br class="gmail_msg">

<br class="gmail_msg">

I'm using gcc.  Filippo, can you make an easy-to-run test that we can<br class="gmail_msg">

evaluate on Xeon and KNL?<br class="gmail_msg">

</blockquote></div></div>
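<div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">For reference, the inner loop in the quoted VecMAXPY_Seq listing (one vmulpd, three vfmadd231pd, then a vaddpd and a store on ymm registers) is consistent with a loop that adds four scaled vectors to y at once. Below is a minimal sketch of such a kernel in plain C. It is not PETSc's source: the name maxpy4 and its signature are hypothetical, chosen only for illustration.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical four-vector MAXPY kernel (not PETSc's actual code):
 *   y[i] += a0*x0[i] + a1*x1[i] + a2*x2[i] + a3*x3[i]
 * `restrict` promises the compiler that the arrays do not overlap,
 * which removes one common obstacle to auto-vectorization. */
void maxpy4(size_t n, double *restrict y,
            double a0, const double *restrict x0,
            double a1, const double *restrict x1,
            double a2, const double *restrict x2,
            double a3, const double *restrict x3)
{
    /* Precondition check, kept outside the hot loop. */
    assert(n == 0 || (y && x0 && x1 && x2 && x3));

    for (size_t i = 0; i < n; i++) {
        /* One multiply plus three multiply-adds, then the update of y. */
        y[i] += a0 * x0[i] + a1 * x1[i] + a2 * x2[i] + a3 * x3[i];
    }
}
```

<div class="gmail_msg">To check whether a loop like this gets vectorised, one option with gcc is to compile with -O3 -march=native -fopt-info-vec and then inspect the result with objdump -d, looking for packed instructions such as vfmadd231pd in the loop body. The exact output depends on the compiler version and the target CPU.</div>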

</div>