<div dir="ltr"><span style="font-size:small">Awesome, I appreciate the feedback of all of you.</span><div style="font-size:small"><br></div><div style="font-size:small">@Karli: Just to be clear: my original idea is to wrap up and make PETSc C++ bindings in a clean C++-ish way. Then an added value would be expression templates ("easy" to implement once you have a clear infrastructure), with all the benefits and drawbacks, as we discussed. But this would just be an "extra" thing.</div><div style="font-size:small"><br></div><div style="font-size:small">For my computations: I really have to say that I must've messed up something (gcc not vectorizing some PETSc call), and I really do not expect a 4x improvemet from templates. If I manage, I will do some precise test and provide you a simple code to do so (if you are interested), since I am not sure I will have KNL available for those tests.</div><br><div class="gmail_quote"><div dir="ltr">On Wed, 5 Apr 2017 at 11:19 Karl Rupp <<a href="mailto:rupp@iue.tuwien.ac.at">rupp@iue.tuwien.ac.at</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Filippo,<br class="gmail_msg">

<br class="gmail_msg">

did you compile PETSc with the same level of optimization than your<br class="gmail_msg">

template code? In particular, did you turn debugging off for the timings?<br class="gmail_msg">

<br class="gmail_msg">

Either way, let me share some of the mental stages I've gone through<br class="gmail_msg">

with ViennaCL. It started out with the same premises as you provide:<br class="gmail_msg">

There is nice C++, compilers are 'now' mature enough, and everything can<br class="gmail_msg">

be made a lot 'easier' and 'simpler' for the user. This was in 2010,<br class="gmail_msg">

back when all this GPU computing took off.<br class="gmail_msg">

<br class="gmail_msg">

Now, fast-forward to this day, I consider the extensive use of C++ with<br class="gmail_msg">

expression templates and other fanciness to be the biggest mistake of<br class="gmail_msg">

ViennaCL. Here is an incomplete list of why:<br class="gmail_msg">

<br class="gmail_msg">

  a) The 'generic' interface mimics that of Boost.uBLAS, so you can just<br class="gmail_msg">

mix&match the two libraries. Problems: First, one replicates the really<br class="gmail_msg">

poor design choices in uBLAS. Second, most users didn't even *think*<br class="gmail_msg">

about using these generic routines for anything other than ViennaCL<br class="gmail_msg">

types. There are just many more computational scientists familiar with<br class="gmail_msg">

Fortran or C than with "modern C++".<br class="gmail_msg">

<br class="gmail_msg">

  b) Keeping the type system consistent. So you would probably introduce<br class="gmail_msg">

a matrix, a vector, and operator overloads. For example, take y = A * x<br class="gmail_msg">

for a matrix A and vectors y, x. Sooner or later, users will write y = R<br class="gmail_msg">

* A * P * x and all of a sudden your left-to-right associativity results<br class="gmail_msg">

in an evaluation as<br class="gmail_msg">

   y = ((R * A) * P) * x.<br class="gmail_msg">

(How to 'solve' this particular case: Introduce even more expression<br class="gmail_msg">

templates! See Blaze for an example. So you soon have hundreds of lines<br class="gmail_msg">

of code just to fix problems you never had with just calling functions.)<br class="gmail_msg">

<br class="gmail_msg">

  c) Deal with errors: Continuing with the above example, what if R * A<br class="gmail_msg">

runs out of memory? Sure, you can throw exceptions, but the resulting<br class="gmail_msg">

stack trace will contain all the expression template frenzy, plus you<br class="gmail_msg">

will have a hard time to find out whether the operation R * A or the<br class="gmail_msg">

subsequent multiplication with P causes the error.<br class="gmail_msg">

<br class="gmail_msg">

  d) Expression templates are limited to a single line of code. Every<br class="gmail_msg">

now and then, algorithms require repeated operations on different data.<br class="gmail_msg">

For example, you may have vector operations of the form<br class="gmail_msg">

   x_i <- x_{i-1} + alpha * p_{i-1}<br class="gmail_msg">

   r_i <- r_{i-1} - alpha * y_i<br class="gmail_msg">

   p_i <- r_i + beta * p_{i-1}<br class="gmail_msg">

(with alpha and beta being scalars, everything else being vectors)<br class="gmail_msg">

in a pipelined conjugate gradient formulation. The "C"-way of doing this<br class="gmail_msg">

is to grab the pointers to the data and compute all three vector<br class="gmail_msg">

operations in a single for-loop, thus saving on repeated data access<br class="gmail_msg">

from main memory. With expression templates, there is no way to make<br class="gmail_msg">

this as efficient, because the lifetime of expression templates is<br class="gmail_msg">

exactly one line of code.<br class="gmail_msg">

<br class="gmail_msg">

  e) Lack of a stable ABI. There is a good reason why many vendor<br class="gmail_msg">

libraries come with a C interface, not with a C++ interface. If you try<br class="gmail_msg">

to link C++ object files generated by different C++ compilers (it is<br class="gmail_msg">

enough to have different versions of the same compiler, see Microsoft),<br class="gmail_msg">

it is undefined behavior. In best case, it fails with a linker error. In<br class="gmail_msg">

worst case, it will produce random crashes right before the end of a<br class="gmail_msg">

one-month simulation that you needed for your paper to be submitted in a<br class="gmail_msg">

few days. If I remember correctly, one group got ambitious with C++ for<br class="gmail_msg">

writing an MPI library, running into these kinds of problems with the<br class="gmail_msg">

C++ standard library.<br class="gmail_msg">

<br class="gmail_msg">

  f) The entry bar gets way too high. PETSc's current code base allows<br class="gmail_msg">

several dozens of users to contribute to each release. If some problem<br class="gmail_msg">

shows up in a particular function, you just go there, hook up the<br class="gmail_msg">

debugger, and figure out what is going on. Now assume there are C++<br class="gmail_msg">

expression templates involved. Your stack trace may now spread over<br class="gmail_msg">

several screen pages. Once you navigate to the offending line, you find<br class="gmail_msg">

that a certain overload of a traits class coupled with two functors<br class="gmail_msg">

provides wrong results. The reverse lookup of how exactly the traits<br class="gmail_msg">

class was instantiated, which working set was passed in, and how those<br class="gmail_msg">

functors interact may easily take you many minutes to digest - if you<br class="gmail_msg">

are the guy who has written that piece of code. People new to PETSc are<br class="gmail_msg">

likely to just give up. Note that this is a problem for new library<br class="gmail_msg">

users across the spectrum of C++ libraries; you can find evidence e.g.<br class="gmail_msg">

on the Eigen mailinglist. In the end, C++ templates are leaky<br class="gmail_msg">

abstractions and their use will sooner or later hit the user.<br class="gmail_msg">

<br class="gmail_msg">

Long story short: Keep in mind that there is a cost to "modern C++" in<br class="gmail_msg">

exchange for eventual performance benefits. For PETSc applications, the<br class="gmail_msg">

bottleneck is all too often not the performance of an expression that<br class="gmail_msg">

could be 'expression templated', but in areas where the use of C++<br class="gmail_msg">

wouldn't make a difference: algorithm selection, network performance, or<br class="gmail_msg">

poor process placement, just to name a few.<br class="gmail_msg">

<br class="gmail_msg">

Nonetheless, don't feel set back; you may have better ideas and know<br class="gmail_msg">

better ways of dealing with C++ than we do. A lightweight wrapper on top<br class="gmail_msg">

of PETSc, similar to what petsc4py does for Python, is something that a<br class="gmail_msg">

share of users may find handy. Whether it's worth the extra coding and<br class="gmail_msg">

maintenance effort? I don't know.<br class="gmail_msg">

<br class="gmail_msg">

Best regards,<br class="gmail_msg">

Karli<br class="gmail_msg">

<br class="gmail_msg">

<br class="gmail_msg">

On 04/05/2017 07:42 AM, Filippo Leonardi wrote:<br class="gmail_msg">

> @jed: You assembly is what I would've expected. Let me simplify my code<br class="gmail_msg">

> and see if I can provide a useful test example. (also: I assume your<br class="gmail_msg">

> assembly is for xeon, so I should definitely use avx512).<br class="gmail_msg">

><br class="gmail_msg">

> Let me get back at you in a few days (work permitting) with something<br class="gmail_msg">

> you can use.<br class="gmail_msg">

><br class="gmail_msg">

> From your example I wouldn't expect any benefit with my code compared to<br class="gmail_msg">

> just calling petsc (for those simple kernels).<br class="gmail_msg">

><br class="gmail_msg">

> A big plus I hadn't thought of, would be that the compiler is really<br class="gmail_msg">

> forced to vectorise (like in my case, where I might'have messed up some<br class="gmail_msg">

> config parameter).<br class="gmail_msg">

><br class="gmail_msg">

> @barry: I'm definitely too young to comment here (i.e. it's me that<br class="gmail_msg">

> changed, not the world). Definitely this is not new stuff, and, for<br class="gmail_msg">

> instance, Armadillo/boost/Eigen have been successfully production ready<br class="gmail_msg">

> for many years now. I have somehow the impression that now that c++11 is<br class="gmail_msg">

> more mainstream, it is much easier to write easily readable/maintainable<br class="gmail_msg">

> code (still ugly as hell tough). I think we can now give for granted a<br class="gmail_msg">

> c++11 compiler on any "supercomputer", and even c++14 and soon c++17...<br class="gmail_msg">

> and this makes development and interfaces much nicer.<br class="gmail_msg">

><br class="gmail_msg">

> What I would like to see is something like PETSc (where I have nice,<br class="gmail_msg">

> hidden MPI calls for instance), combined with the niceness of those<br class="gmail_msg">

> libraries (where many operations can be written in a, if I might say so,<br class="gmail_msg">

> more natural way). (My plan is: you did all the hard work, C++ can put a<br class="gmail_msg">

> ribbon on it and see what comes out.)<br class="gmail_msg">

><br class="gmail_msg">

> On 5 Apr 2017 5:39 am, "Jed Brown" <<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a><br class="gmail_msg">

> <mailto:<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a>>> wrote:<br class="gmail_msg">

><br class="gmail_msg">

>     Matthew Knepley <<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a> <mailto:<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a>>> writes:<br class="gmail_msg">

><br class="gmail_msg">

>     > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a><br class="gmail_msg">

>     <mailto:<a href="mailto:jed@jedbrown.org" class="gmail_msg" target="_blank">jed@jedbrown.org</a>>> wrote:<br class="gmail_msg">

>     ><br class="gmail_msg">

>     >> Matthew Knepley <<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a> <mailto:<a href="mailto:knepley@gmail.com" class="gmail_msg" target="_blank">knepley@gmail.com</a>>><br class="gmail_msg">

>     writes:<br class="gmail_msg">

>     >><br class="gmail_msg">

>     >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi<br class="gmail_msg">

>     <<a href="mailto:filippo.leon@gmail.com" class="gmail_msg" target="_blank">filippo.leon@gmail.com</a> <mailto:<a href="mailto:filippo.leon@gmail.com" class="gmail_msg" target="_blank">filippo.leon@gmail.com</a>><br class="gmail_msg">

>     >> ><br class="gmail_msg">

>     >> > wrote:<br class="gmail_msg">

>     >> ><br class="gmail_msg">

>     >> >> I had weird issues where gcc (that I am using for my tests<br class="gmail_msg">

>     right now)<br class="gmail_msg">

>     >> >> wasn't vectorising properly (even enabling all flags, from<br class="gmail_msg">

>     >> tree-vectorize,<br class="gmail_msg">

>     >> >> to mavx). According to my tests, I know the Intel compiler was<br class="gmail_msg">

>     a bit<br class="gmail_msg">

>     >> better<br class="gmail_msg">

>     >> >> at that.<br class="gmail_msg">

>     >> >><br class="gmail_msg">

>     >> ><br class="gmail_msg">

>     >> > We are definitely at the mercy of the compiler for this. Maybe<br class="gmail_msg">

>     Jed has an<br class="gmail_msg">

>     >> > idea why its not vectorizing.<br class="gmail_msg">

>     >><br class="gmail_msg">

>     >> Is this so bad?<br class="gmail_msg">

>     >><br class="gmail_msg">

>     >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov    rax,QWORD PTR [rbp-0xb0]<br class="gmail_msg">

>     >> 0000000000240815 <VecMAXPY_Seq+0x305> add    ebx,0x1<br class="gmail_msg">

>     >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR<br class="gmail_msg">

>     >> [rax+r9*1]<br class="gmail_msg">

>     >> 000000000024081e <VecMAXPY_Seq+0x30e> mov    rax,QWORD PTR [rbp-0xa8]<br class="gmail_msg">

>     >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd<br class="gmail_msg">

>     ymm0,ymm8,YMMWORD PTR<br class="gmail_msg">

>     >> [rax+r9*1]<br class="gmail_msg">

>     >> 000000000024082b <VecMAXPY_Seq+0x31b> mov    rax,QWORD PTR [rbp-0xb8]<br class="gmail_msg">

>     >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd<br class="gmail_msg">

>     ymm0,ymm6,YMMWORD PTR<br class="gmail_msg">

>     >> [rax+r9*1]<br class="gmail_msg">

>     >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd<br class="gmail_msg">

>     ymm0,ymm5,YMMWORD PTR<br class="gmail_msg">

>     >> [r10+r9*1]<br class="gmail_msg">

>     >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR<br class="gmail_msg">

>     >> [r11+r9*1]<br class="gmail_msg">

>     >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR<br class="gmail_msg">

>     [r11+r9*1],ymm0<br class="gmail_msg">

>     >> 000000000024084a <VecMAXPY_Seq+0x33a> add    r9,0x20<br class="gmail_msg">

>     >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp    DWORD PTR [rbp-0xa0],ebx<br class="gmail_msg">

>     >> 0000000000240854 <VecMAXPY_Seq+0x344> ja     000000000024080e<br class="gmail_msg">

>     >> <VecMAXPY_Seq+0x2fe><br class="gmail_msg">

>     >><br class="gmail_msg">

>     ><br class="gmail_msg">

>     > I agree that is what we should see. It cannot be what Fillippo has<br class="gmail_msg">

>     if he is<br class="gmail_msg">

>     > getting ~4x with the template stuff.<br class="gmail_msg">

><br class="gmail_msg">

>     I'm using gcc.  Fillippo, can you make an easy to run test that we can<br class="gmail_msg">

>     evaluate on Xeon and KNL?<br class="gmail_msg">

><br class="gmail_msg">

</blockquote></div></div>