[petsc-users] PETSc with modern C++
Karl Rupp
rupp at iue.tuwien.ac.at
Wed Apr 5 04:19:52 CDT 2017
Hi Filippo,
did you compile PETSc with the same level of optimization as your
template code? In particular, did you turn debugging off for the timings?
Either way, let me share some of the mental stages I've gone through
with ViennaCL. It started out with the same premises as you provide:
There is nice C++, compilers are 'now' mature enough, and everything can
be made a lot 'easier' and 'simpler' for the user. This was in 2010,
back when all this GPU computing took off.
Fast-forward to today: I consider the extensive use of C++ with
expression templates and other fanciness to be the biggest mistake of
ViennaCL. Here is an incomplete list of reasons why:
a) The 'generic' interface mimics that of Boost.uBLAS, so you can just
mix&match the two libraries. Problems: First, one replicates the really
poor design choices in uBLAS. Second, most users didn't even *think*
about using these generic routines for anything other than ViennaCL
types. There are just many more computational scientists familiar with
Fortran or C than with "modern C++".
b) Keeping the type system consistent. So you would probably introduce
a matrix, a vector, and operator overloads. For example, take y = A * x
for a matrix A and vectors y, x. Sooner or later, users will write y = R
* A * P * x and all of a sudden your left-to-right associativity results
in an evaluation as
y = ((R * A) * P) * x.
(How to 'solve' this particular case: Introduce even more expression
templates! See Blaze for an example. So you soon have hundreds of lines
of code just to fix problems you never had with just calling functions.)
c) Dealing with errors: Continuing the example above, what if R * A
runs out of memory? Sure, you can throw exceptions, but the resulting
stack trace will contain all the expression-template frenzy, and you
will have a hard time finding out whether the operation R * A or the
subsequent multiplication with P caused the error.
d) Expression templates are limited to a single line of code. Every
now and then, algorithms require repeated operations on different data.
For example, you may have vector operations of the form
x_i <- x_{i-1} + alpha * p_{i-1}
r_i <- r_{i-1} - alpha * y_i
p_i <- r_i + beta * p_{i-1}
(with alpha and beta being scalars, everything else being vectors)
in a pipelined conjugate gradient formulation. The "C"-way of doing this
is to grab the pointers to the data and compute all three vector
operations in a single for-loop, thus saving on repeated data access
from main memory. With expression templates, there is no way to make
this as efficient, because the lifetime of expression templates is
exactly one line of code.
e) Lack of a stable ABI. There is a good reason why many vendor
libraries come with a C interface, not with a C++ interface. If you try
to link C++ object files generated by different C++ compilers (it is
enough to have different versions of the same compiler, see Microsoft),
it is undefined behavior. In the best case, it fails with a linker
error. In the worst case, it produces random crashes right before the
end of a one-month simulation that you needed for a paper due in a
few days. If I remember correctly, one group got ambitious with C++ for
few days. If I remember correctly, one group got ambitious with C++ for
writing an MPI library, running into these kinds of problems with the
C++ standard library.
f) The entry bar gets way too high. PETSc's current code base allows
several dozens of users to contribute to each release. If some problem
shows up in a particular function, you just go there, hook up the
debugger, and figure out what is going on. Now assume there are C++
expression templates involved. Your stack trace may now spread over
several screen pages. Once you navigate to the offending line, you find
that a certain overload of a traits class coupled with two functors
provides wrong results. The reverse lookup of how exactly the traits
class was instantiated, which working set was passed in, and how those
functors interact may easily take many minutes to digest - even if you
are the one who wrote that piece of code. People new to PETSc are
likely to just give up. Note that this is a problem for new library
users across the spectrum of C++ libraries; you can find evidence e.g.
on the Eigen mailing list. In the end, C++ templates are leaky
abstractions, and their use will sooner or later hit the user.
Long story short: Keep in mind that there is a cost to "modern C++" in
exchange for eventual performance benefits. For PETSc applications, the
bottleneck all too often lies not in an expression that could be
'expression templated', but in areas where the use of C++ wouldn't make
a difference: algorithm selection, network performance, or poor process
placement, to name a few.
Nonetheless, don't feel discouraged; you may have better ideas and know
better ways of dealing with C++ than we do. A lightweight wrapper on top
of PETSc, similar to what petsc4py does for Python, is something that a
share of users may find handy. Whether it's worth the extra coding and
maintenance effort? I don't know.
Best regards,
Karli
On 04/05/2017 07:42 AM, Filippo Leonardi wrote:
> @jed: Your assembly is what I would've expected. Let me simplify my code
> and see if I can provide a useful test example. (Also: I assume your
> assembly is for Xeon, so I should definitely use AVX-512.)
>
> Let me get back at you in a few days (work permitting) with something
> you can use.
>
> From your example I wouldn't expect any benefit with my code compared to
> just calling petsc (for those simple kernels).
>
> A big plus I hadn't thought of would be that the compiler is really
> forced to vectorise (unlike in my case, where I might have messed up
> some config parameter).
>
> @barry: I'm definitely too young to comment here (i.e. it's me that
> changed, not the world). Definitely this is not new stuff, and, for
> instance, Armadillo/boost/Eigen have been successfully production ready
> for many years now. I somehow have the impression that, now that C++11 is
> more mainstream, it is much easier to write readable/maintainable
> code (still ugly as hell, though). I think we can now take a C++11
> compiler for granted on any "supercomputer", and even C++14 and soon
> C++17... and this makes development and interfaces much nicer.
>
> What I would like to see is something like PETSc (where I have nice,
> hidden MPI calls for instance), combined with the niceness of those
> libraries (where many operations can be written in a, if I might say so,
> more natural way). (My plan is: you did all the hard work, C++ can put a
> ribbon on it and see what comes out.)
>
> On 5 Apr 2017 5:39 am, "Jed Brown" <jed at jedbrown.org
> <mailto:jed at jedbrown.org>> wrote:
>
> Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>> writes:
>
> > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <jed at jedbrown.org
> <mailto:jed at jedbrown.org>> wrote:
> >
> >> Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>>
> writes:
> >>
> >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi
> <filippo.leon at gmail.com <mailto:filippo.leon at gmail.com>
> >> >
> >> > wrote:
> >> >
> >> >> I had weird issues where gcc (that I am using for my tests
> right now)
> >> >> wasn't vectorising properly (even enabling all flags, from
> >> tree-vectorize,
> >> >> to mavx). According to my tests, I know the Intel compiler was
> a bit
> >> better
> >> >> at that.
> >> >>
> >> >
> >> > We are definitely at the mercy of the compiler for this. Maybe
> Jed has an
> >> > idea why its not vectorizing.
> >>
> >> Is this so bad?
> >>
> >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov rax,QWORD PTR [rbp-0xb0]
> >> 0000000000240815 <VecMAXPY_Seq+0x305> add ebx,0x1
> >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR
> >> [rax+r9*1]
> >> 000000000024081e <VecMAXPY_Seq+0x30e> mov rax,QWORD PTR [rbp-0xa8]
> >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd
> ymm0,ymm8,YMMWORD PTR
> >> [rax+r9*1]
> >> 000000000024082b <VecMAXPY_Seq+0x31b> mov rax,QWORD PTR [rbp-0xb8]
> >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd
> ymm0,ymm6,YMMWORD PTR
> >> [rax+r9*1]
> >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd
> ymm0,ymm5,YMMWORD PTR
> >> [r10+r9*1]
> >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR
> >> [r11+r9*1]
> >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR
> [r11+r9*1],ymm0
> >> 000000000024084a <VecMAXPY_Seq+0x33a> add r9,0x20
> >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp DWORD PTR [rbp-0xa0],ebx
> >> 0000000000240854 <VecMAXPY_Seq+0x344> ja 000000000024080e
> >> <VecMAXPY_Seq+0x2fe>
> >>
> >
> > I agree that is what we should see. It cannot be what Filippo has
> if he is
> > getting ~4x with the template stuff.
>
> I'm using gcc. Filippo, can you make an easy-to-run test that we can
> evaluate on Xeon and KNL?
>