[petsc-users] PETSc with modern C++
Karl Rupp
rupp at iue.tuwien.ac.at
Wed Apr 5 04:19:52 CDT 2017
Hi Filippo,
did you compile PETSc with the same level of optimization as your
template code? In particular, did you turn debugging off for the timings?
Either way, let me share some of the mental stages I've gone through
with ViennaCL. It started out with the same premises as you provide:
There is nice C++, compilers are 'now' mature enough, and everything can
be made a lot 'easier' and 'simpler' for the user. This was in 2010,
back when all this GPU computing took off.
Fast-forward to today: I consider the extensive use of C++ with
expression templates and other fanciness to be the biggest mistake of
ViennaCL. Here is an incomplete list of reasons why:
a) The 'generic' interface mimics that of Boost.uBLAS, so you can just
mix&match the two libraries. Problems: First, one replicates the really
poor design choices in uBLAS. Second, most users didn't even *think*
about using these generic routines for anything other than ViennaCL
types. There are just many more computational scientists familiar with
Fortran or C than with "modern C++".
b) Keeping the type system consistent. So you would probably introduce
a matrix, a vector, and operator overloads. For example, take y = A * x
for a matrix A and vectors y, x. Sooner or later, users will write y = R
* A * P * x and all of a sudden your left-to-right associativity results
in an evaluation as
y = ((R * A) * P) * x.
(How to 'solve' this particular case: Introduce even more expression
templates! See Blaze for an example. So you soon have hundreds of lines
of code just to fix problems you never had with just calling functions.)
c) Dealing with errors: Continuing the example above, what if R * A
runs out of memory? Sure, you can throw exceptions, but the resulting
stack trace will contain all the expression-template frenzy, and you
will have a hard time finding out whether the operation R * A or the
subsequent multiplication with P caused the error.
d) Expression templates are limited to a single line of code. Every
now and then, algorithms require repeated operations on different data.
For example, you may have vector operations of the form
x_i <- x_{i-1} + alpha * p_{i-1}
r_i <- r_{i-1} - alpha * y_i
p_i <- r_i + beta * p_{i-1}
(with alpha and beta being scalars, everything else being vectors)
in a pipelined conjugate gradient formulation. The "C"-way of doing this
is to grab the pointers to the data and compute all three vector
operations in a single for-loop, thus saving on repeated data access
from main memory. With expression templates, there is no way to make
this as efficient, because the lifetime of expression templates is
exactly one line of code.
e) Lack of a stable ABI. There is a good reason why many vendor
libraries come with a C interface, not with a C++ interface. If you try
to link C++ object files generated by different C++ compilers (it is
enough to have different versions of the same compiler, see Microsoft),
it is undefined behavior. In the best case, it fails with a linker
error. In the worst case, it produces random crashes right before the
end of a one-month simulation that you needed for a paper due in a
few days. If I remember correctly, one group got ambitious with C++ for
few days. If I remember correctly, one group got ambitious with C++ for
writing an MPI library, running into these kinds of problems with the
C++ standard library.
f) The entry bar gets way too high. PETSc's current code base allows
several dozens of users to contribute to each release. If some problem
shows up in a particular function, you just go there, hook up the
debugger, and figure out what is going on. Now assume there are C++
expression templates involved. Your stack trace may now spread over
several screen pages. Once you navigate to the offending line, you find
that a certain overload of a traits class coupled with two functors
provides wrong results. The reverse lookup of how exactly the traits
class was instantiated, which working set was passed in, and how those
functors interact may easily take many minutes to digest - even if you
are the one who wrote that piece of code. People new to PETSc are
likely to just give up. Note that this is a problem for new library
users across the spectrum of C++ libraries; you can find evidence e.g.
on the Eigen mailing list. In the end, C++ templates are leaky
abstractions, and their use will sooner or later hit the user.
Long story short: Keep in mind that there is a cost to "modern C++" in
exchange for eventual performance benefits. For PETSc applications, the
bottleneck all too often lies not in an expression that could be
'expression templated', but in areas where the use of C++ wouldn't make
a difference: algorithm selection, network performance, or poor process
placement, to name a few.
Nonetheless, don't feel discouraged; you may have better ideas and know
better ways of dealing with C++ than we do. A lightweight wrapper on top
of PETSc, similar to what petsc4py does for Python, is something that a
share of users may find handy. Whether it's worth the extra coding and
maintenance effort? I don't know.
Best regards,
Karli
On 04/05/2017 07:42 AM, Filippo Leonardi wrote:
> @jed: Your assembly is what I would've expected. Let me simplify my code
> and see if I can provide a useful test example. (Also: I assume your
> assembly is for Xeon, so I should definitely use AVX-512.)
>
> Let me get back at you in a few days (work permitting) with something
> you can use.
>
> From your example I wouldn't expect any benefit with my code compared to
> just calling petsc (for those simple kernels).
>
> A big plus I hadn't thought of would be that the compiler is really
> forced to vectorise (unlike in my case, where I might have messed up
> some config parameter).
>
> @barry: I'm definitely too young to comment here (i.e. it's me that
> changed, not the world). Definitely this is not new stuff, and, for
> instance, Armadillo/boost/Eigen have been successfully production ready
> for many years now. I somehow have the impression that, now that C++11 is
> more mainstream, it is much easier to write readable/maintainable
> code (still ugly as hell, though). I think we can now take a C++11
> compiler for granted on any "supercomputer", and even C++14 and soon
> C++17... and this makes development and interfaces much nicer.
>
> What I would like to see is something like PETSc (where I have nice,
> hidden MPI calls for instance), combined with the niceness of those
> libraries (where many operations can be written in a, if I might say so,
> more natural way). (My plan is: you did all the hard work, C++ can put a
> ribbon on it and see what comes out.)
>
> On 5 Apr 2017 5:39 am, "Jed Brown" <jed at jedbrown.org
> <mailto:jed at jedbrown.org>> wrote:
>
> Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>> writes:
>
> > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <jed at jedbrown.org
> <mailto:jed at jedbrown.org>> wrote:
> >
> >> Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>>
> writes:
> >>
> >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi
> <filippo.leon at gmail.com <mailto:filippo.leon at gmail.com>
> >> >
> >> > wrote:
> >> >
> >> >> I had weird issues where gcc (that I am using for my tests
> right now)
> >> >> wasn't vectorising properly (even enabling all flags, from
> >> tree-vectorize,
> >> >> to mavx). According to my tests, I know the Intel compiler was
> a bit
> >> better
> >> >> at that.
> >> >>
> >> >
> >> > We are definitely at the mercy of the compiler for this. Maybe
> Jed has an
> >> > idea why its not vectorizing.
> >>
> >> Is this so bad?
> >>
> >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov rax,QWORD PTR [rbp-0xb0]
> >> 0000000000240815 <VecMAXPY_Seq+0x305> add ebx,0x1
> >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR
> >> [rax+r9*1]
> >> 000000000024081e <VecMAXPY_Seq+0x30e> mov rax,QWORD PTR [rbp-0xa8]
> >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd
> ymm0,ymm8,YMMWORD PTR
> >> [rax+r9*1]
> >> 000000000024082b <VecMAXPY_Seq+0x31b> mov rax,QWORD PTR [rbp-0xb8]
> >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd
> ymm0,ymm6,YMMWORD PTR
> >> [rax+r9*1]
> >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd
> ymm0,ymm5,YMMWORD PTR
> >> [r10+r9*1]
> >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR
> >> [r11+r9*1]
> >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR
> [r11+r9*1],ymm0
> >> 000000000024084a <VecMAXPY_Seq+0x33a> add r9,0x20
> >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp DWORD PTR [rbp-0xa0],ebx
> >> 0000000000240854 <VecMAXPY_Seq+0x344> ja 000000000024080e
> >> <VecMAXPY_Seq+0x2fe>
> >>
> >
> > I agree that is what we should see. It cannot be what Filippo has
> if he is
> > getting ~4x with the template stuff.
>
> I'm using gcc. Filippo, can you make an easy-to-run test that we can
> evaluate on Xeon and KNL?
>