[petsc-users] PETSc with modern C++

Filippo Leonardi filippo.leon at gmail.com
Wed Apr 5 06:39:53 CDT 2017


Awesome, I appreciate the feedback from all of you.

@Karli: Just to be clear: my original idea is to wrap PETSc and provide C++
bindings in a clean, idiomatic C++ way. An added value would then be expression
templates ("easy" to implement once you have a clear infrastructure), with
all the benefits and drawbacks we discussed. But this would just be an
"extra" thing.

For my computations: I really have to say that I must've messed up
something (gcc not vectorizing some PETSc call), and I really do not expect
a 4x improvement from templates. If I manage, I will run some precise tests
and provide you with a simple code to reproduce them (if you are interested),
since I am not sure I will have KNL available for those tests.

On Wed, 5 Apr 2017 at 11:19 Karl Rupp <rupp at iue.tuwien.ac.at> wrote:

> Hi Filippo,
>
> did you compile PETSc with the same level of optimization as your
> template code? In particular, did you turn debugging off for the timings?
>
> Either way, let me share some of the mental stages I've gone through
> with ViennaCL. It started out with the same premises as you provide:
> There is nice C++, compilers are 'now' mature enough, and everything can
> be made a lot 'easier' and 'simpler' for the user. This was in 2010,
> back when all this GPU computing took off.
>
> Now, fast-forward to this day, I consider the extensive use of C++ with
> expression templates and other fanciness to be the biggest mistake of
> ViennaCL. Here is an incomplete list of why:
>
>   a) The 'generic' interface mimics that of Boost.uBLAS, so you can just
> mix&match the two libraries. Problems: First, one replicates the really
> poor design choices in uBLAS. Second, most users didn't even *think*
> about using these generic routines for anything other than ViennaCL
> types. There are just many more computational scientists familiar with
> Fortran or C than with "modern C++".
>
>   b) Keeping the type system consistent. So you would probably introduce
> a matrix, a vector, and operator overloads. For example, take y = A * x
> for a matrix A and vectors y, x. Sooner or later, users will write y = R
> * A * P * x and all of a sudden your left-to-right associativity results
> in an evaluation as
>    y = ((R * A) * P) * x.
> (How to 'solve' this particular case: Introduce even more expression
> templates! See Blaze for an example. So you soon have hundreds of lines
> of code just to fix problems you never had with just calling functions.)
>
>   c) Dealing with errors: Continuing with the above example, what if R * A
> runs out of memory? Sure, you can throw exceptions, but the resulting
> stack trace will contain all the expression template frenzy, plus you
> will have a hard time finding out whether the operation R * A or the
> subsequent multiplication with P caused the error.
>
>   d) Expression templates are limited to a single line of code. Every
> now and then, algorithms require repeated operations on different data.
> For example, you may have vector operations of the form
>    x_i <- x_{i-1} + alpha * p_{i-1}
>    r_i <- r_{i-1} - alpha * y_i
>    p_i <- r_i + beta * p_{i-1}
> (with alpha and beta being scalars, everything else being vectors)
> in a pipelined conjugate gradient formulation. The "C"-way of doing this
> is to grab the pointers to the data and compute all three vector
> operations in a single for-loop, thus saving on repeated data access
> from main memory. With expression templates, there is no way to make
> this as efficient, because the lifetime of expression templates is
> exactly one line of code.
>
>   e) Lack of a stable ABI. There is a good reason why many vendor
> libraries come with a C interface, not with a C++ interface. If you try
> to link C++ object files generated by different C++ compilers (it is
> enough to have different versions of the same compiler, see Microsoft),
> it is undefined behavior. In best case, it fails with a linker error. In
> worst case, it will produce random crashes right before the end of a
> one-month simulation that you needed for your paper to be submitted in a
> few days. If I remember correctly, one group got ambitious with C++ for
> writing an MPI library, running into these kinds of problems with the
> C++ standard library.
>
>   f) The entry bar gets way too high. PETSc's current code base allows
> several dozen users to contribute to each release. If some problem
> shows up in a particular function, you just go there, hook up the
> debugger, and figure out what is going on. Now assume there are C++
> expression templates involved. Your stack trace may now spread over
> several screen pages. Once you navigate to the offending line, you find
> that a certain overload of a traits class coupled with two functors
> provides wrong results. The reverse lookup of how exactly the traits
> class was instantiated, which working set was passed in, and how those
> functors interact may easily take you many minutes to digest - if you
> are the guy who has written that piece of code. People new to PETSc are
> likely to just give up. Note that this is a problem for new library
> users across the spectrum of C++ libraries; you can find evidence e.g.
> on the Eigen mailing list. In the end, C++ templates are leaky
> abstractions and their use will sooner or later hit the user.
>
> Long story short: Keep in mind that there is a cost to "modern C++" in
> exchange for eventual performance benefits. For PETSc applications, the
> bottleneck is all too often not the performance of an expression that
> could be 'expression templated', but in areas where the use of C++
> wouldn't make a difference: algorithm selection, network performance, or
> poor process placement, just to name a few.
>
> Nonetheless, don't feel set back; you may have better ideas and know
> better ways of dealing with C++ than we do. A lightweight wrapper on top
> of PETSc, similar to what petsc4py does for Python, is something that a
> share of users may find handy. Whether it's worth the extra coding and
> maintenance effort? I don't know.
>
> Best regards,
> Karli
>
>
> On 04/05/2017 07:42 AM, Filippo Leonardi wrote:
> > @jed: Your assembly is what I would've expected. Let me simplify my code
> > and see if I can provide a useful test example. (also: I assume your
> > assembly is for xeon, so I should definitely use avx512).
> >
> > Let me get back at you in a few days (work permitting) with something
> > you can use.
> >
> > From your example I wouldn't expect any benefit with my code compared to
> > just calling PETSc (for those simple kernels).
> >
> > A big plus I hadn't thought of would be that the compiler is really
> > forced to vectorise (like in my case, where I might have messed up some
> > config parameter).
> >
> > @barry: I'm definitely too young to comment here (i.e. it's me that
> > changed, not the world). Definitely this is not new stuff, and, for
> > instance, Armadillo/boost/Eigen have been successfully production ready
> > for many years now. I somehow have the impression that, now that c++11 is
> > more mainstream, it is much easier to write readable and maintainable
> > code (still ugly as hell, though). I think we can now take for granted a
> > c++11 compiler on any "supercomputer", and even c++14 and soon c++17...
> > and this makes development and interfaces much nicer.
> >
> > What I would like to see is something like PETSc (where I have nice,
> > hidden MPI calls for instance), combined with the niceness of those
> > libraries (where many operations can be written in a, if I might say so,
> > more natural way). (My plan is: you did all the hard work, C++ can put a
> > ribbon on it and see what comes out.)
> >
> > On 5 Apr 2017 5:39 am, "Jed Brown" <jed at jedbrown.org
> > <mailto:jed at jedbrown.org>> wrote:
> >
> >     Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>>
> writes:
> >
> >     > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <jed at jedbrown.org
> >     <mailto:jed at jedbrown.org>> wrote:
> >     >
> >     >> Matthew Knepley <knepley at gmail.com <mailto:knepley at gmail.com>>
> >     writes:
> >     >>
> >     >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi
> >     <filippo.leon at gmail.com <mailto:filippo.leon at gmail.com>
> >     >> >
> >     >> > wrote:
> >     >> >
> >     >> >> I had weird issues where gcc (that I am using for my tests
> >     right now)
> >     >> >> wasn't vectorising properly (even enabling all flags, from
> >     >> tree-vectorize,
> >     >> >> to mavx). According to my tests, I know the Intel compiler was
> >     a bit
> >     >> better
> >     >> >> at that.
> >     >> >>
> >     >> >
> >     >> > We are definitely at the mercy of the compiler for this. Maybe
> >     Jed has an
> >     >> > idea why it's not vectorizing.
> >     >>
> >     >> Is this so bad?
> >     >>
> >     >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov    rax,QWORD PTR [rbp-0xb0]
> >     >> 0000000000240815 <VecMAXPY_Seq+0x305> add    ebx,0x1
> >     >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR [rax+r9*1]
> >     >> 000000000024081e <VecMAXPY_Seq+0x30e> mov    rax,QWORD PTR [rbp-0xa8]
> >     >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd ymm0,ymm8,YMMWORD PTR [rax+r9*1]
> >     >> 000000000024082b <VecMAXPY_Seq+0x31b> mov    rax,QWORD PTR [rbp-0xb8]
> >     >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd ymm0,ymm6,YMMWORD PTR [rax+r9*1]
> >     >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd ymm0,ymm5,YMMWORD PTR [r10+r9*1]
> >     >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR [r11+r9*1]
> >     >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR [r11+r9*1],ymm0
> >     >> 000000000024084a <VecMAXPY_Seq+0x33a> add    r9,0x20
> >     >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp    DWORD PTR [rbp-0xa0],ebx
> >     >> 0000000000240854 <VecMAXPY_Seq+0x344> ja     000000000024080e <VecMAXPY_Seq+0x2fe>
> >     >>
> >     >
> >     > I agree that is what we should see. It cannot be what Filippo has
> >     > if he is getting ~4x with the template stuff.
> >
> >     I'm using gcc.  Filippo, can you make an easy-to-run test that we
> >     can evaluate on Xeon and KNL?
> >
>

