[petsc-users] PETSc with modern C++

Matthew Knepley knepley at gmail.com
Tue Apr 4 14:39:29 CDT 2017


On Tue, Apr 4, 2017 at 1:19 PM, Filippo Leonardi <filippo.leon at gmail.com>
wrote:

> You are in fact right, it is the same speedup  of approximatively 2.5x
> (with 2 ranks), my brain rounded up to 3. (This was just a test done in 10
> min on my Workstation, so no pretence to be definite, I just wanted to have
> an indication).
>

Hmm, it seems like PetscKernelAXPY4() is just not vectorizing correctly
then. I would be interested to see your code.

As you say, I am using OpenBLAS, so I wouldn't be surprised of those
> results. If/when I use MKL (or something similar), I really do not expect
> such an improvement).
>
> Since you seem interested (if you are interested, I can give you all the
> details): the comparison I make, is with "petscxx" which is my template
> code (which uses a single loop) using AVX (I force PETSc to align the
> memory to 32 bit boundary and then I use packets of 4 doubles). Also notice
> that I use vectors with nice lengths, so there is no need to "peel" the end
> of the loop. The "PETSc" simulation is using PETSc's VecMAXPY.
>

Thanks,

   Matt


> On Tue, 4 Apr 2017 at 19:12 Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>>   MAXPY isn't really a BLAS 1 since it can reuse some data in certain
>> vectors.
>>
>>
>> > On Apr 4, 2017, at 10:25 AM, Filippo Leonardi <filippo.leon at gmail.com>
>> wrote:
>> >
>> > I really appreciate the feedback. Thanks.
>> >
>> > That of deadlock, when the order of destruction is not preserved, is a
>> point I hadn't thought of. Maybe it can be cleverly addressed.
>> >
>> > PS: If you are interested, I ran some benchmark on BLAS1 stuff and, for
>> a single processor, I obtain:
>> >
>> > Example for MAXPY, with expression templates:
>> > BM_Vector_petscxx_MAXPY/8 38 ns 38 ns 18369805
>> > BM_Vector_petscxx_MAXPY/64 622 ns 622 ns 1364335
>> > BM_Vector_petscxx_MAXPY/512 281 ns 281 ns 2477718
>> > BM_Vector_petscxx_MAXPY/4096 2046 ns 2046 ns 349954
>> > BM_Vector_petscxx_MAXPY/32768 18012 ns 18012 ns 38788
>> > BM_Vector_petscxx_MAXPY_BigO 0.55 N 0.55 N
>> > BM_Vector_petscxx_MAXPY_RMS 7 % 7 %
>> > Direct call to MAXPY:
>> > BM_Vector_PETSc_MAXPY/8 33 ns 33 ns 20973674
>> > BM_Vector_PETSc_MAXPY/64 116 ns 116 ns 5992878
>> > BM_Vector_PETSc_MAXPY/512 731 ns 731 ns 963340
>> > BM_Vector_PETSc_MAXPY/4096 5739 ns 5739 ns 122414
>> > BM_Vector_PETSc_MAXPY/32768 46346 ns 46346 ns 15312
>> > BM_Vector_PETSc_MAXPY_BigO 1.41 N 1.41 N
>> > BM_Vector_PETSc_MAXPY_RMS 0 % 0 %
>> >
>> > And 3x speedup on 2 MPI ranks (not much communication here, anyway). I
>> am now convinced that this warrants some further investigation/testing.
>> >
>> >
>> > On Tue, 4 Apr 2017 at 01:08 Jed Brown <jed at jedbrown.org> wrote:
>> > Matthew Knepley <knepley at gmail.com> writes:
>> >
>> > >> BLAS. (Here a interesting point opens: I assume an efficient BLAS
>> > >>
>> > >> implementation, but I am not so sure about how the different BLAS do
>> > >> things
>> > >>
>> > >> internally. I work from the assumption that we have a very well
>> tuned BLAS
>> > >>
>> > >> implementation at our disposal).
>> > >>
>> > >
>> > > The speed improvement comes from pulling vectors through memory fewer
>> > > times by merging operations (kernel fusion).
>> >
>> > Typical examples are VecMAXPY and VecMDot, but note that these are not
>> > xGEMV because the vectors are independent arrays rather than single
>> > arrays with a constant leading dimension.
>> >
>> > >> call VecGetArray. However I will inevitably foget to return the
>> array to
>> > >>
>> > >> PETSc. I could have my new VecArray returning an object that
>> restores the
>> > >>
>> > >> array
>> > >>
>> > >> when it goes out of scope. I can also flag the function with
>> [[nodiscard]]
>> > >> to
>> > >>
>> > >> prevent the user to destroy the returned object from the start.
>> > >>
>> > >
>> > > Jed claims that this pattern is no longer preferred, but I have
>> forgotten
>> > > his argument.
>> > > Jed?
>> >
>> > Destruction order matters and needs to be collective.  If an error
>> > condition causes destruction to occur in a different order on different
>> > processes, you can get deadlock.  I would much rather have an error
>> > leave some resources (for the OS to collect) than escalate into
>> > deadlock.
>> >
>> > > We have had this discussion for years on this list. Having separate
>> names
>> > > for each type
>> > > is really ugly and does not achieve what we want. We want smooth
>> > > interoperability between
>> > > objects with different backing types, but it is still not clear how
>> to do
>> > > this.
>> >
>> > Hide it internally and implicitly promote.  Only the *GetArray functions
>> > need to be parametrized on numeric type.  But it's a lot of work on the
>> > backend.
>>
>>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20170404/d8a53a5f/attachment.html>


More information about the petsc-users mailing list