[petsc-dev] Integrating PFLOTRAN, PETSC & SAMRAI

Jed Brown jed at 59A2.org
Tue Jun 7 14:41:28 CDT 2011


On Tue, Jun 7, 2011 at 21:12, Boyce Griffith <griffith at cims.nyu.edu> wrote:

> In fact, it was the tests of assembled-versus-unassembled performance that
> you described on the libMesh list that led to me trying this out in the
> first place.  My (possibly faulty) reasoning was that if there could be some
> performance advantage to using assembled matrices for low order FE
> discretizations, then perhaps there also might be some for low order FD/FV
> discretizations.  At least in what I was doing, there did not appear to be
> any benefit to doing this, although matrix free is not tons faster either.


For FEM with 3D hex elements, the assembled versus unassembled tradeoff is
shown in the second figure here

https://github.com/jedbrown/dohp/wiki/Dohp

So assembled is good for scalar problems with lowest-order elements. As the
order and the number of components per node increase, unassembled becomes
preferable. Note that this exploits a tensor product basis, so the interfaces
in libMesh, deal.II, etc. will have much worse asymptotics as the order is
increased.

For FD, you have an explicit formula for the local residual that does not
involve quadrature and lots of element operations. This makes matrix-free
residuals very cheap.
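
To make the contrast concrete, here is a minimal sketch (not the BG/P kernel
and not PETSc code) of the same constant-coefficient 7-point operator applied
two ways: directly from the stencil formula, and through an assembled CSR
matrix. The function names, the pure-Python lists, the homogeneous Dirichlet
boundary treatment, and the lexicographic index map are all assumptions made
for illustration.

```python
def _idx(i, j, k, n):
    # Lexicographic index of grid point (i, j, k) on an n*n*n grid (assumed layout).
    return (i * n + j) * n + k


def apply_stencil_7pt(u, n, h=1.0):
    """Matrix-free y = A u for the 7-point Laplacian; values outside the grid are 0."""
    def at(i, j, k):
        if 0 <= i < n and 0 <= j < n and 0 <= k < n:
            return u[_idx(i, j, k, n)]
        return 0.0

    scale = 1.0 / (h * h)
    y = [0.0] * (n ** 3)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                y[_idx(i, j, k, n)] = scale * (
                    6.0 * at(i, j, k)
                    - at(i - 1, j, k) - at(i + 1, j, k)
                    - at(i, j - 1, k) - at(i, j + 1, k)
                    - at(i, j, k - 1) - at(i, j, k + 1)
                )
    return y


def assemble_csr_7pt(n, h=1.0):
    """Assemble the same operator as CSR arrays (row_ptr, col_idx, values)."""
    scale = 1.0 / (h * h)
    offsets = [(0, 0, 0), (-1, 0, 0), (1, 0, 0),
               (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for di, dj, dk in offsets:
                    ii, jj, kk = i + di, j + dj, k + dk
                    if 0 <= ii < n and 0 <= jj < n and 0 <= kk < n:
                        col_idx.append(_idx(ii, jj, kk, n))
                        is_diag = (di, dj, dk) == (0, 0, 0)
                        values.append(scale * (6.0 if is_diag else -1.0))
                row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def csr_matvec(row_ptr, col_idx, values, u):
    """y = A u using stored entries: each nonzero costs a value load and an index load."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[p] * u[col_idx[p]]
        y.append(acc)
    return y
```

Both paths compute the same vector; the matrix-free one touches only u and y,
while the assembled one also streams every stored value and column index.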

As perhaps an extreme example, for constant coefficients and a 27-point
stencil, we got 93% of FPU peak in L1 and 71% from memory on Blue Gene/P.
There are technical reasons why sparse mat-vec cannot saturate the memory
bus on BG/P, but even if it could, there is 27*(1.5)/2 = 20.25, so roughly
20 times more data to move through when you have a matrix.
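
One way to read that factor, as a small sketch: the interpretation of the 1.5
and the 2 below is an assumption, not something stated above. It takes each
stored matrix entry to cost an 8-byte value plus a 4-byte column index (1.5
doubles), and the vector traffic per row to be one double read and one double
written. The names are hypothetical.

```python
VALUE_BYTES = 8   # double-precision matrix entry or vector entry
INDEX_BYTES = 4   # assumption: 32-bit column index stored with each entry


def matrix_words_per_row(stencil_points):
    """Matrix data streamed per row, in units of one double."""
    return stencil_points * (VALUE_BYTES + INDEX_BYTES) / VALUE_BYTES


def vector_words_per_row():
    """Vector data per row: one input entry read, one output entry written."""
    return 2.0


def traffic_ratio(stencil_points):
    """Matrix traffic relative to vector traffic for one row of the mat-vec."""
    return matrix_words_per_row(stencil_points) / vector_words_per_row()


if __name__ == "__main__":
    for points in (7, 27):
        print(points, traffic_ratio(points))
```

Under these assumptions the 27-point case gives 20.25 and the 7-point case
gives 5.25; perfect cache reuse of the input vector is assumed in both.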

I don't have a performance number for 27-point assembled, but 7-point
assembled gets 7.0 Mstencil/s on BG/P. Our matrix-free stencil
implementation gets 178 Mstencil/s in L1 and 71 Mstencil/s from memory. Note
that it's much more likely for a given problem size to fit in L1 if there is
no matrix involved, so the in-L1 results are attainable in practice. (Our implementation
is not trivially obvious to the most casual observer, but it does
demonstrate what the hardware can do.)

BG/P stencil computations:
http://59a2.org/files/ppc450d_stencil_microkernel.pdf

7-point MatMult numbers from:
http://hpc.sagepub.com/content/early/2010/12/03/1094342010389857.abstract
(http://www.mcs.anl.gov/uploads/cels/papers/P1658.pdf)