[petsc-users] PETSc finite difference performance question (calculating gradient square)

Jed Brown jedbrown at mcs.anl.gov
Wed Jun 19 00:58:19 CDT 2013


Yonghui <lyh03259.aps at gmail.com> writes:

> Hi PETSc users and developers,
>
>  
>
> I am new to PETSc and I just want to discuss with users and developers a
> question that concerns me. My concern is the performance of calculating the
> Laplacian (nabla squared) in a parallel program implemented with PETSc.
>
>  
>
> For example, say one has a 3-dimensional function f(x,y,z). We want nabla^2
> f, i.e. (d^2/dx^2 + d^2/dy^2 + d^2/dz^2) f(x,y,z). (To make the problem
> simpler, assume the mixed derivatives such as d^2 f/dx dy vanish.)
>
> But on a computer, what we can usually do is finite differences. Thus, in
> order to calculate nabla^2 f at a point (x, y, z), we need all of its
> nearest neighbors.
>
> That is to say, at least we need x+1, x-1, y+1, y-1, z+1 and z-1 (assuming
> a 2nd-order central difference).
>
> In RAM, all the numbers are stored in a ONE-dimensional array. So the
> numbers are stored like this: .. f(x-1, y-1, z-1), f(x, y-1, z-1),
> f(x+1, y-1, z-1), .. f(x-1, y, z-1), f(x, y, z-1), f(x+1, y, z-1), ..,
> f(x-1, y+1, z-1), f(x, y+1, z-1), f(x+1, y+1, z-1), ..
>
> So in order to work it out, the first thing is to pick out the numbers
> needed for the calculation. So each time one needs to load a batch of
> numbers from RAM, pick out 1, 2, or 3 of them, and discard the rest, until
> all the numbers one needs are prepared for the calculation.
>
> It is clear that there is a huge waste here: the numbers needed for a
> point cannot all be loaded immediately. Thus, the bottleneck is not the
> computing time but the loading time.
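
For reference, the standard second-order central-difference approximation
being described is, with uniform grid spacing h and f_{i,j,k} = f(x_i, y_j, z_k),

  \nabla^2 f \approx \frac{f_{i+1,j,k} + f_{i-1,j,k} + f_{i,j+1,k} + f_{i,j-1,k} + f_{i,j,k+1} + f_{i,j,k-1} - 6 f_{i,j,k}}{h^2},

i.e. a 7-point "star" stencil that touches only the 6 nearest neighbors and
the center point.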

I'm not sure if your argument here is about latency or bandwidth, but
it's true that for sparse operations, memory access is almost always the
bottleneck.
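
To make that concrete, here is a minimal sketch of how such a stencil is
typically applied with PETSc's DMDA: each process stores its owned block of
the structured grid contiguously, a one-point ghost layer is filled from the
neighboring processes, and the 7-point update then reads only local memory.
The grid size, spacing, and periodic boundaries below are illustrative
assumptions, error checking is omitted, and the names follow current PETSc
releases (older releases spell the boundary types DMDA_BOUNDARY_*):

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Vec            f, lap, floc;
  PetscScalar ***a, ***b;
  PetscInt       i, j, k, xs, ys, zs, xm, ym, zm;
  PetscReal      h = 1.0/64.0;                /* assumed uniform grid spacing */

  PetscInitialize(&argc, &argv, NULL, NULL);
  /* 64^3 periodic grid, star stencil of width 1: the 6 nearest neighbors */
  DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC,
               DM_BOUNDARY_PERIODIC, DMDA_STENCIL_STAR, 64, 64, 64,
               PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, 1, 1,
               NULL, NULL, NULL, &da);
  DMSetUp(da);
  DMCreateGlobalVector(da, &f);
  VecDuplicate(f, &lap);
  DMCreateLocalVector(da, &floc);             /* owned block plus ghost layer */

  /* ... fill f with the sampled function values ... */

  /* one message per neighboring process fills the ghost points */
  DMGlobalToLocalBegin(da, f, INSERT_VALUES, floc);
  DMGlobalToLocalEnd(da, f, INSERT_VALUES, floc);

  DMDAVecGetArray(da, floc, &a);              /* a[k][j][i] uses global indices */
  DMDAVecGetArray(da, lap, &b);
  DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);
  for (k = zs; k < zs + zm; k++)
    for (j = ys; j < ys + ym; j++)
      for (i = xs; i < xs + xm; i++)          /* unit stride in i: streaming access */
        b[k][j][i] = (a[k][j][i+1] + a[k][j][i-1] + a[k][j+1][i] + a[k][j-1][i]
                      + a[k+1][j][i] + a[k-1][j][i] - 6.0*a[k][j][i]) / (h*h);
  DMDAVecRestoreArray(da, floc, &a);
  DMDAVecRestoreArray(da, lap, &b);

  VecDestroy(&floc); VecDestroy(&lap); VecDestroy(&f); DMDestroy(&da);
  PetscFinalize();
  return 0;
}

The ghost exchange gathers the non-local neighbors once per application, and
the innermost loop walks memory with unit stride, so nothing is fetched one
number at a time; what remains is the bandwidth cost of streaming the arrays
through the cache hierarchy.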

> Q: So how does the PETSc library handle this kind of problem? Could you
> please explain it to me, if you understand how it is handled?

Some kernels have software prefetch to reduce unpredicted cache misses
and to ensure that, upon eviction from L1, only the useful data remains
in the higher-level caches.  Bandwidth is the main limitation in sparse
computation and cannot be avoided by clever tricks unless the data
structure is changed entirely.  See the second figure here for an
example of how dramatic such transformations can be:

  https://github.com/jedbrown/dohp/wiki/Dohp
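
As a back-of-the-envelope illustration of why bandwidth dominates (a generic
CSR sparse matrix-vector kernel, not PETSc's actual MatMult implementation):
each nonzero requires at least 12 bytes of traffic (an 8-byte value plus a
4-byte column index) but contributes only 2 flops.

void csr_spmv(int nrows, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y)
{
  for (int i = 0; i < nrows; i++) {
    double sum = 0.0;
    for (int j = rowptr[i]; j < rowptr[i+1]; j++)
      sum += val[j] * x[colidx[j]];   /* 2 flops per >= 12 bytes of traffic */
    y[i] = sum;
  }
}

At roughly 0.17 flops/byte, a node sustaining, say, 50 GB/s of memory
bandwidth tops out near 8 GF/s in this kernel regardless of its peak
floating-point rate; that is the gap the transformation in that figure
attacks.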

But if you go high-order and matrix-free like in that figure, you still
have to deal with preconditioning somewhere.  Unless you have the
expertise and motivation to write custom matrix-free preconditioners
(like some multigrid methods), preconditioners will be constructed
from assembled matrices, and the bandwidth issues will apply.
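
For completeness, the matrix-free route in PETSc goes through a shell
matrix (MATSHELL), where the user supplies the action of the operator
instead of stored entries.  A minimal sketch, with the routine and context
names made up for illustration:

#include <petscmat.h>

/* applies the operator to x and stores the result in y without any stored
   matrix; a real implementation would call something like the DMDA stencil
   loop above, here an identity keeps the example self-contained */
static PetscErrorCode MyShellMult(Mat A, Vec x, Vec y)
{
  return VecCopy(x, y);
}

/* n: local rows/columns, N: global rows/columns; ctx can carry a DM,
   coefficients, or whatever the mult routine needs */
PetscErrorCode CreateMatrixFreeOperator(MPI_Comm comm, PetscInt n, PetscInt N,
                                        void *ctx, Mat *A)
{
  PetscErrorCode ierr;
  ierr = MatCreateShell(comm, n, n, N, N, ctx, A);CHKERRQ(ierr);
  ierr = MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))MyShellMult);CHKERRQ(ierr);
  return 0;
}

Such a shell matrix can be handed to KSP as the operator, but, as noted
above, most preconditioners still want an assembled matrix, so in practice
one either supplies a separately assembled (often lower-order) matrix for
the preconditioner or writes a custom PC.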