[petsc-users] GAMG scaling

Mark Adams mfadams at lbl.gov
Mon Dec 24 23:24:37 CST 2018


On Tue, Dec 25, 2018 at 12:10 AM Jed Brown <jed at jedbrown.org> wrote:

> Mark Adams <mfadams at lbl.gov> writes:
>
> > On Mon, Dec 24, 2018 at 4:56 PM Jed Brown <jed at jedbrown.org> wrote:
> >
> >> Mark Adams via petsc-users <petsc-users at mcs.anl.gov> writes:
> >>
> >> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
> >> > attached; NB, this is code that I wrote in grad school). It is memory
> >> > efficient and simple: just four nested loops i, j, I, J with C(I,J) =
> >> > P(i,I)*A(i,j)*P(j,J). Eyeballing the numbers, and from new data that I am
> >> > getting from my bone modeling colleagues who use this old code on
> >> > Stampede now, the times look reasonable compared to GAMG. This is
> >> > optimized for elasticity, where I unroll loops (so it is really six
> >> > nested loops).
> >>
> >> Is the A above meant to include some ghosted rows?
> >>
> >
> > You could, but I was thinking of having i in the outer loop. In C(I,J) =
> > P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only cover the local
> > rows of A (and of the left term P).
>
> Okay, so you need to gather those rows of P referenced by the
> off-diagonal parts of A.


Yes, and this looks correct.


> Once you have them, do
>
>   for i:
>     v[:] = 0 # sparse vector
>     for j:
>       v[:] += A[i,j] * P[j,:]
>     for I:
>       C[I,:] += P[i,I] * v[:]
>
> One inefficiency is that you don't actually get "hits" on all the
> entries of C[I,:], but that much remains no matter how you reorder loops
> (unless you make I the outermost).
>
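
For concreteness, here is a minimal serial Python sketch of that loop
ordering (not PETSc code; the dict-of-dicts storage and the name ptap are
just for illustration):

  def ptap(A, P):
      """Compute C[I][J] = sum over i, j of P[i][I] * A[i][j] * P[j][J],
      with A and P stored as dict-of-dicts (row -> {col: value})."""
      C = {}
      for i, Arow in A.items():
          # v = (row i of A) * P, accumulated as a sparse row {J: value}
          v = {}
          for j, a_ij in Arow.items():
              for J, p_jJ in P[j].items():
                  v[J] = v.get(J, 0.0) + a_ij * p_jJ
          # scatter v into the rows I of C hit by the column entries P[i][I]
          for I, p_iI in P[i].items():
              Crow = C.setdefault(I, {})
              for J, vJ in v.items():
                  Crow[J] = Crow.get(J, 0.0) + p_iI * vJ
      return C

  # Tiny check: a 2x2 A restricted by a 2x1 P gives the 1x1 coarse operator.
  A = {0: {0: 2.0, 1: -1.0}, 1: {0: -1.0, 1: 2.0}}
  P = {0: {0: 1.0}, 1: {0: 1.0}}
  print(ptap(A, P))  # -> {0: {0: 2.0}}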

> >> > In thinking about this now, I think you want to make a local copy of P
> >> > with rows (j) for every column in A that you have locally, then
> >> > transpose this local copy for the P(j,J) term. A sparse AXPY on j. (My
> >> > code uses a special tree data structure, but a matrix is simpler.)
> >>
> >> Why transpose for P(j,J)?
> >>
> >
> > (Premature) optimization. I was thinking of 'j' being in the inner loop
> > and doing a sparse inner product, but now that I think about it there are
> > other options.
>
> Sparse inner products tend to be quite inefficient.  Explicit blocking
> helps some, but I would try to avoid it.
>

Yeah, the design space here is non-trivial.
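
And to make the gather step concrete, here is a hedged mpi4py sketch of
collecting the rows of P referenced by the off-diagonal columns of my local
rows of A (again not PETSc code; it assumes a 1D block-row partition and the
same dict-of-dicts row storage, and the names are made up for illustration):

  from mpi4py import MPI

  def gather_ghost_rows(comm, P_local, needed_cols, row_ranges):
      """Return {j: row_of_P} for every needed global row j owned elsewhere.
      row_ranges[r] = (first_row, one_past_last_row) owned by rank r;
      P_local maps my global row indices to {J: value} dicts."""
      size = comm.Get_size()

      def owner(j):
          for r, (lo, hi) in enumerate(row_ranges):
              if lo <= j < hi:
                  return r

      # Tell each owning rank which of its rows of P we need.
      req = [[] for _ in range(size)]
      for j in needed_cols:
          req[owner(j)].append(j)
      wanted = comm.alltoall(req)  # rows that other ranks want from me

      # Reply with the requested rows, then assemble the ghost rows we get back.
      rep = comm.alltoall([[(j, P_local[j]) for j in lst] for lst in wanted])
      return {j: row for lst in rep for (j, row) in lst}

  # Usage (illustrative):
  #   ghost = gather_ghost_rows(MPI.COMM_WORLD, P_local, needed_cols, row_ranges)

PETSc's MatPtAP handles this kind of communication internally, of course; the
sketch is just meant to show the pattern we are talking about.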

BTW, I have a Cal ME grad student I've been working with on getting my old
parallel FE / Prometheus code running on Stampede for his bone modeling
problems. He started from zero in HPC, but he is eager and has been picking
it up. If there is interest, we could get performance data with the existing
code as a benchmark, and we could generate matrices if anyone wants to look
into this.

