[petsc-users] Question about matrix permutation

Jed Brown jed at 59A2.org
Fri Jan 29 15:58:02 CST 2010


On Fri, 29 Jan 2010 14:57:51 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
>     This is why I have the 1.5 there instead of 2.5 or 3 or more you  
> might see without inodes.

Okay, so suppose that the ordering is identical.  With no inodes, each
entry effectively costs sizeof(double) + sizeof(int) (= 12 bytes).  With
inodes (block size 4, so one column index is shared by 4 rows), it's
sizeof(double) + 1/4*sizeof(int) (= 9 bytes).  With BAIJ, it's
sizeof(double) + 1/16*sizeof(int) (= 8.25 bytes).
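The arithmetic above can be checked with a small sketch (the block size
of 4 and the 8-byte double / 4-byte int sizes are the assumptions behind
the 1/4 and 1/16 factors):

```python
# Back-of-the-envelope storage cost per stored nonzero.
# Assumption: inode/block size 4, so AIJ stores one int per entry,
# inodes share one column index across 4 rows, and BAIJ stores one
# block-column index per 4x4 block of entries.

SIZEOF_DOUBLE = 8
SIZEOF_INT = 4
BS = 4  # inode/block size

aij   = SIZEOF_DOUBLE + SIZEOF_INT            # 12 bytes per nonzero
inode = SIZEOF_DOUBLE + SIZEOF_INT / BS       # 9.0 bytes per nonzero
baij  = SIZEOF_DOUBLE + SIZEOF_INT / BS**2    # 8.25 bytes per nonzero

print(aij, inode, baij)  # 12 9.0 8.25
```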

If the ordering is different, e.g. [u0,u1,...,v0,v1,...] as seems
popular for unknown reasons, then cache reuse of the vector goes out the
window and it's going to be really bad.

> In addition all the "extra" column entries are still stored in the
> a->j array. Thus when we move to each new set of rows we skip over
> those entries, thus partial cache lines of a->i are constantly
> wasted.

So a nontrivial matrix with bs=4 will have over 100 nonzeros per row,
thus moving to the next block involves skipping 300*sizeof(int) = 1200
bytes.  This is more than 8 cache lines ahead, so it's almost guaranteed
to be a completely cold miss at all levels, which means roughly 250
clocks to memory.  While processing the block of rows, we are using
around 400 doubles that also have to come all the way from memory, at a
bit under 4 clocks each (assuming a fully saturated bus).  So the misses
due only to stepping over these column indices could cost over 15% if
everything else was running smoothly (a big assumption, I know).
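For concreteness, the estimate works out as follows (the 250-clock miss
latency and 4 clocks per double are the assumed machine parameters from
the text, not measured values):

```python
# Rough cost of the cold miss taken when advancing past the redundant
# column indices at the start of the next inode block.
# Assumptions: 100 nonzeros/row, inode size 4, 64-byte cache lines,
# 250 clocks for a miss to memory, ~4 clocks per streamed double.

SIZEOF_INT = 4
nnz_per_row = 100                  # "over 100 nonzeros per row"
bs = 4                             # inode/block size

# Indices skipped: 3 redundant copies of each row's column indices.
skip_bytes = (bs - 1) * nnz_per_row * SIZEOF_INT   # 1200 bytes
cache_lines_skipped = skip_bytes // 64             # 18 lines, > 8

miss_clocks = 250                  # assumed cold-miss latency
doubles_per_block = bs * nnz_per_row               # ~400 doubles
clocks_per_double = 4              # assumed, saturated memory bus

overhead = miss_clocks / (doubles_per_block * clocks_per_double)
print(f"{overhead:.1%}")           # 15.6%
```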

Add to that the fact that a stream of matrix entries steps over a 4 kB
page boundary (which hardware prefetch doesn't cross) four times as
often as with BAIJ, where the entries are a single contiguous stream,
and that the hardware prefetcher only follows one stream per 4 kB page.
So with this naive analysis, it seems possible for deficient hardware
prefetch to be responsible for a nontrivial amount of the observed
difference between Inode and BAIJ.

Nobody's sparse mat-vecs are especially close to saturating the memory
bus with useful stuff, even for matrices that are good at reusing the
vector.  So there must be more to the story than pure bandwidth and
reuse of the vector.

Jed

