[petsc-users] Question about matrix permutation
Barry Smith
bsmith at mcs.anl.gov
Fri Jan 29 16:52:51 CST 2010
Jed,
It is sometimes possible to "turn hardware prefetching off",
possibly with Intel compiler options. Take something like a PETSc
matrix-vector product, compile with it turned on and turned off, and
see if you get a difference in flop rates. I owe you a beer if the
difference is more than, say, 2 percent.
Barry
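[As a concrete way to run this experiment: on many Intel CPUs the hardware
prefetchers are controlled by MSR 0x1A4 rather than by compiler options, and
can be toggled with the msr-tools package. A hedged sketch only; the bit
layout is model-specific, so check Intel's documentation for your CPU first.]

```shell
# Sketch, not a recipe: toggling Intel hardware prefetchers via MSR 0x1A4.
# Requires root and the msr-tools package; bit meanings vary by
# microarchitecture, so verify against Intel's documentation first.
sudo modprobe msr

# Low 4 bits (on many recent Intel parts): L2 HW prefetcher, L2
# adjacent-line prefetcher, DCU streamer, DCU IP prefetcher.
sudo wrmsr -a 0x1a4 0xf    # disable all four, on all cores

# ... run the matrix-vector benchmark here and note the flop rate ...

sudo wrmsr -a 0x1a4 0x0    # restore the prefetchers
```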
On Jan 29, 2010, at 3:58 PM, Jed Brown wrote:
> On Fri, 29 Jan 2010 14:57:51 -0600, Barry Smith <bsmith at mcs.anl.gov>
> wrote:
>> This is why I have the 1.5 there instead of the 2.5 or 3 or more you
>> might see without inodes.
>
> Okay, so suppose that the ordering is identical. With no inodes, each
> entry effectively costs sizeof double + sizeof int (= 12 bytes). With
> inodes, it's sizeof double + 1/4 * sizeof int (= 9). With BAIJ, it's
> sizeof double + 1/16 * sizeof int (= 8.25).
>
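[The per-entry figures above can be checked directly; a minimal sketch
assuming 8-byte doubles, 4-byte ints, inodes of size 4, and BAIJ block
size 4, as in the discussion:]

```python
# Bytes moved per stored nonzero under each format, using the sizes
# assumed above (8-byte double, 4-byte int).
sizeof_double, sizeof_int = 8, 4

aij   = sizeof_double + sizeof_int        # AIJ: one column index per entry
inode = sizeof_double + sizeof_int / 4    # inodes: indices shared by 4 rows
baij4 = sizeof_double + sizeof_int / 16   # BAIJ bs=4: one index per 4x4 block

print(aij, inode, baij4)   # 12 9.0 8.25
```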
> If the ordering is different, e.g. [u0,u1,...,v0,v1,...], as seems
> popular for unknown reasons, then cache reuse of the vector goes out
> the window and it's going to be really bad.
>
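[To make the ordering point concrete, a small illustrative sketch (not
PETSc code; the node count and component names are hypothetical) of where
the two components of one grid node land in the vector under each
ordering:]

```python
# Hypothetical 2-component problem (u, v) on 1000 nodes.
n, ncomp = 1000, 2

def interleaved(node, comp):
    # [u0, v0, u1, v1, ...]: components of a node are adjacent.
    return node * ncomp + comp

def segregated(node, comp):
    # [u0, u1, ..., v0, v1, ...]: components of a node are n entries apart.
    return comp * n + node

# Byte distance between u_i and v_i for 8-byte doubles:
print((interleaved(0, 1) - interleaved(0, 0)) * 8)  # 8: same cache line
print((segregated(0, 1) - segregated(0, 0)) * 8)    # 8000: different page
```

With the interleaved ordering, a block row that touches node i pulls both
components of x in one cache line; segregated storage turns that into two
widely separated accesses.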
>> In addition, all the "extra" column entries are still stored in the
>> a->j array. Thus when we move to each new set of rows we skip over
>> those entries, so partial cache lines of a->j are constantly wasted.
>
> So a nontrivial matrix with bs=4 will have over 100 nonzeros per row,
> so moving to the next block involves skipping 300*sizeof int. This is
> more than 8 lines ahead, so it's almost guaranteed to be a completely
> cold cache miss at all levels, which means 250 clocks to memory. While
> processing the block row, we are using around 400 doubles that also
> need to come all the way from memory, and cost a bit under 4 clocks
> each (assuming a fully saturated bus). So the miss due only to
> stepping over these column indices could cost over 15% if everything
> else were running smoothly (big assumption, I know).
>
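[A back-of-envelope check of that 15% figure, using only the numbers
quoted above (100 nnz/row, 4-row inodes, a 250-clock cold miss, ~4 clocks
per streamed double):]

```python
nnz_per_row = 100          # nonzeros per row of the hypothetical bs=4 matrix
sizeof_int = 4
clocks_per_miss = 250      # cold cache miss all the way to memory
clocks_per_double = 4      # per streamed double on a saturated bus

skipped = 3 * nnz_per_row * sizeof_int   # unused index bytes per inode (3 of 4 rows)
useful = 4 * nnz_per_row                 # ~400 doubles per block of 4 rows
work = useful * clocks_per_double        # ~1600 clocks of useful streaming
overhead = clocks_per_miss / work

print(skipped, overhead)   # 1200 0.15625  -> "over 15%"
```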
> Add to that the fact that a stream of matrix entries steps over a 4 kB
> page boundary (which hardware prefetch doesn't cross) four times as
> often as with BAIJ, where the entries are a single contiguous stream,
> and that the hardware prefetcher only follows one stream per 4 kB
> page. So with this naive analysis, it seems possible for deficient
> hardware prefetch to be at fault for a nontrivial amount of the
> observed difference between Inode and BAIJ.
>
> Nobody's sparse mat-vecs are especially close to saturating the memory
> bus with useful stuff, even for matrices that are good at reusing the
> vector. So there must be more to the story than pure bandwidth and
> reuse of the vector.
>
> Jed