[petsc-users] Question about matrix permutation

Barry Smith bsmith at mcs.anl.gov
Fri Jan 29 16:52:51 CST 2010


    Jed,

     It is sometimes possible to turn hardware prefetching off,
possibly with Intel compiler options. Take something like a PETSc
matrix-vector product, compile it with prefetching turned on and then
off, and see if there is a difference in flop rates. I owe you a beer
if the difference is more than, say, 2 percent.
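
     Something like the following would do as a harness. This is only
a sketch against the current PETSc API (MatLoad and the options calls
have changed signatures across releases), with error checking dropped
for brevity; 2*nnz flops per product is the usual estimate.

  /* Sketch: time repeated MatMult and report a flop rate.
     Assumes a PETSc binary matrix supplied with -f <file>. */
  #include <petscmat.h>
  #include <petsctime.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, y;
    PetscViewer    viewer;
    char           file[PETSC_MAX_PATH_LEN];
    PetscBool      flg;
    MatInfo        info;
    PetscLogDouble t0, t1;
    PetscInt       i, nmults = 100;

    PetscInitialize(&argc, &argv, NULL, NULL);
    PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetFromOptions(A);                  /* pick format with -mat_type */
    MatLoad(A, viewer);
    PetscViewerDestroy(&viewer);

    MatCreateVecs(A, &x, &y);
    VecSet(x, 1.0);
    MatMult(A, x, y);                      /* one warm-up product */

    PetscTime(&t0);
    for (i = 0; i < nmults; i++) MatMult(A, x, y);
    PetscTime(&t1);

    MatGetInfo(A, MAT_GLOBAL_SUM, &info);  /* info.nz_used = stored nonzeros */
    PetscPrintf(PETSC_COMM_WORLD, "%g Mflop/s\n",
                2.0 * info.nz_used * nmults / (t1 - t0) * 1.0e-6);

    VecDestroy(&x); VecDestroy(&y); MatDestroy(&A);
    PetscFinalize();
    return 0;
  }

Run it once compiled each way (and with -mat_type seqaij versus
seqbaij) and compare the reported rates.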

    Barry

On Jan 29, 2010, at 3:58 PM, Jed Brown wrote:

> On Fri, 29 Jan 2010 14:57:51 -0600, Barry Smith <bsmith at mcs.anl.gov>  
> wrote:
>>    This is why I have the 1.5 there instead of the 2.5 or 3 or more
>> you might see without inodes.
>
> Okay, so suppose that the ordering is identical.  With no inodes, each
> entry effectively costs sizeof(double) + sizeof(int) = 12 bytes.  With
> 4-row inodes, it's sizeof(double) + sizeof(int)/4 = 9 bytes.  With
> BAIJ (bs=4, one column index per 4x4 block), it's sizeof(double) +
> sizeof(int)/16 = 8.25 bytes.
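
The same numbers as a throwaway check (assuming 8-byte doubles,
4-byte ints, and inode/block size 4):

  /* Bytes per stored nonzero under the model above (bs = 4). */
  #include <stdio.h>

  int main(void)
  {
    double bs = 4.0;
    printf("AIJ:   %.2f\n", 8.0 + 4.0);             /* value + own column index: 12.00 */
    printf("Inode: %.2f\n", 8.0 + 4.0 / bs);        /* index shared by bs rows:   9.00 */
    printf("BAIJ:  %.2f\n", 8.0 + 4.0 / (bs * bs)); /* one index per bs*bs block: 8.25 */
    return 0;
  }
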
>
> If the ordering is different, e.g. [u0,u1,...,v0,v1,...], which seems
> popular for unknown reasons, then cache reuse of the vector goes out
> the window and it's going to be really bad.
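
To make the two orderings concrete, a toy sketch (the helper names
are mine, not PETSc's):

  /* Illustrative only: index maps for a 2-field vector over n nodes. */
  #include <stdio.h>

  /* Interlaced [u0,v0,u1,v1,...]: u_i and v_i sit on the same cache
     line, so a matrix row coupling both fields reuses it.            */
  static int u_interlaced(int i)        { return 2 * i; }
  static int v_interlaced(int i)        { return 2 * i + 1; }
  /* Segregated [u0,u1,...,v0,v1,...]: u_i and v_i are n entries
     apart, so the same row touches two far-apart cache lines.        */
  static int u_segregated(int i, int n) { (void)n; return i; }
  static int v_segregated(int i, int n) { return n + i; }

  int main(void)
  {
    int n = 1000, i = 7;
    printf("interlaced: u=%d v=%d (adjacent)\n",
           u_interlaced(i), v_interlaced(i));
    printf("segregated: u=%d v=%d (%d apart)\n",
           u_segregated(i, n), v_segregated(i, n), n);
    return 0;
  }
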
>
>> In addition, all the "extra" column entries are still stored in the
>> a->j array. Thus when we move to each new set of rows we skip over
>> those entries, so partial cache lines of a->j are constantly
>> wasted.
>
> So a nontrivial matrix with bs=4 will have over 100 nonzeros per row,
> thus moving to the next block involves skipping 300*sizeof(int) =
> 1200 bytes.  This is more than 8 cache lines ahead, so it's almost
> guaranteed to be a completely cold miss at all levels, which means
> about 250 clocks to memory.  During the block row, we are using
> around 400 doubles that also need to come all the way from memory and
> cost a bit under 4 clocks each (assuming a fully saturated bus).  So
> the miss due only to stepping over these column indices could cost
> over 15% if everything else was running smoothly (big assumption, I
> know).
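
Spelling out the arithmetic behind that estimate (same model numbers):

  /* Cost model from above: one cold miss vs. useful work per block row. */
  #include <stdio.h>

  int main(void)
  {
    double useful = 400.0 * 4.0; /* ~400 doubles per block row, ~4 clocks each */
    double miss   = 250.0;       /* one cold miss all the way to memory        */
    printf("overhead ~ %.1f%%\n", 100.0 * miss / useful); /* ~15.6% */
    return 0;
  }
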
>
> Add to that the fact that a stream of matrix entries steps over a 4kB
> page boundary (which hardware prefetch doesn't cross) four times as
> often as with BAIJ, where the entries are a single contiguous stream,
> and that the hardware prefetcher only follows one stream per 4kB
> page.  So with this naive analysis, it seems possible for deficient
> hardware prefetch to be at fault for a nontrivial amount of the
> observed difference between Inode and BAIJ.
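
Schematically, the two layouts (generic CSR/block-CSR fields, not
PETSc's exact structs):

  /* CSR (AIJ-like): every nonzero drags a 4-byte column index along,
     so the kernel follows a value stream and an index stream in step. */
  typedef struct {
    int    *i; /* row offsets into j[] and a[], length nrows+1 */
    int    *j; /* one column index per nonzero                 */
    double *a; /* nonzero values                               */
  } CSR;

  /* Block CSR (BAIJ-like, bs=4): one column index per 4x4 block,
     so the value stream runs 16 contiguous doubles per index lookup. */
  typedef struct {
    int    *i; /* block-row offsets, length nrows/4+1          */
    int    *j; /* one column index per block (16 values)       */
    double *a; /* blocks stored contiguously, 16 doubles each  */
  } BSR;
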
>
> Nobody's sparse mat-vecs are especially close to saturating the memory
> bus with useful stuff, even for matrices that are good at reusing the
> vector.  So there must be more to the story than pure bandwidth and
> reuse of the vector.
>
> Jed


