[petsc-users] Question about matrix permutation

Sun Jan 31 21:34:21 CST 2010

   Move this to petsc-dev at mcs.anl.gov

On Jan 31, 2010, at 4:44 PM, Jed Brown wrote:

> On Sun, 31 Jan 2010 16:02:17 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>   Check config/PETSc/Configure.py configurePrefix() for the current
>> support in PETSc. It is used, for example, in
>> src/mat/impls/sbaij/seq/relax.h; you may have found better ways of
>> doing this, so feel free to change the current support if you have
>> something better.
>
> Aha, but PETSC_Prefetch (naming violation?)

    Yes, it uses the config addDefine() method, which prepends the silly
PETSC_; I was too lazy to figure out a nicer way to do it.

> just wraps
> __builtin_prefetch or _mm_prefetch, which means it's only fetching one
> cache line of the preceding row.  From what I can tell, it's bad to
> software prefetch part of a row and rely on hardware prefetch to pick up
> the rest.  If you're using software prefetch for a particular access
> pattern, you want to ask for exactly what you need, which normally means
> calling prefetch more than once.  The Intel optimization manual says
> that you should overlap the prefetch calls with computation (because the
> prefetch instruction occupies the same execution unit as loads) but
> since we're so far beyond actually being CPU bound, I don't think there
> is a real penalty to issuing several prefetch instructions at once and
> then going to work on the block that should already be in cache while
> the prefetch results trickle in.

    Just doing that one gave me a surprisingly large jump in performance,
so I didn't pursue it further.  As you may have guessed, I'm pretty
pessimistic about the whole business :-(
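
For concreteness, here is a minimal sketch of the "ask for exactly what you need" idea from the quoted text: one prefetch per cache line covering the whole of the next row, issued before working on the current row. It is not PETSc code. The names sketch_prefetch_range and sketch_csr_matmult are hypothetical, the matrix is assumed to be in plain CSR format, and the 64-byte line size is an assumed placeholder since nothing in this thread detects it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes (a placeholder, not a detected value). */
#define SKETCH_CACHE_LINE 64u

/* Hypothetical helper, not PETSc API: issue one prefetch per cache line
   touched by the byte range [start, start + nbytes). */
static void sketch_prefetch_range(const void *start, size_t nbytes)
{
  uintptr_t addr, end;

  if (nbytes == 0) return;
  /* Round down so a range that straddles a line boundary is fully covered. */
  addr = (uintptr_t)start & ~(uintptr_t)(SKETCH_CACHE_LINE - 1);
  end  = (uintptr_t)start + nbytes;
  for (; addr < end; addr += SKETCH_CACHE_LINE) {
#if defined(__GNUC__)
    /* rw = 0 (read), locality = 0 (no temporal locality). */
    __builtin_prefetch((const void *)addr, 0, 0);
#endif
  }
}

/* y = A*x for a CSR matrix. Before computing row i, request all of the
   column indices and values of row i+1. */
static void sketch_csr_matmult(int nrows, const int *rowptr, const int *colidx,
                               const double *val, const double *x, double *y)
{
  int i, k;

  assert(nrows >= 0);
  for (i = 0; i < nrows; i++) {
    double sum = 0.0;

    if (i + 1 < nrows) {
      size_t next_nz = (size_t)(rowptr[i + 2] - rowptr[i + 1]);
      sketch_prefetch_range(colidx + rowptr[i + 1], next_nz * sizeof(int));
      sketch_prefetch_range(val + rowptr[i + 1], next_nz * sizeof(double));
    }
    for (k = rowptr[i]; k < rowptr[i + 1]; k++) sum += val[k] * x[colidx[k]];
    y[i] = sum;
  }
}
```

Whether the extra prefetch calls pay off would have to be measured; the sketch only shows the access pattern, with all prefetches for the next row issued up front rather than interleaved with the arithmetic.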

>
> Also, I'm pretty sure we want to be using _MM_HINT_NTA (0) instead of
> _MM_HINT_T2 (1) since the latter brings the values into all levels of
> cache.  Since it's the next row that we'll work with, it's very rare
> (i.e. only matrices with an absurd number of nonzeros per row) that the
> row would actually be evicted from L1 before we get to it.  We don't
> want to pollute the higher levels because we want as much as possible
> for the vector.  (At least, using _MM_HINT_T2 took several percent off
> the MatMult flops, and I would anticipate the same effect for MatSolve
> and MatSOR.)

    I didn't have a clue what to use for this option.
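
A rough sketch of how the hint could be made a single compile-time choice, so that switching between _MM_HINT_NTA and _MM_HINT_T2 is a one-line change when benchmarking. The SKETCH_* macros and sketch_stream_sum are hypothetical names, not the PETSC_Prefetch definition. Which cache levels each hint actually fills depends on the processor, so the hint is best treated as a tunable rather than a settled answer.

```c
#include <assert.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#  include <xmmintrin.h>
   /* The hint argument of _mm_prefetch must be a compile-time constant. */
#  define SKETCH_HINT_NTA _MM_HINT_NTA
#  define SKETCH_HINT_T2  _MM_HINT_T2
#  define SKETCH_PREFETCH(addr, hint) _mm_prefetch((const char *)(addr), (hint))
#elif defined(__GNUC__)
   /* Non-x86 GCC/Clang: map onto the locality argument of the builtin. */
#  define SKETCH_HINT_NTA 0
#  define SKETCH_HINT_T2  1
#  define SKETCH_PREFETCH(addr, hint) __builtin_prefetch((addr), 0, (hint))
#else
   /* No known prefetch primitive: compile to nothing. */
#  define SKETCH_HINT_NTA 0
#  define SKETCH_HINT_T2  1
#  define SKETCH_PREFETCH(addr, hint) ((void)0)
#endif

/* The one place to change when comparing hints. */
#define SKETCH_STREAM_HINT SKETCH_HINT_NTA

/* Sum an array that is read exactly once (like matrix entries in MatMult),
   prefetching a fixed distance ahead with the selected hint. */
static double sketch_stream_sum(const double *a, int n)
{
  enum { AHEAD = 8 }; /* doubles, i.e. one 64-byte line ahead (assumed) */
  double s = 0.0;
  int    i;

  assert(n >= 0);
  for (i = 0; i < n; i++) {
    if (i % AHEAD == 0 && i + AHEAD < n) {
      SKETCH_PREFETCH(a + i + AHEAD, SKETCH_STREAM_HINT);
    }
    s += a[i];
  }
  return s;
}
```
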
>
> Do we currently detect the cache line size anywhere?

    Nope, another job for ./configure

    Barry

>  I don't know how
> to do that kind of thing portably, though this claims to do it
>
>  http://www.open-mpi.org/projects/hwloc/
>
>
> Jed
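
On the cache line size question, one possible run-time probe is sketched below. It assumes a POSIX system and uses glibc's _SC_LEVEL1_DCACHE_LINESIZE sysconf extension where the header defines it, falling back to an assumed 64 bytes otherwise. It is not a portable answer, which is where a configure test or hwloc would come in. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: best-effort L1 data cache line size in bytes. */
static size_t sketch_cache_line_size(void)
{
  const size_t fallback = 64; /* assumption when the system reports nothing */
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
  long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (n > 0) return (size_t)n; /* some systems return 0 or -1 here */
#endif
  return fallback;
}

/* Upper bound on the cache lines needed for nbytes of contiguous data
   that starts on a line boundary, i.e. how many prefetches a row needs. */
static size_t sketch_lines_spanned(size_t nbytes, size_t line)
{
  assert(line > 0);
  return (nbytes + line - 1) / line;
}
```

For example, a row with 20 nonzeros stored as doubles occupies 160 bytes, so with 64-byte lines sketch_lines_spanned(160, 64) gives 3 prefetch calls for the values alone.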