[petsc-dev] Maintaining row alignment in matrices (especially Inode and odd-sized (S)BAIJ)

Wed Oct 6 20:02:25 CDT 2010

On Oct 6, 2010, at 7:56 PM, Aron Ahmadia wrote:

> Thanks Barry and Jed,
> 
> This makes sense.
> 
> On a slightly separate note.  Barry, can we always guarantee (or at
> least forbid the users from breaking) no-aliasing between PETSc
> vectors and matrices?  I know matmult and matmultadd forbid aliased
> vectors, but nothing in PETSc prevents you from doing something silly
> like stuffing the same buffer address into multiple vectors.

   No, but users rarely set that space themselves anyways.

   If you want to be paranoid, the asserts that check the vectors (in MatMult, for example) can also check that the buffers in the vectors are different. Trivial to add.

  For example,
  if (x == y) SETERRQ(((PetscObject)mat)->comm,PETSC_ERR_ARG_WRONGSTATE,"x and y must be different vectors");
  if (*((PetscScalar **)x->data) == *((PetscScalar **)y->data)) SETERRQ(((PetscObject)mat)->comm,PETSC_ERR_ARG_WRONGSTATE,"x and y must not share the same array");

but wrap it up in a nice macro.
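
A minimal, standalone sketch of what such a macro could look like. It does not use PETSc: DemoVec, DemoCheckNotAliased, DEMO_ERR_ALIASED and demo_mult are hypothetical names invented for illustration, and a real version would report the error through SETERRQ and reach the array the way the Vec implementation requires. Like the check above, it only catches buffers with the same start address, not partially overlapping ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vector: a length plus a raw buffer.
   This is NOT the PETSc Vec layout, only an illustration. */
typedef struct {
  size_t  n;
  double *array;
} DemoVec;

#define DEMO_ERR_ALIASED 1 /* hypothetical error code */

/* Return DEMO_ERR_ALIASED from the enclosing function when x and y are the
   same object or when their buffers start at the same address. */
#define DemoCheckNotAliased(x, y)                                          \
  do {                                                                     \
    if ((x) == (y) || (x)->array == (y)->array) return DEMO_ERR_ALIASED;   \
  } while (0)

/* Toy kernel (y = 2*x) that refuses aliased arguments. */
static int demo_mult(const DemoVec *x, DemoVec *y)
{
  DemoCheckNotAliased(x, y);
  assert(x->n == y->n);
  for (size_t i = 0; i < y->n; i++) y->array[i] = 2.0 * x->array[i];
  return 0;
}
```

Wrapping the test in do { } while (0) lets the macro be used like a statement, including inside an unbraced if/else.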

   Barry

> 
> A
> 
> On Wed, Oct 6, 2010 at 8:50 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>>   Make a whole new subclass of SeqAIJ (parallel to Inode) that does all this cool stuff and copies into the new aligned data structures, rather than keeping the data in the same data structure as the current Inode does.  We'll just have to get the factorization stuff to work eventually, once you show a good performance gain for MatMult_SeqAIJ_AlignedInode().
>> 
>> 
>>   Barry
>> 
>> On Oct 6, 2010, at 7:04 PM, Jed Brown wrote:
>> 
>>> Looking at assembly generated from the Inode kernels, I see that it does not use packed instructions within the blocks.  I tried both gcc-4.5.1 and icc-11.0.081 at -O3; the latter took 3 minutes 40 seconds to compile inode.c, but neither generated packed instructions.  Aron and John (cc'd) see similar effects on Blue Gene.  The reason is that the input arrays may not be aligned, and most of the packed instructions (except movups/d) require 16-byte alignment; the situation is similar on BG.  The code size to check and dispatch to a kernel that makes only valid alignment assumptions would be enormous, so the compiler does not do it.
>>> 
>>> This is not a huge deal on x86-64 since the operation is mostly memory limited anyway, but it would be nice to have the ability to specify an alignment to be guaranteed at the beginning of each row.  The situation is quite different on Blue Gene where peak bandwidth can only be obtained with (aligned) 16-byte loads into the packed registers.  Also, Intel/AMD will add AVX next year which has 32-byte packed registers.  So it would be good if the matrix kernels could support alignment constraints on the row starts (padding out odd row lengths).
>>> 
>>> I think it should be a runtime option rather than compiled in because, e.g. a 5-point stencil would need to be padded out to 8 with single precision or with double+AVX, and a 9-point stencil would be padded to 16 with single+AVX.  A simulation that solved a light 2D problem coupled to a heavy 3D problem (maybe on a smaller domain, or with less stiff time scales) would suffer from having the choice compiled in.
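
The padding arithmetic in the paragraph above can be sketched as a small helper. The function name padded_row_length and its parameters are hypothetical and not part of PETSc; the only assumption is that a row is padded up to a whole number of packed registers.

```c
#include <assert.h>
#include <stddef.h>

/* Round a row length up to a whole number of packed registers.
   row_len      : number of stored entries in the row
   scalar_bytes : size of one entry (4 for single, 8 for double precision)
   align_bytes  : packed register width (16 for SSE, 32 for AVX) */
static size_t padded_row_length(size_t row_len, size_t scalar_bytes, size_t align_bytes)
{
  assert(scalar_bytes > 0 && align_bytes % scalar_bytes == 0);
  size_t per_reg = align_bytes / scalar_bytes; /* entries per register */
  return (row_len + per_reg - 1) / per_reg * per_reg;
}
```

With 4 entries per register (single precision with SSE, or double with AVX) a 5-point stencil row becomes 8 entries, and with 8 entries per register (single with AVX) a 9-point stencil row becomes 16. Because the result depends on precision and register width, not just on the stencil, it is a natural candidate for a runtime parameter.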
>>> 
>>> The Inode kernels could then be specialized for aligned row starts and regular row lengths.  I could outfit an aligned MatMult_SeqAIJ_Inode with SSE kernels in under an hour, so I don't think that is a huge time investment.  Aron and John are looking at sparse kernels on Blue Gene, where alignment is perhaps more important, and it sounds like they would be able to contribute a couple of Blue Gene kernels.
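
A rough sketch of what aligned row starts buy a kernel, in plain C rather than SSE intrinsics. It is not the SeqAIJ or Inode data structure: PaddedMat, padded_create and padded_mult are hypothetical, and for brevity every row is padded to one common stride (real AIJ rows would each be padded separately). Padding entries hold value 0 and column 0, so they contribute nothing as long as x[0] is finite.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical padded storage: every row occupies `stride` entries. */
typedef struct {
  int     nrows;
  int     stride; /* padded row length, a multiple of 2 doubles (16 bytes) */
  double *val;    /* nrows*stride values, 16-byte aligned, zero padded */
  int    *col;    /* matching column indices; padding points at column 0 */
} PaddedMat;

/* Allocate zeroed storage whose row starts are all 16-byte aligned. */
static int padded_create(PaddedMat *A, int nrows, int max_row_len)
{
  assert(nrows > 0 && max_row_len > 0);
  A->nrows  = nrows;
  A->stride = (max_row_len + 1) / 2 * 2; /* round up to an even count */
  size_t n  = (size_t)nrows * (size_t)A->stride;
  A->val = aligned_alloc(16, n * sizeof(double)); /* C11; size is a multiple of 16 */
  A->col = calloc(n, sizeof(int));
  if (!A->val || !A->col) { free(A->val); free(A->col); return 1; }
  memset(A->val, 0, n * sizeof(double));
  return 0;
}

/* y = A*x. Each row is processed in pairs with no remainder loop, which is
   the shape a packed (2-wide) kernel would rely on. */
static void padded_mult(const PaddedMat *A, const double *x, double *y)
{
  for (int i = 0; i < A->nrows; i++) {
    const double *v = A->val + (size_t)i * A->stride;
    const int    *c = A->col + (size_t)i * A->stride;
    double s0 = 0.0, s1 = 0.0;
    for (int j = 0; j < A->stride; j += 2) {
      s0 += v[j] * x[c[j]];
      s1 += v[j + 1] * x[c[j + 1]];
    }
    y[i] = s0 + s1;
  }
}
```

The gathers from x are still unaligned in general; only the loads of the matrix values benefit from the guaranteed row alignment.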
>>> 
>>> I think it's also straightforward on the allocation front, but I don't know if it would be complicated to make the factorization kernels handle the padding.  Are there deep assumptions about unpadded storage that would be difficult to remove?
>>> 
>>> Jed
>> 
>>