[petsc-dev] Maintaining row alignment in matrices (especially Inode and odd-sized (S)BAIJ)

Wed Oct 6 19:04:43 CDT 2010

Looking at assembly generated from the Inode kernels, I see that it does not
use packed instructions within the blocks.  I tried both gcc-4.5.1 and
icc-11.0.081 at -O3 (the latter took 3 minutes 40 seconds to compile
inode.c), but neither generated packed instructions.  Aron and John (cc'd)
see similar effects on Blue Gene.  The reason is that the input arrays may
not be aligned, and most of the packed instructions (except movups/movupd)
require 16-byte alignment; the situation is similar on BG.  The code size
needed to check alignment and dispatch to a kernel that makes only valid
alignment assumptions would be enormous, so the compiler does not do it.

This is not a huge deal on x86-64 since the operation is mostly memory
limited anyway, but it would be nice to have the ability to specify an
alignment to be guaranteed at the beginning of each row.  The situation is
quite different on Blue Gene where peak bandwidth can only be obtained with
(aligned) 16-byte loads into the packed registers.  Also, Intel/AMD will add
AVX next year, which has 32-byte packed registers.  So it would be good if
the matrix kernels could support alignment constraints on the row starts
(padding out odd row lengths).

I think it should be a runtime option rather than compiled in because, e.g.
a 5-point stencil would need to be padded out to 8 with single precision or
with double+AVX, and a 9-point stencil would be padded to 16 with
single+AVX.  A simulation that solved a light 2D problem coupled to a heavy
3D problem (maybe on a smaller domain, or with less stiff time scales) would
suffer from having the choice compiled in.
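
The padding arithmetic above can be sketched as follows.  This is only an
illustration: padded_row_length is a hypothetical helper, not an existing
PETSc function, and it assumes the alignment is a multiple of the scalar
size and that the first row already starts on an aligned address.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PETSc): number of stored entries for a
 * row with nnz nonzeros once it is padded so that the following row starts
 * on an align_bytes boundary.  Assumes align_bytes is a multiple of
 * scalar_bytes and that the current row itself starts aligned. */
static size_t padded_row_length(size_t nnz, size_t scalar_bytes, size_t align_bytes)
{
  size_t per_vector = align_bytes / scalar_bytes; /* scalars per packed register */
  return (nnz + per_vector - 1) / per_vector * per_vector; /* round up */
}
```

With 16-byte SSE alignment and double precision, the same 5-point stencil
would only need to be padded to 6, which is why a single compiled-in choice
fits some problems poorly.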

The Inode kernels could then be specialized for aligned row starts and
regular row lengths.  I could outfit an aligned MatMult_SeqAIJ_Inode with
SSE kernels in under an hour, so I don't think that is a huge time
investment.  Aron and John are looking at sparse kernels on Blue Gene, where
alignment is perhaps more important; it sounds like they would be able to
contribute a couple of Blue Gene kernels.
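
To give a rough idea of what an aligned kernel could look like, here is a
minimal sketch of a single-row product in double precision with SSE2.  It is
not the Inode kernel and not existing PETSc code: row_dot_aligned and its
arguments are hypothetical, and it assumes that vals is 16-byte aligned,
that padded_len is even, and that padding entries store 0.0 together with a
valid column index.  A scalar fallback is used where SSE2 is unavailable.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical kernel: returns sum_k vals[k] * x[cols[k]] for one padded row.
 * Assumptions: vals is 16-byte aligned, padded_len is even, and padding
 * entries hold the value 0.0 with a valid column index. */
static double row_dot_aligned(const double *vals, const int *cols,
                              size_t padded_len, const double *x)
{
#if defined(__SSE2__)
  __m128d acc = _mm_setzero_pd();
  double  lanes[2];
  size_t  k;
  for (k = 0; k < padded_len; k += 2) {
    __m128d a = _mm_load_pd(vals + k);                  /* aligned packed load */
    __m128d b = _mm_set_pd(x[cols[k + 1]], x[cols[k]]); /* gather two x entries */
    acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
  }
  _mm_storeu_pd(lanes, acc);
  return lanes[0] + lanes[1]; /* horizontal sum of the two lanes */
#else
  double sum = 0.0;
  size_t k;
  for (k = 0; k < padded_len; k++) sum += vals[k] * x[cols[k]];
  return sum;
#endif
}
```

Because every row is padded to an even length and starts aligned, the loop
needs no remainder handling and no runtime alignment check.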

I think it's also straightforward on the allocation front, but I don't know
if it would be complicated to make the factorization kernels handle the
padding.  Are there deep assumptions about unpadded rows that would be
difficult to remove?

Jed