<div dir="ltr">Could you elaborate a bit on what you mean by packing aligned representations at some granularity? I thought this was what the AoSoA configuration does: packing variables at the aligned SIMD width. Or do you mean loop blocking, with each block fitting into the L1 cache?<br>

</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Nov 23, 2013 at 3:48 PM, Jed Brown <span dir="ltr">&lt;<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">Mani Chandra &lt;<a href="mailto:mc0710@gmail.com">mc0710@gmail.com</a>&gt; writes:<br>

<br>

> Hi,<br>

><br>

> Is it possible to use an Array of Structs of Arrays (AoSoA) configuration<br>

> using DMDAs? Something like<br>

><br>

> struct node {<br>

>   float var1[16], var2[16], var3[16];<br>

> };<br>

<br>

</div>Yes, you can manually manage this dimension/chunking, and use<br>

DMDASetBlockFills() so that the resulting matrix retains proper<br>

sparsity.  Neighbor exchange will not automatically understand the<br>

blocks, and you would have to use a different fringe layout if you want<br>

to organize data as AoSoA.<br>

<div class="im"><br>

> Instead of<br>

><br>

> struct node {<br>

>   float var1, var2, var3;<br>

> };<br>

><br>

> as is the usual way of using DMDAs.<br>

><br>

> The global grid size of, say, a 2D grid would then decrease from NxN to (N/16)xN.<br>

><br>

> I'm interested in doing this for ease of vectorization as described in<br>

> <a href="http://software.intel.com/en-us/articles/memory-layout-transformations" target="_blank">http://software.intel.com/en-us/articles/memory-layout-transformations</a><br>

<br>

</div>Note that sparse iterative methods are overwhelmingly limited by memory<br>

bandwidth rather than vectorization, so you are unlikely to see a speedup here.<br>

Heavy optimization of stencil operations requires either unaligned loads<br>

or a "roll" operation, at which point the benefit over register<br>

transposition fades.  So instead of trying to change the global memory<br>

alignment, I recommend packing aligned representations at whichever<br>

granularity makes sense (in registers, in L1-cache tiles, etc.).  Make<br>

sure to benchmark the real memory access patterns before leaping to<br>

conclusions about optimal memory layout.<br>

</blockquote></div><br></div>
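<div dir="ltr">One possible reading of "packing aligned representations at whichever granularity makes sense" is sketched below in plain C, without any PETSc calls. The global array keeps the usual interleaved layout, and only a small temporary tile is repacked into per-variable arrays before a unit-stride kernel runs on it. The names <code>Node</code>, <code>Tile</code>, <code>TILE</code>, <code>pack_tile</code>, <code>kernel_tile</code>, <code>unpack_tile</code> and <code>apply_tiled</code> are hypothetical, the tile width of 16 is taken from the struct in the question, and the pointwise kernel is only a placeholder for real work. This is an illustration under those assumptions, not necessarily what was meant in the thread.<br></div>

```c
#include <stddef.h>

#define TILE 16 /* hypothetical tile width, taken from the 16 in the question */

/* Usual interleaved layout (AoS): one struct per grid node. */
typedef struct {
  float var1, var2, var3;
} Node;

/* Temporary per-variable (SoA) tile, small enough to stay in L1 or registers. */
typedef struct {
  float var1[TILE], var2[TILE], var3[TILE];
} Tile;

/* Gather n <= TILE interleaved nodes into contiguous per-variable arrays. */
static void pack_tile(const Node *x, size_t n, Tile *t)
{
  for (size_t i = 0; i < n; i++) {
    t->var1[i] = x[i].var1;
    t->var2[i] = x[i].var2;
    t->var3[i] = x[i].var3;
  }
}

/* Scatter the tile back into the interleaved global layout. */
static void unpack_tile(const Tile *t, size_t n, Node *x)
{
  for (size_t i = 0; i < n; i++) {
    x[i].var1 = t->var1[i];
    x[i].var2 = t->var2[i];
    x[i].var3 = t->var3[i];
  }
}

/* Placeholder pointwise kernel: var3 += var1 * var2, on unit-stride arrays
   that a compiler can vectorize without strided loads. */
static void kernel_tile(Tile *t, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    t->var3[i] += t->var1[i] * t->var2[i];
  }
}

/* Apply the kernel to n nodes one tile at a time; the global array stays AoS. */
void apply_tiled(Node *x, size_t n)
{
  for (size_t start = 0; start < n; start += TILE) {
    size_t len = (n - start < TILE) ? (n - start) : TILE;
    Tile t;
    pack_tile(x + start, len, &t);
    kernel_tile(&t, len);
    unpack_tile(&t, len, x + start);
  }
}
```

<div dir="ltr">Calling <code>apply_tiled(x, n)</code> on an array of <code>n</code> nodes leaves the global layout untouched, so the DMDA and its neighbor exchange would not need to know about the tiles. Whether the pack and unpack cost is repaid depends on how much work the kernel does per node, which is why benchmarking the real access pattern comes first.<br></div>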