<div class="gmail_quote">On Fri, Nov 25, 2011 at 09:02, Barry Smith <span dir="ltr">&lt;<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div id=":8yx">This is a continuation of the thread started by Matt, in which he stated he wanted to use "vectorization" as the low-level common kernels and I asked him for syntax (for example CUDA, OpenCL, ...). You are talking about something completely different that is not relevant to that thread.<br>

</div></blockquote></div><br><div>Okay, I misunderstood your question, but I think both steps are relevant. The distribution of work between threads isn't computed with the same methodology as the distribution between processes (certainly not in the case of CUDA). I think we will always have some pattern of three steps: applying a transformation to the data (which may or may not involve network communication), applying a vectorized operation, and reducing the result. I don't have concrete suggestions for syntax or even primitives for this multi-level case, but I think the communication can probably be described by adding a partition or coloring to the CPU-based communication ideas we were discussing in the other thread.</div>
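<div>To make the transform / vectorized operation / reduce pattern a bit more concrete, here is a rough sketch in plain Python. It is only an illustration under my own assumptions, not proposed syntax: every name in it (gather, kernel, color_chunks, scatter_add, reduce_chunks) is hypothetical and none of it is an existing PETSc, CUDA, or OpenCL interface. The "coloring" is a simple greedy one, chosen so that chunks of the same color touch disjoint output entries and could be written by different threads without conflicts.</div>

```python
# Illustrative sketch only: all function names here are hypothetical.

def gather(x, index_sets):
    """Transformation step: copy each chunk's inputs out of the global vector."""
    return [[x[i] for i in idx] for idx in index_sets]


def kernel(chunk):
    """'Vectorized' step: the same operation applied to every entry of a chunk."""
    return [v * v for v in chunk]


def color_chunks(index_sets):
    """Greedy coloring: chunks that share an index receive different colors."""
    colors = []
    for idx in index_sets:
        used = set()
        for j, other in enumerate(index_sets[:len(colors)]):
            if not set(idx).isdisjoint(other):
                used.add(colors[j])
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors


def scatter_add(n, index_sets, values, colors):
    """Accumulate chunk results into a global vector, one color at a time.

    Within one color the chunks write to disjoint entries, so that inner
    loop is the part that could be handed to concurrent threads.
    """
    y = [0] * n
    for c in sorted(set(colors)):
        for idx, vals, col in zip(index_sets, values, colors):
            if col == c:
                for i, v in zip(idx, vals):
                    y[i] += v
    return y


def reduce_chunks(y):
    """Reduction step: combine the accumulated entries into one number."""
    return sum(y)


# Overlapping chunks, similar to elements of a 1D mesh sharing end nodes.
x = [1, 2, 3, 4]
index_sets = [[0, 1], [1, 2], [2, 3]]

colors = color_chunks(index_sets)
values = [kernel(chunk) for chunk in gather(x, index_sets)]
y = scatter_add(len(x), index_sets, values, colors)
total = reduce_chunks(y)
```

<div>The same index sets describe both the gather and the scatter, so the coloring is the only extra piece of information added on top of a process-level communication description. How such a coloring would actually be computed or expressed for a GPU is exactly the part I don't have a concrete suggestion for.</div>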