On Sun, Oct 7, 2012 at 9:17 PM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><br></div>

A proper partition should ideally be locality-aware. Instead of the standard GPU approach<br>

  for (size_t row = thread_id; row < size; row += thread_num) {...}<br>

the much better approach on the CPU is to reuse caches (I think this is what you referred to as a contiguous partition):<br>

  for (size_t row = thread_id * row_block_size;<br>

              row < (thread_id+1) * row_block_size;<br>

            ++row){...}<br></blockquote><div><br></div><div>Yes, but in the case of a stencil operation (or matrix), there is reuse between neighbors. If we are on BG/Q, there are 4 hardware threads sharing an L1. Each hardware thread has its own allocation of outstanding memory requests, so even if you are memory bandwidth limited, you have to use the threads. In that case, a good strategy is to block by cores, but then interlace at cache line granularity among the four threads sharing L1. These would naturally get out of phase, but if each thread prefetches for its neighbor (my_rank+1)%4, they stay loosely synchronized, thus sharing bandwidth and L1. You can do the same sharing for L2 and L3 depending on the memory hierarchy.</div>
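<div><br></div><div>To make the three partitions concrete, here is a minimal sketch of the row-to-thread mapping for each. The function names, the 8 rows per cache line (64-byte lines of doubles), and the fixed count of 4 threads per core are assumptions for illustration, not anything in threadcomm.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: 64-byte cache lines holding 8 doubles and
 * 4 hardware threads sharing one L1 (as on BG/Q). */
enum { ROWS_PER_LINE = 8, THREADS_PER_CORE = 4 };

/* Cyclic (GPU-style) partition: consecutive rows go to different threads. */
static size_t owner_cyclic(size_t row, size_t nthreads)
{
  assert(nthreads > 0);
  return row % nthreads;
}

/* Contiguous partition: each thread owns one block of row_block_size rows. */
static size_t owner_block(size_t row, size_t row_block_size)
{
  assert(row_block_size > 0);
  return row / row_block_size;
}

/* Block by core, then interlace cache lines among the hardware threads
 * that share the core's L1. Returns the global hardware thread index. */
static size_t owner_interlaced(size_t row, size_t rows_per_core)
{
  assert(rows_per_core > 0);
  size_t core  = row / rows_per_core;                   /* which core's block */
  size_t line  = (row % rows_per_core) / ROWS_PER_LINE; /* cache line in block */
  size_t local = line % THREADS_PER_CORE;               /* thread within core */
  return core * THREADS_PER_CORE + local;
}
```

<div>With rows_per_core = 64, rows 0-7 go to hardware thread 0, rows 8-15 to thread 1, and so on, wrapping back to thread 0 at row 32; the next core's block starts at row 64 with thread 4.</div>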

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I don't think that one wants to explicitly copy 'ghost elements' in the vector (which would be *required* with MPI); one would rather just delegate this task to the memory controller.<div class="im"><br></div></blockquote>

<div><br></div><div>Yes, but delegating to the memory controller requires that the vector is _contiguous_ in virtual memory. We have to create the whole thing with one malloc and not store separate pointers for each segment. We can keep track of which memory bus each page should be located on (placement follows first-touch; on Linux it can be guaranteed using libnuma).</div>
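<div><br></div><div>A minimal sketch of the "one malloc, first-touch" idea, assuming OpenMP is available for the threaded initialization (without it the loop runs serially). The names owned_range and vec_create_first_touch are hypothetical, and the ranges are not page-aligned here, so a page straddling a boundary lands wherever it is touched first.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Half-open range [*start, *end) owned by thread trank out of nthreads;
 * the first (n % nthreads) threads get one extra entry. */
static void owned_range(size_t n, size_t nthreads, size_t trank,
                        size_t *start, size_t *end)
{
  assert(nthreads > 0 && trank < nthreads);
  size_t base = n / nthreads, rem = n % nthreads;
  *start = trank * base + (trank < rem ? trank : rem);
  *end   = *start + base + (trank < rem ? 1 : 0);
}

/* Allocate the whole vector with one malloc, then let each thread write
 * its own range first so that, under a first-touch policy, those pages
 * end up near the thread that will use them. Returns NULL on failure. */
static double *vec_create_first_touch(size_t n, size_t nthreads)
{
  assert(nthreads > 0);
  double *x = malloc(n * sizeof *x);
  if (!x) return NULL;
  /* schedule(static, 1) assigns iteration t to OpenMP thread t. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static, 1) num_threads((int)nthreads)
#endif
  for (size_t t = 0; t < nthreads; t++) {
    size_t start, end;
    owned_range(n, nthreads, t, &start, &end);
    for (size_t i = start; i < end; i++) x[i] = 0.0;
  }
  return x;
}
```

<div>There is still only one base pointer; the per-thread ranges are recomputed from (n, nthreads, trank) rather than stored as separate segment pointers.</div>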

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


</blockquote>

<br></div>

I agree that explicit copies here are a burden. However, we are not 'required' to copy, as the first-touch policy keeps data at the respective location anyway. The main benefit of keeping a page record would be better thread control: assume you have two idle threads, one on memory link 0, the other on memory link 1. If a job is submitted, the two threads can start working on nearby data first, and then continue with data that was recently used in that thread's cache. I agree that this is rather low-level tinkering, but memory hierarchies are not expected to flatten out, nor are memory links expected to increase in speed substantially.</blockquote>

<div><br></div><div>We're agreeing. I thought you were advocating explicit segmentation of vectors within an MPI process, which I think does not provide useful value and costs significant complexity.</div><div><br></div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

If the matrix bandwidth is not too large, we can aim for similar locality benefits, at least when compared to assigning threads randomly to the individual subtasks.</blockquote><div><br></div><div>Sure, but you need either locks, atomics, or explicit segmentation to avoid conflicting writes.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Can you elaborate on this? It is still a good time to refine our<br>

threadcomm model.<br>

</blockquote>

<br></div>

Sure. I'll first discuss the theoretical model, then go into some practical aspects:<br>

Each OpenCL runtime defines its own platform, consisting of devices the runtime can deal with. If you have SDKs from AMD, NVIDIA, and Intel installed, you will see three platforms. The first will be equipped with AMD GPUs and x86 CPUs, the second will hold NVIDIA GPUs, and the third will hold x86 CPUs only. Within each platform you can create various contexts. A context is a collection of devices from the platform and holds the memory buffers. It is important to note here that memory buffers are not explicitly assigned to devices; rather, they may float around among the devices within the context. Each device may be equipped with an arbitrary number of command queues. Command queues are in-order by default, but can also be operated out-of-order. A job (essentially any operation accessing a buffer) needs to be submitted to a command queue. As command queues are linked to devices, the command queue also defines the device on which to execute the job.<br>


One may also submit 'events' to a command queue, which can then be used for synchronization purposes in order to enforce a certain 'order' in the processing of jobs (different command queues are not sync'd). Recently, they have also introduced the possibility to split up a device in order to reserve resources for e.g. prioritized jobs.<br>


<br>

Now for some practical aspects:<br>

Placing all available devices into a single context can be a bad idea, because the kernels are compiled for all devices within a context. Thus, if e.g. one device does not support double precision, the whole context does not support it. Also, up until OpenCL 1.2 there was no mechanism to manually shift buffers to a device in order to compute at some later stage. For these reasons, I prefer to associate just one device with each context, which effectively yields the CUDA model.<br>

</blockquote><div><br></div><div>Thanks.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Now, comparing this with the current threadcomm, the threadcomm may benefit from multiple job queues. I'm thinking about a similar device-association, i.e. one queue is dedicated to data in RAM, while other queues are dedicated to accelerators. </blockquote>

<div><br></div><div>Hmm, seems like a decision about where the indirection goes. What is the advantage of routing dispatch to a GPU through the threadcomm? The "kernel code" is different in each case so it sounds like it would be much harder to call.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I don't know right now whether there is a mechanism to specify the desired number of workers in threadcomm - if not, this should be added too.<br>

</blockquote><div><br></div><div>My plan was to follow the MPI model and provide comm "split" routines. Note that this is more precise control than "number of workers" since you can isolate independent tasks, each of which is parallel, and guarantee that they run wherever their memory is.</div>
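<div><br></div><div>Roughly what I mean by "split", as a sketch only: threadcomm_split_sketch is a hypothetical helper, not an existing threadcomm routine. It follows the MPI_Comm_split convention that threads passing the same color end up in the same sub-communicator, ordered by their old rank.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PETSc API: given one color per thread rank
 * (as in MPI_Comm_split), compute the rank and size that thread `trank`
 * would have in the sub-communicator formed by threads with its color.
 * Ranks in the new group follow the order of the old ranks. */
static void threadcomm_split_sketch(const int *colors, size_t nthreads,
                                    size_t trank,
                                    size_t *new_rank, size_t *new_size)
{
  assert(trank < nthreads);
  *new_rank = 0;
  *new_size = 0;
  for (size_t t = 0; t < nthreads; t++) {
    if (colors[t] != colors[trank]) continue; /* different sub-comm */
    if (t < trank) (*new_rank)++;             /* same color, lower old rank */
    (*new_size)++;
  }
}
```

<div>If the color is the index of the memory link a thread sits on, each resulting sub-communicator can be handed an independent parallel task that stays next to its memory.</div>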

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

However, with such an extension of threadcomm we should be careful with keeping the job scheduler part as orthogonal to the actual threading kernels operating on data as possible.<div class="im"><br></div></blockquote><div>

 </div><div>Hmm, one reason for the current structure was so that we could build fused operations. For example, we could have</div><div><br></div><div>MatMultFused_kernel(PetscInt trank,Mat A,Vec X,Vec Y,Vec Z,Vec W,const PetscScalar *alpha,PetscThreadReduction dot) {</div>

<div>  MatMult_kernel(trank,A,X,Y);</div><div>  VecPointwiseMult_kernel(trank,Y,Y,Z);</div><div>  VecAYPX_kernel(trank,Y,alpha,X);</div><div>  VecDot_kernel(trank,Y,W,dot);</div><div>}</div><div><br></div><div>Note that coherent access to X is required to start this operation, but there are no other dependencies between the threads. This avoids going through the queue to perform this sequence of operations. It is a feature that the _kernel routines can be called directly from other kernels in addition to being called by submitting them to the queue.</div>

<div><br></div><div><br></div><div>Can you sketch what you have in mind for keeping scheduling orthogonal to the kernel implementations? It seems to me that the local code will really be different, especially for matrix operations like factorization that have inter-thread data dependencies.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"></div>

I agree that a scheduler is bogus for small operations. One may instead grab a thread from the pool and let it iterate in some user function with a bunch of small operations (now scheduler-free!), thus virtually 'packing' operations together.</blockquote>

<div><br></div><div>Our approach to this was to (a) create a threadcomm of size 1 (which could literally be the control thread) or (b) use the "nothread" implementation which just calls the function pointer directly. When compiled without threading, the entire kernel invocation can be inlined all the way to the point that the kernel code (if defined in the same compilation unit) can also be inlined.</div>
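<div><br></div><div>A minimal sketch of what the "nothread" path amounts to. The KernelFn signature, run_kernel_nothread, and the example kernel are hypothetical names for illustration, not the actual threadcomm interface.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: thread rank plus an opaque context. */
typedef int (*KernelFn)(int trank, void *ctx);

/* "nothread" dispatch sketch: no queue, just call the kernel as rank 0.
 * Being static inline, the call (and a kernel defined in the same
 * compilation unit) can be inlined by the compiler. */
static inline int run_kernel_nothread(KernelFn kernel, void *ctx)
{
  assert(kernel != NULL);
  return kernel(0, ctx);
}

/* Example kernel: scale an array in place. With a single thread,
 * rank 0 owns the whole array. */
typedef struct { double *x; size_t n; double alpha; } ScaleCtx;

static int scale_kernel(int trank, void *ctx)
{
  ScaleCtx *s = ctx;
  (void)trank; /* unused: there is only one rank */
  for (size_t i = 0; i < s->n; i++) s->x[i] *= s->alpha;
  return 0;
}
```

<div>A user function that strings together many small operations can call run_kernel_nothread(scale_kernel, &amp;ctx) repeatedly without ever touching a scheduler.</div>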

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


This is why the VecDot kernel does not return a raw result. You can call<br>

VecDot_kernel() collectively from another kernel, but you pass in a<br>

PetscThreadReduction and only reduce it when you need the result.<br>

</blockquote>

<br></div>

Hmm, right, VecDot was a bad example, as it uses reduction... Anyway, feel free to replace VecDot with any other non-collective operation.<br></blockquote><div><br>Okay, but these modify objects. Consider VecAXPY which does not block for completion. My intent was that _incoherent_ access (access only needed to the part owned by each thread) such as needed by a subsequent VecDot would not cause blocking, but _coherent_ vector access (as needed by MatMult) would block on the last modification. Shri and I talked about this a while back and we have pseudocode in emails if not in committed code.</div>
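<div><br></div><div>To illustrate the deferred reduction and why VecDot only needs incoherent access, here is a sketch. ReductionSketch, dot_kernel_sketch, and reduction_end_sketch are hypothetical stand-ins, not the real PetscThreadReduction interface; the point is only that each thread reads its owned range and writes one slot, and the sum is formed later.</div>

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_THREADS = 64 };

/* Hypothetical stand-in for a thread reduction object: one slot per thread. */
typedef struct {
  double partial[MAX_THREADS];
  size_t nthreads;
} ReductionSketch;

/* Each thread computes the dot product over its owned range [start, end)
 * and stores it in its own slot; nothing is returned and nothing blocks. */
static void dot_kernel_sketch(size_t trank, const double *x, const double *y,
                              size_t start, size_t end, ReductionSketch *red)
{
  assert(red->nthreads <= MAX_THREADS && trank < red->nthreads);
  double sum = 0.0;
  for (size_t i = start; i < end; i++) sum += x[i] * y[i];
  red->partial[trank] = sum;
}

/* Called only when the result is needed, after all threads have finished. */
static double reduction_end_sketch(const ReductionSketch *red)
{
  double total = 0.0;
  for (size_t t = 0; t < red->nthreads; t++) total += red->partial[t];
  return total;
}
```

<div>Because dot_kernel_sketch touches only the range owned by trank, it can follow a VecAXPY-style kernel on the same thread without waiting for the other threads; a MatMult would need the whole vector and so would have to wait.</div>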

</div>