[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Karl Rupp rupp at mcs.anl.gov
Sun Oct 7 21:17:27 CDT 2012


Hi,

On 10/06/2012 10:01 PM, Jed Brown wrote:
>     Still, one may 'fake' a std::deque by holding meta-information about
>     the physical memory pages nevertheless, yet allocate a (virtually)
>     linear piece of memory to keep compatibility with XYZGetArray().
>     This would allow some nice optimizations for the threading
>     scheduler, as threads may operate on 'nearer' pages first.
>
>
> So we sort of have this in the affinity associated with the thread
> communicator. We expect first-touch semantics (I wish we could use
> Linux's libnuma to be more explicit) so the page mapping can be computed
> using the page size and affinity.
>
> In my opinion, the critical issue of separate versus unified memory
> arises in kernels like MatMult. If we partition the rows of the matrix
> among threads (I don't want to assume a contiguous partition because I
> want to share cache, but this argument works anyway), we generally need
> access to parts of the vector that are not in our local segment. In
> general, we may need to access some entries in all segments of the vector.
> If we are not going to expose vectors via contiguous storage, we will be
> obligated to partition our matrices into explicit column segments, add a
> level of indirection to vector access, or explicitly copy the ghost
> parts into thread-local buffers.

A proper partition should ideally be locality-aware. Instead of the 
standard GPU approach
   for (size_t row = thread_id; row < size; row += thread_num) {...}
the much better approach on the CPU is to reuse caches (I think this is 
what you referred to as a contiguous partition):
   for (size_t row = thread_id * row_block_size;
        row < (thread_id+1) * row_block_size;
        ++row) {...}
Here it is assumed that the index difference between threads is a 
measure of their 'distance', and that the matrix is reordered such that 
it has small bandwidth. With two CPU sockets, each providing two cores 
with two SMT threads, one could use the numbering:
  Socket 0: Core 0: Threads: {0,1}
            Core 1: Threads: {2,3}
  Socket 1: Core 0: Threads: {4,5}
            Core 1: Threads: {6,7}
Similar schemes apply for higher core counts and four-socket systems.
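
To make this concrete, here is a minimal sketch of such a contiguous 
row-block partition for a CSR matrix-vector product, assuming OpenMP 
and threads pinned according to the numbering above (function and 
variable names are just for illustration, not PETSc API):

  #include <stddef.h>
  #include <omp.h>

  /* Each thread handles a contiguous block of rows, so neighboring
   * threads (which share caches in the numbering above) touch
   * neighboring parts of the vector. */
  void csr_spmv_blocked(size_t nrows, const size_t *row_ptr,
                        const size_t *col_idx, const double *vals,
                        const double *x, double *y)
  {
    #pragma omp parallel
    {
      size_t tid      = (size_t)omp_get_thread_num();
      size_t nthreads = (size_t)omp_get_num_threads();
      size_t block    = (nrows + nthreads - 1) / nthreads;
      size_t start    = tid * block;
      size_t end      = (start + block < nrows) ? start + block : nrows;

      for (size_t row = start; row < end; ++row) {
        double sum = 0.0;
        for (size_t j = row_ptr[row]; j < row_ptr[row+1]; ++j)
          sum += vals[j] * x[col_idx[j]];  /* small bandwidth keeps x[] accesses local */
        y[row] = sum;
      }
    }
  }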

I don't think that one wants to explicitly copy the 'ghost elements' of 
the vector (which would be *required* with MPI); instead, one can just 
delegate this task to the memory controller.


> I don't like any of these options because of implementation complexity
> and because the ghost regions are much larger than the interiors as we
> strong scale. Note that the principal reason for using threads in the
> first place was to avoid needing these copies. If we were going to
> always copy into local buffers, we could use an MPI-type model all the
> way down. The defining feature of threads is that they can use a common
> address space.

I agree that explicit copies here are a burden. However, we are not 
'required' to copy, as the first-touch policy keeps the data at the 
respective location anyway. The main benefit of keeping a page record 
would be better thread control: Assume you have two vacant threads, one 
at memory link 0, the other at memory link 1. If a job is submitted, 
each thread can start working on the data that is closest to it first, 
and eventually continue with data that was used recently and may still 
sit in that thread's cache. I agree that this is rather low-level 
tinkering, but memory hierarchies are not expected to flatten out, nor 
are memory links expected to increase in speed substantially.
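
For reference, a minimal sketch of how one typically relies on the 
first-touch policy (again just an illustration, not PETSc code): the 
pages of the array end up on the memory link of the thread that writes 
them first, so later kernels using the same static partition find their 
data on the local link.

  #include <stdlib.h>

  /* First-touch placement: the first write to each page decides on
   * which memory link the page is allocated. */
  double *alloc_first_touch(size_t n)
  {
    double *x = malloc(n * sizeof(*x));
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
      x[i] = 0.0;         /* same static partition as the later compute kernels */
    return x;
  }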


> MatMultTranspose is somewhat more complicated since the output space is
> overlapping.

If the matrix bandwidth is not too large, we can aim for similar 
locality benefits, at least when compared to assigning threads randomly 
to the individual subtasks.


>
>     Btw: Shri's threading communicator is almost identical to the OpenCL
>     model (with the latter having a few additional capabilities).
>
>
> Can you elaborate on this? It is still a good time to refine our
> threadcomm model.

Sure. I'll first discuss the theoretical model, then go into some 
practical aspects:
Each OpenCL runtime defines its own platform, consisting of the devices 
the runtime can deal with. If you have the SDKs from AMD, NVIDIA, and 
Intel installed, you will see three platforms: the first will be 
equipped with AMD GPUs and x86 CPUs, the second will hold NVIDIA GPUs, 
and the third will hold x86 CPUs only. Within each platform you can 
create various contexts. A context is a collection of devices from the 
platform and holds the memory buffers. It is important to note here 
that memory buffers are not explicitly assigned to devices; rather, 
they may float around among the devices within the context. Each device 
may be equipped with an arbitrary number of command queues. Command 
queues are in-order by default, but can also be operated out-of-order. 
A job (essentially any operation accessing a buffer) needs to be 
submitted to a command queue. As command queues are linked to devices, 
the command queue also defines the device on which the job is executed.
One may also submit 'events' to a command queue, which can then be used 
for synchronization purposes in order to enforce a certain order in the 
processing of jobs (different command queues are not synchronized). 
Recently, the standard also introduced the possibility of splitting up 
a device in order to reserve resources for e.g. prioritized jobs.
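
For illustration, a condensed host-code sketch of this hierarchy using 
only standard OpenCL API calls (error handling omitted):

  #include <CL/cl.h>

  /* platform -> device -> context -> command queue */
  void opencl_setup_sketch(void)
  {
    cl_platform_id   platform;
    cl_device_id     device;
    cl_context       ctx;
    cl_command_queue queue;

    clGetPlatformIDs(1, &platform, NULL);          /* one platform per vendor runtime */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL,
                   1, &device, NULL);

    ctx   = clCreateContext(NULL, 1, &device,      /* buffers live in the context */
                            NULL, NULL, NULL);
    queue = clCreateCommandQueue(ctx, device,      /* in-order by default */
                                 0, NULL);

    /* Jobs (kernel launches, buffer transfers) are enqueued into 'queue';
     * the cl_event objects returned by clEnqueue* calls can be used to
     * enforce an order across queues. */

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
  }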

Now for some practical aspects:
Placing all available devices into a single context can be a bad idea, 
because the kernels are compiled for all devices within a context. Thus, 
if e.g. one device does not support double precision, the whole context 
does not support it. Also, up until OpenCL 1.2 there was no mechanism 
to manually move a buffer to a device ahead of time so that it is 
already in place when a later computation needs it. For these reasons, 
I prefer to associate just one device with 
each context, which effectively yields the CUDA model.
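
In code, this preference simply means creating the contexts in a loop 
over the devices instead of passing the whole device list to a single 
clCreateContext() call (again just a sketch, error handling omitted):

  #include <CL/cl.h>

  /* One context per device: kernels are compiled only for the device
   * they actually run on, and buffers cannot silently migrate between
   * heterogeneous devices. */
  void one_context_per_device(cl_platform_id platform)
  {
    cl_uint      ndev;
    cl_device_id devices[16];

    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
    if (ndev > 16) ndev = 16;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, ndev, devices, NULL);

    for (cl_uint i = 0; i < ndev; ++i) {
      cl_context ctx = clCreateContext(NULL, 1, &devices[i], NULL, NULL, NULL);
      /* ... create command queues, build programs, allocate buffers ... */
      clReleaseContext(ctx);
    }
  }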

Now, comparing this with the current threadcomm, the threadcomm may 
benefit from multiple job queues. I'm thinking about a similar 
device-association, i.e. one queue is dedicated to data in RAM, while 
other queues are dedicated to accelerators. I don't know right now 
whether there is a mechanism to specify the desired number of workers in 
threadcomm - if not, this should be added too.
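
To illustrate what I have in mind (purely hypothetical types and names, 
none of this exists in threadcomm today):

  /* Sketch: several job queues, each bound to a memory/execution
   * resource, plus an explicit worker count per queue. */
  typedef enum { QUEUE_HOST_RAM, QUEUE_ACCELERATOR } QueueKind;

  typedef struct {
    QueueKind kind;        /* where the data served by this queue lives */
    int       nworkers;    /* desired number of workers for this queue  */
    void    (*submit)(void (*job)(void*), void *args);  /* enqueue a job */
  } JobQueue;

  typedef struct {
    JobQueue *queues;      /* queues[0] for RAM, queues[1..] for devices */
    int       nqueues;
  } ThreadCommSketch;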

However, with such an extension of threadcomm we should be careful to 
keep the job scheduler part as orthogonal as possible to the actual 
threading kernels operating on the data.


>
>     A bit of a spoiler for the actual job runtime (more brainstorming
>     than complete suggestions):
>     I can imagine submitting the Vec, the IS, and the type of job to the
>     scheduler, possibly including some hints on the type of operations
>     to follow. One may even enforce a certain type of device here, even
>     though this requires the scheduler to move the data in place first.
>     In this way one can perform smaller tasks on the respective CPU core
>     (if we keep track of affinity information), and offload larger tasks
>     to an available accelerator if possible. (Note that this is the main
>     reason why I don't want to hide buffers in library-specific derived
>     classes of Vec). The scheduler can use simple heuristics on where to
>     perform operations based on typical latencies (e.g. ~20us for a GPU
>     kernel)
>
>
> Where does the scheduler go in this approach? Some users have a TS with
> a small number of degrees of freedom (e.g. reduced chemistry), so we
> want to keep "nothread" kernel launch down in the range similar to an
> indirect function call.

For such small systems it is probably even too expensive to hand the job 
over to another thread. Otherwise, one may explicitly assign a command 
queue to a particular thread.


> I think it would be good for the GPU subtype to have a threshold and
> fall back to a CPU implementation for small enough sizes, but I'm
> reluctant to have a scheduler looking at all Vec implementations.

I agree that a scheduler is bogus for small operations. One may instead 
grab a thread from the pool and let it iterate in some user function 
with a bunch of small operations (now scheduler-free!), thus virtually 
'packing' operations together.
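
A rough sketch of this 'packing' idea (illustrative names only, and a 
plain pthread standing in for a pinned thread from the pool):

  #include <pthread.h>

  /* The user function contains the whole batch of small operations;
   * only the batch as a whole goes through thread management. */
  typedef struct {
    void (*user_func)(void *ctx);
    void  *ctx;
  } SmallOpBatch;

  static void *run_batch(void *arg)
  {
    SmallOpBatch *batch = (SmallOpBatch*)arg;
    batch->user_func(batch->ctx);    /* all small ops run here, scheduler-free */
    return NULL;
  }

  void submit_small_ops(void (*user_func)(void*), void *ctx)
  {
    pthread_t    worker;
    SmallOpBatch batch = {user_func, ctx};
    pthread_create(&worker, NULL, run_batch, &batch);
    pthread_join(&worker, NULL);
  }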


>     Yes, we could/can. A single kernel launcher also allows for fusing
>     kernels, e.g. matrix-vector-product followed by an inner product of
>     the result vector with some other vector. As outlined above,
>     asynchronous data movement could even be the default rather than the
>     exception, except for cases where one gives control over the data to
>     the outside by e.g. returning a pointer to the array. In such cases
>     one would first have to wait for all operations on all data to finish.
>
>     The main concern in all that is the readiness of the user. Awareness
>     for asynchronous operations keeps rising, yet I can imagine user
>     code like
>
>       PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably
>       data[0] = VecDot(v2, v3);                // enqueues VecDot
>       PetscScalar s = data[0];                 // VecDot may not be finished!
>
>     where a pointer given away once undermines everything.
>
>
> This is why the VecDot kernel does not return a raw result. You can call
> VecDot_kernel() collectively from another kernel, but you pass in a
> PetscThreadReduction and only reduce it when you need the result.

Hmm, right, VecDot was a bad example, as it uses reduction... Anyway, 
feel free to replace VecDot with any other non-collective operation.

Best regards,
Karli



