[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Jed Brown jedbrown at mcs.anl.gov
Mon Oct 8 16:07:10 CDT 2012


On Sun, Oct 7, 2012 at 9:17 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

>
> A proper partition should ideally be locality-aware. Instead of the standard
> GPU approach
>   for (size_t row = thread_id; row < size; row += thread_num) {...}
> the much better approach on the CPU is to reuse caches (I think this is
> what you referred to as a contiguous partition):
>   for (size_t row = thread_id * row_block_size;
>        row < (thread_id+1) * row_block_size;
>        ++row) {...}
>

Yes, but in the case of a stencil operation (or matrix), there is reuse
between neighbors. If we are on BG/Q, there are 4 hardware threads sharing
an L1. Each hardware thread has its own allocation of outstanding memory
requests, so even if you are memory bandwidth limited, you have to use the
threads. In that case, a good strategy is to block by cores, but then
interlace at cache line granularity among the four threads sharing L1.
These would naturally get out of phase, but if each thread prefetches for
its neighbor (my_rank+1)%4, they stay loosely synchronized, thus sharing
bandwidth and L1. You can do the same sharing for L2 and L3 depending on
the memory hierarchy.
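
For concreteness, here is a minimal sketch (plain C with illustrative names
such as LINE, row_start, row_end, and hwt; not PETSc code) of blocking rows
by core and interlacing at cache-line granularity among the four hardware
threads that share an L1:

#define LINE 8  /* doubles per 64-byte cache line */

/* hwt is the hardware thread index 0..3 within the core;
   [row_start,row_end) is the core's contiguous row block */
void core_kernel(double *y,const double *x,int row_start,int row_end,int hwt)
{
  for (int i = row_start + hwt*LINE; i < row_end; i += 4*LINE) {
    int end = (i + LINE < row_end) ? i + LINE : row_end;
    /* each thread could additionally prefetch the lines that thread
       (hwt+1)%4 will touch next, keeping the four threads loosely in sync */
    for (int j = i; j < end; j++) y[j] += x[j];
  }
}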


>
> I don't think that one wants to explicitly copy 'ghost elements' in the
> vector (which would be *required* with MPI); it is better to just delegate
> this task to the memory controller.
>
>
Yes, but delegating to the memory controller requires that the vector is
_contiguous_ in virtual memory. We have to create the whole thing with one
malloc and not store separate pointers for each segment. We can keep
track of which memory bus each page should be located on (according to
first-touch; we can guarantee it using libnuma on Linux).
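
A minimal sketch of the one-malloc, first-touch scheme (assuming OpenMP; not
PETSc code, error checking omitted): each thread zeroes the pages it will own
so the kernel maps them to that thread's memory bus; libnuma's
numa_tonode_memory() can pin pages explicitly if first-touch is not enough.

#include <stdlib.h>
#include <omp.h>

double *alloc_vector_first_touch(size_t n)
{
  double *x = malloc(n*sizeof(double));  /* one contiguous allocation */
  #pragma omp parallel
  {
    size_t t = omp_get_thread_num(),nt = omp_get_num_threads();
    size_t lo = n*t/nt,hi = n*(t+1)/nt;
    for (size_t i = lo; i < hi; i++) x[i] = 0.0;  /* first touch places the page */
  }
  return x;
}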


>
> I agree that explicit copies here are a burden. However, we are not
> 'required' to copy, as the first-touch policy keeps data at the respective
> location anyway. The main benefit of keeping a page record would be a
> better thread control: Assume you have two vacant threads, one on memory
> link 0, the other at memory link 1. If a job is submitted, the two threads
> can start working on data that is close first, and eventually continue to
> operate on data that has recently been used and may still be in the thread's cache. I
> agree that this is rather low-level tinkering, but memory hierarchies are
> not expected to flatten out, neither are memory links expected to increase
> in speed substantially.


We're agreeing. I thought you were advocating explicit segmentation of
vectors within an MPI process, which I think does not provide useful value
and adds significant complexity.

> If the matrix bandwidth is not too large, we can aim for similar locality
> benefits, at least when compared to assigning threads randomly to the
> individual subtasks.


Sure, but you need either locks, atomics, or explicit segmentation to
avoid conflicting writes.
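
Two standard ways to handle this, sketched with OpenMP (illustrative code,
not PETSc):

#include <omp.h>

/* (a) atomics: correct for arbitrary overlap, but conflicting updates serialize */
void scatter_add_atomic(double *y,const int *idx,const double *v,int n)
{
  #pragma omp parallel for
  for (int k = 0; k < n; k++) {
    #pragma omp atomic
    y[idx[k]] += v[k];
  }
}

/* (b) explicit segmentation: each thread is the only writer to its owned
   range [start,end), so no synchronization is needed inside the loop */
void axpy_owned(double *y,const double *x,double a,int start,int end)
{
  for (int i = start; i < end; i++) y[i] += a*x[i];
}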


>> Can you elaborate on this? It is still a good time to refine our
>> threadcomm model.
>>
>
> Sure. I'll first discuss the theoretical model, then go into some
> practical aspects:
> Each OpenCL runtime defines its own platform, consisting of devices the
> runtime can deal with. If you have SDKs from AMD, NVIDIA, and Intel
> installed, you will see three platforms. The first will be equipped with
> AMD GPUs and x86 CPUs, the second will hold NVIDIA GPUs, and the third will
> hold x86 CPUs only. Within each platform you can create various contexts. A
> context is a collection of devices from the platform and holds the memory
> buffers. It is important to note here that memory buffers are not explicitly
> assigned to devices; rather, they may float around among the devices within
> the context. Each device may be equipped with an arbitrary number of
> command queues. Command queues are in-order by default, but can also be
> operated out-of-order. A job (essentially any operation accessing a buffer)
> needs to be submitted to a command queue. As command queues are linked to
> devices, the command queue also defines the device on which to execute the
> job.
> One may also submit 'events' to a command queue, which can then be used
> for synchronization purposes in order to enforce a certain 'order' in the
> processing of jobs (different command queues are not sync'd). Recently,
> they have also introduced the possibility to split up a device in order to
> reserve resources for e.g. prioritized jobs.
>
> Now for some practical aspects:
> Placing all available devices into a single context can be a bad idea,
> because the kernels are compiled for all devices within a context. Thus, if
> e.g. one device does not support double precision, the whole context does
> not support it. Also, up until OpenCL 1.2 there was no mechanism to
> manually shift buffers to a device in order to compute at some later stage.
> For these reasons, I prefer to associate just one device with each context,
> which effectively yields the CUDA model.
>

Thanks.
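
For reference, a minimal sketch of the one-context-per-device setup you
describe (standard OpenCL 1.x host API; error handling omitted):

#include <CL/cl.h>

cl_context context_for_device(cl_platform_id platform,cl_device_id device)
{
  cl_context_properties props[] = {CL_CONTEXT_PLATFORM,(cl_context_properties)platform,0};
  cl_int err;
  /* kernels built in this context are compiled for this one device only */
  return clCreateContext(props,1,&device,NULL,NULL,&err);
}

cl_command_queue queue_for_device(cl_context ctx,cl_device_id device)
{
  cl_int err;
  /* in-order by default; pass CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as the
     properties argument for an out-of-order queue */
  return clCreateCommandQueue(ctx,device,0,&err);
}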


>
> Now, comparing this with the current threadcomm, the threadcomm may
> benefit from multiple job queues. I'm thinking about a similar
> device-association, i.e. one queue is dedicated to data in RAM, while other
> queues are dedicated to accelerators.


Hmm, seems like a decision about where the indirection goes. What is the
advantage of routing dispatch to a GPU through the threadcomm? The "kernel
code" is different in each case so it sounds like it would be much harder
to call.


> I don't know right now whether there is a mechanism to specify the desired
> number of workers in threadcomm - if not, this should be added too.
>

My plan was to follow the MPI model and provide comm "split" routines. Note
that this is more precise control than "number of workers" since you can
isolate independent tasks, each of which is parallel, and guarantee that
they run wherever their memory is.
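
The MPI analogue, for concreteness (MPI_Comm_split is the real API; the color
assignment is just an illustration of isolating two independent tasks):

#include <mpi.h>

int main(int argc,char **argv)
{
  MPI_Comm task_comm;
  int rank;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_split(MPI_COMM_WORLD,rank%2,rank,&task_comm); /* two independent tasks */
  /* each task now runs in parallel on its own sub-communicator, on the ranks
     (and therefore the memory) assigned to it */
  MPI_Comm_free(&task_comm);
  MPI_Finalize();
  return 0;
}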


>
> However, with such an extension of threadcomm we should be careful to
> keep the job scheduler part as orthogonal as possible to the actual
> threading kernels operating on data.
>
>
Hmm, one reason for the current structure was so that we could build fused
operations. For example, we could have

PetscErrorCode MatMultFused_kernel(PetscInt trank,Mat A,Vec X,Vec Y,Vec Z,Vec W,
                                   const PetscScalar *alpha,PetscThreadReduction dot)
{
  MatMult_kernel(trank,A,X,Y);          /* Y = A*X */
  VecPointwiseMult_kernel(trank,Y,Y,Z); /* Y = Y .* Z (pointwise) */
  VecAYPX_kernel(trank,Y,alpha,X);      /* Y = X + alpha*Y */
  VecDot_kernel(trank,Y,W,dot);         /* accumulate the local part of Y . W into dot */
  return 0;
}

Note that coherent access to X is required to start this operation, but
there are no other dependencies between the threads. This avoids going
through the queue to perform this sequence of operations. It is a feature
that the _kernel routines can be called directly from other kernels in
addition to being called by submitting them to the queue.


Can you sketch what you have in mind for keeping scheduling orthogonal to
the kernel implementations? It seems to me that the local code will really
be different, especially for matrix operations like factorization that have
inter-thread data dependencies.


> I agree that a scheduler is bogus for small operations. One may instead
> grab a thread from the pool and let it iterate in some user function with a
> bunch of small operations (now scheduler-free!), thus virtually 'packing'
> operations together.


Our approach to this was to (a) create a threadcomm of size 1 (which could
literally be the control thread) or (b) use the "nothread" implementation
which just calls the function pointer directly. When compiled without
threading, the entire kernel invocation can be inlined all the way to the
point that the kernel code (if defined in the same compilation unit) can
also be inlined.
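
In generic terms (illustrative names, not the actual threadcomm interface),
the "nothread" path is just:

typedef int (*KernelFn)(int trank,void *ctx);

static inline int RunKernel_NoThread(KernelFn f,void *ctx)
{
  return f(0,ctx); /* single "thread" 0: no queue, no scheduler; fully
                      inlinable when f is visible in the same compilation unit */
}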

>> This is why the VecDot kernel does not return a raw result. You can call
>> VecDot_kernel() collectively from another kernel, but you pass in a
>> PetscThreadReduction and only reduce it when you need the result.
>>
>
> Hmm, right, VecDot was a bad example, as it uses reduction... Anyway, feel
> free to replace VecDot with any other non-collective operation.
>

Okay, but these modify objects. Consider VecAXPY, which does not block for
completion. My intent was that _incoherent_ access (access only to the part
owned by each thread), such as needed by a subsequent VecDot, would not
cause blocking, but _coherent_ vector access (as needed by MatMult)
would block on the last modification. Shri and I talked about this a while
back and we have pseudocode in emails if not in committed code.
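
For reference, a self-contained toy sketch of that rule (all names and types
here are illustrative, not PETSc API or the pseudocode from those emails):

#include <stddef.h>

typedef enum {ACCESS_INCOHERENT,ACCESS_COHERENT} AccessMode;
typedef struct {int pending;} Kernel;                 /* stand-in for an unfinished kernel */
typedef struct {double *array; Kernel *lastmod;} ToyVec;

static void WaitForKernel(Kernel *k) { k->pending = 0; /* would block until completion */ }

/* Coherent access (e.g. MatMult reading entries owned by other threads) waits
   for the last pending modification, such as an unfinished VecAXPY; incoherent
   access (e.g. VecDot over the locally owned part) does not block. */
static void toy_access_begin(ToyVec *x,AccessMode mode)
{
  if (mode == ACCESS_COHERENT && x->lastmod) {
    WaitForKernel(x->lastmod);
    x->lastmod = NULL;
  }
}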