[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Karl Rupp rupp at mcs.anl.gov
Fri Oct 5 17:50:16 CDT 2012


Dear petsc-dev'ers,

I'll start my undertaking of a common infrastructure for linear algebra 
operations with a first look at managing memory. Even though this is 
presumably less complex than the actual execution model, there are 
still a number of subtleties involved. Some 
introductory information is also given in order to provide the necessary 
context (and to make sure I haven't misinterpreted something).

-- 1. Introduction --

Let's begin with the current data structure of a Vec (some comments 
shortened to make everything fit into one line):

struct _p_Vec {
   PETSCHEADER(struct _VecOps);
   PetscLayout            map;
   void                   *data;     /* implementation-specific data */
   PetscBool              array_gotten;
   VecStash               stash,bstash; /* storing off-proc values */
   PetscBool              petscnative;  /* ->data: VecGetArrayFast()*/
#if defined(PETSC_HAVE_CUSP)
   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
   void                   *spptr;          /* GPU data handler */
#endif
};

In a purely CPU-driven execution, there is a pointer to the data 
(*data), which is assumed to reside in a single linear piece of memory 
(please correct me if I'm wrong), yet may be managed by some external 
routines (VecOps).

As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the 
concept of a vector having one pointer to its data is undermined. A 
Vec can now have data in CPU RAM and on one CUDA accelerator (or on 
several with txpetscgpu). 'valid_GPU_array' indicates which of the 
two memory domains holds the most recent data, possibly both.
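For reference, this flag is a simple four-state enum along the 
following lines (quoting from memory, so the exact names may differ 
slightly):

   typedef enum {PETSC_CUSP_UNALLOCATED,  /* no GPU buffer allocated yet */
                 PETSC_CUSP_GPU,          /* GPU copy is the most recent */
                 PETSC_CUSP_CPU,          /* CPU copy is the most recent */
                 PETSC_CUSP_BOTH          /* both copies are in sync     */
                } PetscCUSPFlag;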

-- 2. Shortcomings of the Current Model --

First, the additional preprocessor directive for supporting a dual 
memory domain is a clear sign that this is an add-on to a 
single-memory-domain model rather than a well-designed 
multi-memory-domain model. If OpenCL support is to be added, one would 
have to reuse 'valid_GPU_array' and 'spptr' and thus end up with a 
model supporting either OpenCL or CUDA, but not both.

The second subtlety involves the physical location of data. OpenCL and 
CUDA provide options for CPU-mapped memory, i.e. the synchronization 
logic can be deferred to the respective drivers. Still, one would have 
to manage a pair of {CPU pointer; GPU handle} rather than a single 
pointer. Also, the current welding of CPU and GPU in AMD's Llano and 
Trinity ultimately leads to a single pointer to main RAM for both 
parts of the device. Either way, splitting the memory handle storage 
across *data and *spptr rather than keeping it in a single place would 
clearly prevent any such unified handling of memory locations.
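To illustrate the {CPU pointer; GPU handle} pair, this is roughly how 
mapped (zero-copy) memory is obtained with the CUDA runtime API for a 
vector of local length n (illustrative sketch only, error checking 
omitted):

   /* requires cuda_runtime.h and cudaSetDeviceFlags(cudaDeviceMapHost) */
   PetscScalar *host_ptr;    /* pointer valid on the host               */
   PetscScalar *device_ptr;  /* alias of the same memory for the device */

   cudaHostAlloc((void**)&host_ptr, n*sizeof(PetscScalar), cudaHostAllocMapped);
   cudaHostGetDevicePointer((void**)&device_ptr, host_ptr, 0);
   /* the driver keeps the two views coherent, yet both handles need to be stored */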

Third, *spptr is not actually referring to a GPU memory handle, but is 
instead pointing to full memory handlers (GPUarray in the single-GPU 
case, GPUvector with txpetscgpu). However, such functionality is 
better placed in VecOps rather than out-sourced via *spptr along with 
all the management logic, particularly as VecOps is intended to 
accomplish just that.


-- 3. Proposed Modifications --

I'm proposing to drop the lines
  #if defined(PETSC_HAVE_CUSP)
   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
  #endif
from the definition of a Vec, and similarly for Mat. As *spptr seems 
to be in use for other purposes in Mat, one may keep *spptr in Vec for 
reasons of uniformity, but leave it unused for accelerator purposes.

As for the handling of data, I suggest an extension of the current data 
container, currently defined by

#define VECHEADER                          \
   PetscScalar *array;                      \
   PetscScalar *array_allocated;            \
   PetscScalar *unplacedarray;

The first option is to use preprocessor magic to inject pointers to 
accelerator handles (and appropriate use-flags) one after another 
directly into VECHEADER.
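For illustration, such an injection could look roughly as follows (all 
names are placeholders):

  #if defined(PETSC_HAVE_CUDA)
  #define PETSC_VEC_CUDA_HEADER   void *cuda_handle; PetscBool cuda_valid;
  #else
  #define PETSC_VEC_CUDA_HEADER
  #endif
  /* ...analogous macros for OpenCL, etc. ... */

  #define VECHEADER                          \
    PetscScalar *array;                      \
    PetscScalar *array_allocated;            \
    PetscScalar *unplacedarray;              \
    PETSC_VEC_CUDA_HEADER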

However, as nested preprocessor magic is detrimental to code legibility, 
I prefer the second option, which is to add a generic pointer to a 
struct PetscAcceleratorData.
One is then free to handle all meta information for accelerators in 
PetscAcceleratorData and place suitable enabler-#defines therein. The 
additional indirection from *data into PetscAcceleratorData is not 
problematic for accelerators because of launch overheads on the order of 
10 microseconds. Host-based executions such as OpenMP or a thread pool 
don't need to access the accelerator handles anyway, as they operate in 
main memory provided by *array.
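In that case, VECHEADER would only gain a single additional member, 
something like the following (the field name is just a placeholder):

  #define VECHEADER                            \
    PetscScalar           *array;              \
    PetscScalar           *array_allocated;    \
    PetscScalar           *unplacedarray;      \
    PetscAcceleratorData  *accel_data;  /* NULL if no accelerator is in use */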

The projected definition of PetscAcceleratorData will be something 
similar to
  struct PetscAcceleratorData{
#if defined(PETSC_HAVE_CUDA)
    PetscCUDAHandleDescriptor    *cuda_handles;
    PetscInt                     cuda_handle_num;
#endif
#if defined(PETSC_HAVE_OPENCL)
    PetscOpenCLHandleDescriptor  *opencl_handles;
    PetscInt                     opencl_handle_num;
#endif
    /* etc. */
  };

Here, the PetscXYZHandleDescriptor holds
  - the memory handle,
  - the device ID the handles are valid for, and
  - a flag whether the data is valid
    (cf. valid_GPU_array, but with a much finer granularity).
Additional meta-information such as index ranges can be added as 
needed, cf. Vec_Seq vs. Vec_MPI. Distinct Petsc*HandleDescriptor types 
are expected to be required because the various memory handle types are 
not guaranteed to share a common maximum size across accelerator 
platforms.
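A first sketch of the CUDA and OpenCL descriptors could then look as 
follows (member names are tentative; the CUDA driver API type 
CUdeviceptr is used purely for illustration):

  typedef struct {
    CUdeviceptr  handle;     /* raw CUDA memory handle          */
    PetscInt     device_id;  /* device this handle is valid for */
    PetscBool    valid;      /* data in this buffer up-to-date? */
  } PetscCUDAHandleDescriptor;

  typedef struct {
    cl_mem       handle;     /* OpenCL buffer object            */
    PetscInt     device_id;
    PetscBool    valid;
  } PetscOpenCLHandleDescriptor;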

At this point I have to admit that a few more implementation details 
might show up, yet the proposed model is able to cover the case of 
multiple accelerators from different vendors and provides fine-grained 
meta-information for each buffer.

Similar modifications would be applied to Mat, where the data 
ultimately needs to be mapped to linear pieces of memory again for use 
on accelerators.
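For example, a sequential AIJ matrix in CSR format would then carry 
three such descriptors per accelerator, one for each of its arrays 
(purely illustrative, names are hypothetical):

  /* hypothetical members inside the accelerator data of a CUDA-enabled AIJ matrix */
  PetscCUDAHandleDescriptor  row_offsets;     /* i-array, length nrows+1 */
  PetscCUDAHandleDescriptor  column_indices;  /* j-array, length nnz     */
  PetscCUDAHandleDescriptor  values;          /* a-array, length nnz     */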


-- 4. Concluding remarks --

Even though the mere question of how to hold memory handles is certainly 
less complex than a full unification of actual operations at runtime, 
this first step needs to be done right in order to have a solid 
foundation to build on. Thus, if you guys spot any weaknesses in the 
proposed modifications, please let me know. I tried to align everything 
such that it integrates nicely into PETSc, yet I don't know many of the 
implementation details yet...


Thanks and best regards,
Karli


PS: The reverse-lookup of the vector initialization routines revealed a 
remarkably sophisticated initialization system... Chapeau!



