[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Fri Oct 5 21:31:05 CDT 2012

On Fri, Oct 5, 2012 at 5:50 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Dear petsc-dev'ers,
>
> I'll start my undertaking of a common infrastructure for linear algebra
> operations with a first look at managing memory. Even though this is
> presumably the part with smaller complexity compared to the actual
> execution model, there are still a number of subtleties involved. Some
> introductory information is also given in order to provide the necessary
> context (and to make sure I haven't misinterpreted something).
>
> -- 1. Introduction --
>
> Let's begin with the current data structure of a Vec (some comments
> shortened to make everything fit into one line):
>
> struct _p_Vec {
>   PETSCHEADER(struct _VecOps);
>   PetscLayout            map;
>   void                   *data;     /* implementation-specific data */
>   PetscBool              array_gotten;
>   VecStash               stash,bstash; /* storing off-proc values */
>   PetscBool              petscnative;  /* ->data: VecGetArrayFast()*/
> #if defined(PETSC_HAVE_CUSP)
>   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
>   void                   *spptr;          /* GPU data handler */
> #endif
> };
>
> In a purely CPU-driven execution, there is a pointer to the data (*data),
> which is assumed to reside in a single linear piece of memory (please
> correct me if I'm wrong), yet may be managed by some external routines
> (VecOps).
>

I think you're familiar with this, but I'll comment anyway:

All major PETSc classes use a Delegator design. In Vec, *data points to the
implementation-specific structure. The "native" types start with VECHEADER,
of which the first member is PetscScalar *array. Distinguishing "native"
from other types is simply an optimization because some people like to call
VecGetArray in inner loops. The normal PETSc model is to do an indirect
call (vec->ops->getarray) for anything that may be implemented differently
by different implementations. The "native" flag allows the full
implementation to be inlined, that's all.

Arguably, people should write their code to hoist the VecGetArray out of
the inner-most loop so that the handful of cycles for the indirect call
would not be noticeable.
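
A minimal sketch of the dispatch described above, with stand-in types rather
than the real PETSc definitions. All Demo* names and the indirect_calls
counter are hypothetical and exist only for illustration. The native flag
lets the accessor read the leading array pointer directly; any other type
goes through the function table. DemoVecSum shows the hoisted form, where
the accessor is called once before the loop.

```c
#include <assert.h>
#include <stddef.h>

typedef double Scalar;                 /* stand-in for PetscScalar */
typedef struct DemoVec DemoVec;

/* Function table: one slot per operation an implementation may override. */
typedef struct {
  int (*getarray)(DemoVec *v, Scalar **a);
} DemoVecOps;

/* Implementation data of the "native" type; the array pointer comes first. */
typedef struct {
  Scalar *array;
} DemoVecNative;

struct DemoVec {
  const DemoVecOps *ops;
  void             *data;           /* implementation-specific data */
  int               native;         /* nonzero: data begins with Scalar *array */
  int               indirect_calls; /* demo-only counter */
};

/* Implementation reached through the function table. */
static int DemoGetArray_Native(DemoVec *v, Scalar **a)
{
  v->indirect_calls++;
  *a = ((DemoVecNative *)v->data)->array;
  return 0;
}

static const DemoVecOps demo_native_ops = { DemoGetArray_Native };

/* Inlined fast path for native types, indirect call otherwise. */
static inline int DemoVecGetArray(DemoVec *v, Scalar **a)
{
  if (v->native) {
    *a = *(Scalar **)v->data;       /* first member of the native struct */
    return 0;
  }
  return v->ops->getarray(v, a);
}

/* Hoisted access: one DemoVecGetArray call, then a plain loop. */
static Scalar DemoVecSum(DemoVec *v, size_t n)
{
  Scalar *a = NULL, sum = 0.0;
  size_t  i;
  if (DemoVecGetArray(v, &a)) return 0.0;
  for (i = 0; i < n; i++) sum += a[i];
  return sum;
}
```

With the accessor hoisted like this, the loop pays for at most one indirect
call no matter how long the vector is.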

>
> As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the concept
> of a vector having one pointer to its data is undermined. Now, Vec can
> possibly have data on CPU RAM, and on one (multiple with txpetscgpu) CUDA
> accelerator. 'valid_GPU_array' indicates which of the two memory domains
> holds the most recent data, possibly both.
>

We are just implementing lazy update. Note that spptr only points to the
derived type's structure. The array itself is held inside that structure
and only accessed through that type-specialized interface. (Recall that
VECSEQCUSP inherits from VECSEQ.)

With different devices, we could simply have a valid flag for each device.
When someone does VecDevice1GetArrayWrite(), the flag for all other devices
is marked invalid. When VecDevice2GetArrayRead() is called, the
implementation copies from any valid device to device2. Packing all those
flags as bits in a single int is perhaps convenient, but not necessary.
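
The per-device flag scheme could look roughly like the following sketch. It
is not PETSc code: DemoMultiVec, DemoGetArrayWrite, DemoGetArrayRead and the
fixed-size host arrays standing in for device buffers are all assumptions
made for illustration. The flags are packed as bits in one unsigned, which,
as noted above, is a convenience rather than a requirement.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NDEV 3   /* hypothetical: three memory domains */
#define DEMO_LEN  4

typedef struct {
  double   copy[DEMO_NDEV][DEMO_LEN]; /* stand-ins for per-device buffers */
  unsigned valid;                     /* bit d set: copy[d] is up to date */
  int      transfers;                 /* demo-only counter of copies made */
} DemoMultiVec;

/* Write access on device d: every other copy is marked invalid. */
static double *DemoGetArrayWrite(DemoMultiVec *v, int d)
{
  assert(d >= 0 && d < DEMO_NDEV);
  v->valid = 1u << d;
  return v->copy[d];
}

/* Read access on device d: if d is stale, copy from any valid device. */
static const double *DemoGetArrayRead(DemoMultiVec *v, int d)
{
  int src;
  assert(d >= 0 && d < DEMO_NDEV);
  if (!(v->valid & (1u << d))) {
    for (src = 0; src < DEMO_NDEV; src++) {
      if (v->valid & (1u << src)) break;
    }
    assert(src < DEMO_NDEV);          /* at least one copy must be valid */
    memcpy(v->copy[d], v->copy[src], sizeof v->copy[d]);
    v->valid |= 1u << d;
    v->transfers++;
  }
  return v->copy[d];
}
```

A second read on the same device finds its bit already set and triggers no
transfer, which is the lazy-update behavior.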

>
> -- 2. Shortcomings of the Current Model --
>
> First, the additional preprocessor directive for supporting a
> dual-memory-domain is a clear sign that this is an add-on to a
> single-memory-domain model rather than a well-designed multi-memory-domain
> model. If OpenCL support is to be added, one would end up infiltrating
> 'valid_GPU_array' and 'spptr', and thus with a model supporting either
> OpenCL or CUDA, but not both.
>
> The second subtlety involves the physical location of data. OpenCL and
> CUDA provide options for CPU-mapped memory, i.e. the synchronization logic
> can be deferred to the respective drivers. Still, one would have to manage
> a pair of {CPU pointer; GPU handle} rather than a single pointer. Also, the
> current welding of CPU and GPU in AMD's Llano and Trinity ultimately leads
> to a single pointer to main RAM for both portions of the device. Anyhow, a
> separation of memory handle storage away from a single pointer *data
> towards *data and *spptr would clearly prevent any such unified handling of
> memory locations.
>
> Third, *spptr is not actually referring to a GPU memory handle, but is
> instead pointing to full memory handlers (GPUarray in the single-GPU case,
> GPUvector with txpetscgpu). However, such functionality should better be
> placed in VecOps rather than out-sourcing all the management logic via
> *spptr, particularly as VecOps is intended to accomplish just that.
>
>
> -- 3. Proposed Modifications --
>
> I'm proposing to drop the lines
>  #if defined(PETSC_HAVE_CUSP)
>   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
>  #endif
> from the definition of a Vector and similarly for Mat. As *spptr seems to
> be in use for other stuff in Mat, one may keep *spptr in Vec for reasons of
> uniformity, but keep it unused for accelerator purposes.
>
> As for the handling of data, I suggest an extension of the current data
> container, currently defined by
>
> #define VECHEADER                          \
>   PetscScalar *array;                      \
>   PetscScalar *array_allocated;            \
>   PetscScalar *unplacedarray;
>
> The first option is to use preprocessor magic to inject pointers to
> accelerator handles (and appropriate use-flags) one after another directly
> into VECHEADER.
>
> However, as nested preprocessor magic is detrimental to code legibility, I
> prefer the second option, which is to add a generic pointer to a struct
> PetscAcceleratorData.
> One is then free to handle all meta information for accelerators in
> PetscAcceleratorData and place suitable enabler-#defines therein. The
> additional indirection from *data into PetscAcceleratorData is not
> problematic for accelerators because of launch overheads on the order of 10
> microseconds. Host-based executions such as OpenMP or a thread pool don't
> need to access the accelerator handles anyway, as they operate in main
> memory provided by *array.
>
> The projected definition of PetscAcceleratorData will be something similar
> to
>  struct PetscAcceleratorData{
> #if defined(PETSC_HAVE_CUDA)
>    PetscCUDAHandleDescriptor    *cuda_handles;
>    PetscInt                     cuda_handle_num;
> #endif
> #if defined(PETSC_HAVE_OPENCL)
>    PetscOpenCLHandleDescriptor  *opencl_handles;
>    PetscInt                     opencl_handle_num;
> #endif
>    /* etc. */
>  };
>

I think this stuff (which allows for segmenting the array on the device)
can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
member of Vec_CUSP. Why have a different PetscAcceleratorData struct?

>
> Here, the PetscXYZHandleDescriptor holds
>  - the memory handle,
>  - the device ID the handles are valid for, and
>  - a flag whether the data is valid
>    (cf. valid_GPU_array, but with a much finer granularity).
> Additional metainformation such as index ranges can be extended as needed,
> cf. Vec_Seq vs Vec_MPI. Different Petsc*HandleDescriptor types are
> expected to be required because the various memory handle types are not
> guaranteed to have a particular maximum size among different accelerator
> platforms.
>

It sounds like you want to support marking only part of an array as stale.
We could keep one top-level (_p_Vec) flag indicating whether the CPU
part was current, then in the specific implementation (Vec_OpenCL), you can
hold finer granularity. Then when vec->ops->UpdateCPUArray() is called, you
can look at the finer granularity flags to copy only what needs to be
copied.
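
As a sketch of that split, assuming a vector cut into fixed-size segments:
one top-level flag answers "is the CPU array current?", while per-segment
flags in the implementation struct record what actually changed on the
device. Every name here (DemoSegVec, DemoVecImpl, DemoDeviceWrite,
DemoUpdateCPUArray) is hypothetical, and plain host arrays stand in for
device buffers.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NSEG    3
#define DEMO_SEG_LEN 2

/* Hypothetical implementation struct (the role Vec_OpenCL plays above). */
typedef struct {
  double device[DEMO_NSEG][DEMO_SEG_LEN]; /* stand-in for device buffers */
  int    cpu_stale[DEMO_NSEG];            /* segment changed on the device */
} DemoVecImpl;

typedef struct {
  double      array[DEMO_NSEG * DEMO_SEG_LEN]; /* host copy */
  int         cpu_valid;                       /* single top-level flag */
  DemoVecImpl impl;
  int         segments_copied;                 /* demo-only counter */
} DemoSegVec;

/* Device-side write to one segment: only that segment becomes stale. */
static void DemoDeviceWrite(DemoSegVec *v, int seg, double value)
{
  int i;
  assert(seg >= 0 && seg < DEMO_NSEG);
  for (i = 0; i < DEMO_SEG_LEN; i++) v->impl.device[seg][i] = value;
  v->impl.cpu_stale[seg] = 1;
  v->cpu_valid = 0;
}

/* Bring the host array up to date, copying only the stale segments. */
static void DemoUpdateCPUArray(DemoSegVec *v)
{
  int seg;
  if (v->cpu_valid) return;
  for (seg = 0; seg < DEMO_NSEG; seg++) {
    if (!v->impl.cpu_stale[seg]) continue;
    memcpy(v->array + seg * DEMO_SEG_LEN, v->impl.device[seg],
           sizeof v->impl.device[seg]);
    v->impl.cpu_stale[seg] = 0;
    v->segments_copied++;
  }
  v->cpu_valid = 1;
}
```

Generic code only ever tests cpu_valid; the finer bookkeeping stays private
to the implementation.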

>
> At this point I have to admit that a few more implementation details might
> show up, yet the proposed model is able to cover the case of multiple
> accelerators from different vendors and provides fine-grained
> meta-information for each buffer.
>
> Similar modifications would be applied to Mat, where data ultimately needs
> to be mapped to linear pieces of memory again for the use in accelerators.
>
>
> -- 4. Concluding remarks --
>
> Even though the mere question of how to hold memory handles is certainly
> less complex than a full unification of actual operations at runtime, this
> first step needs to be done right in order to have a solid foundation to
> build on. Thus, if you guys spot any weaknesses in the proposed
> modifications, please let me know. I tried to align everything such that it
> integrates nicely into PETSc, yet I don't know many of the implementation
> details yet...
>
>
> Thanks and best regards,
> Karli
>
>
> PS: The reverse-lookup of the vector initialization routines revealed a
> remarkably sophisticated initialization system... Chapeau!
>
>