[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Fri Oct 5 21:22:04 CDT 2012

On Fri, Oct 5, 2012 at 6:50 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Dear petsc-dev'ers,
>
> I'll start my undertaking of a common infrastructure for linear algebra
> operations with a first look at managing memory. Even though this is
> presumably the part with smaller complexity compared to the actual
> execution model, there are still a number of subtleties involved. Some
> introductory information is also given in order to provide the necessary
> context (and to make sure I haven't misinterpreted something).
>
> -- 1. Introduction --
>
> Let's begin with the current data structure of a Vec (some comments
> shortened to make everything fit into one line):
>
> struct _p_Vec {
>   PETSCHEADER(struct _VecOps);
>   PetscLayout            map;
>   void                   *data;     /* implementation-specific data */
>   PetscBool              array_gotten;
>   VecStash               stash,bstash; /* storing off-proc values */
>   PetscBool              petscnative;  /* ->data: VecGetArrayFast()*/
> #if defined(PETSC_HAVE_CUSP)
>   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
>   void                   *spptr;          /* GPU data handler */
> #endif
> };
>
> In a purely CPU-driven execution, there is a pointer to the data (*data),
> which is assumed to reside in a single linear piece of memory (please
> correct me if I'm wrong), yet may be managed by some external routines
> (VecOps).
>

No, the 'data' is actually a pointer to the implementation class (it is
helpful to compare this to other class headers, which all have
the data pointer). In this case, it would be Vec_Seq or Vec_MPI:

http://petsc.cs.iit.edu/petsc/petsc-dev/annotate/0b92fc173218/src/vec/vec/impls/dvecimpl.h#l14

In fact, it is VECHEADER that has the array:

http://petsc.cs.iit.edu/petsc/petsc-dev/annotate/0b92fc173218/include/petsc-private/vecimpl.h#l435

Jed started the practice of linking to code, and I think it's the bee's
knees. You are correct that all these implementations
assume a piece of linear memory on the CPU. On the GPU, we synchronize some
linear memory with Cusp vectors.
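
To make the indirection concrete, here is a minimal C sketch of the layout
described above. All Sketch* names are hypothetical stand-ins, not PETSc
identifiers; the three header fields mirror the VECHEADER quoted further
down in this mail, and everything else is simplified for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef double Scalar;                 /* stand-in for PetscScalar */

/* Mirrors the three VECHEADER fields quoted in this thread. */
#define SKETCH_VECHEADER   \
  Scalar *array;           \
  Scalar *array_allocated; \
  Scalar *unplacedarray;

/* Hypothetical stand-in for an implementation class such as Vec_Seq. */
typedef struct { SKETCH_VECHEADER } SketchVecSeq;

/* Hypothetical stand-in for the generic vector object. */
typedef struct {
  int   n;
  void *data;   /* points to the implementation struct, not to the values */
} SketchVec;

/* Allocate the implementation struct and its linear block of host memory. */
int sketch_vec_create(SketchVec *v, int n)
{
  SketchVecSeq *impl = calloc(1, sizeof(*impl));
  if (!impl) return 1;
  impl->array_allocated = calloc((size_t)n, sizeof(Scalar));
  if (!impl->array_allocated) { free(impl); return 1; }
  impl->array = impl->array_allocated;
  v->n    = n;
  v->data = impl;
  return 0;
}

/* Reaching the values takes two hops: v->data, then ->array. */
Scalar *sketch_vec_array(SketchVec *v)
{
  assert(v && v->data);
  return ((SketchVecSeq *)v->data)->array;
}

void sketch_vec_destroy(SketchVec *v)
{
  SketchVecSeq *impl = v->data;
  if (impl) { free(impl->array_allocated); free(impl); }
  v->data = NULL;
}
```

The point of the sketch is only that 'data' is opaque at the Vec level and
that the linear array lives one level down, in the implementation struct.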

> As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the concept
> of a vector having one pointer to its data is undermined. Now, Vec can
> possibly have data on CPU RAM, and on one (multiple with txpetscgpu) CUDA
> accelerator. 'valid_GPU_array' indicates which of the two memory domains
> holds the most recent data, possibly both.
>
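
For reference, the bookkeeping implied by a single validity flag can be
sketched as below. The enum and function names are hypothetical and only
model the three states mentioned in the quoted text (host copy current,
device copy current, both current); they are not the actual PetscCUSPFlag
values.

```c
#include <assert.h>

/* Hypothetical three-state flag: which memory domain holds current data. */
typedef enum { FLAG_CPU, FLAG_GPU, FLAG_BOTH } SketchValidFlag;

/* A host read needs a device-to-host copy only if the device is ahead. */
int flag_host_read_needs_copy(SketchValidFlag f)   { return f == FLAG_GPU; }

/* A device read needs a host-to-device copy only if the host is ahead. */
int flag_device_read_needs_copy(SketchValidFlag f) { return f == FLAG_CPU; }

/* A write on one side makes the other side stale. */
SketchValidFlag flag_after_host_write(void)   { return FLAG_CPU; }
SketchValidFlag flag_after_device_write(void) { return FLAG_GPU; }

/* After a copy in either direction, both sides agree. */
SketchValidFlag flag_after_copy(SketchValidFlag f)
{
  assert(f == FLAG_CPU || f == FLAG_GPU || f == FLAG_BOTH);
  return FLAG_BOTH;
}
```

With one flag per vector there is exactly one "device" side, so this scheme
has no way to say which of several accelerators holds the current data.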

There is an implementation of PETSc Vecs with non-contiguous memory for
SAMRAI.

> -- 2. Shortcomings of the Current Model --
>
> First, the additional preprocessor directive for supporting a
> dual-memory-domain is a clear sign that this is an add-on to a
> single-memory-domain model rather than a well-designed multi-memory-domain
> model. If OpenCL support is to be added, one would end up overloading
> 'valid_GPU_array' and 'spptr' for it, and thus with a model supporting
> either OpenCL or CUDA, but not both.
>
> The second subtlety involves the physical location of data. OpenCL and
> CUDA provide options for CPU-mapped memory, i.e. the synchronization logic
> can be deferred to the respective drivers. Still, one would have to manage
> a pair of {CPU pointer; GPU handle} rather than a single pointer. Also, the
> current welding of CPU and GPU in AMD's Llano and Trinity ultimately leads
> to a single pointer to main RAM for both portions of the device. Anyhow, a
> separation of memory handle storage away from a single pointer *data
> towards *data and *spptr would clearly prevent any such unified handling of
> memory locations.
>
> Third, *spptr is not actually referring to a GPU memory handle, but is
> instead pointing to full memory handlers (GPUarray in the single-GPU case,
> GPUvector with txpetscgpu). However, such functionality would be better
> placed in VecOps rather than outsourcing all the management logic via
> *spptr, particularly as VecOps is intended to accomplish just that.
>
>
> -- 3. Proposed Modifications --
>
> I'm proposing to drop the lines
>  #if defined(PETSC_HAVE_CUSP)
>   PetscCUSPFlag          valid_GPU_array; /* GPU data up-to-date? */
>  #endif
> from the definition of a Vector and similarly for Mat. As *spptr seems to
> be in use for other stuff in Mat, one may keep *spptr in Vec for reasons of
> uniformity, but keep it unused for accelerator purposes.
>
> As for the handling of data, I suggest an extension of the current data
> container, currently defined by
>
> #define VECHEADER                          \
>   PetscScalar *array;                      \
>   PetscScalar *array_allocated;            \
>   PetscScalar *unplacedarray;
>
> The first option is to use preprocessor magic to inject pointers to
> accelerator handles (and appropriate use-flags) one after another directly
> into VECHEADER.
>
> However, as nested preprocessor magic is detrimental to code legibility, I
> prefer the second option, which is to add a generic pointer to a struct
> PetscAcceleratorData.
> One is then free to handle all meta information for accelerators in
> PetscAcceleratorData and place suitable enabler-#defines therein. The
> additional indirection from *data into PetscAcceleratorData is not
> problematic for accelerators because of launch overheads on the order of 10
> microseconds. Host-based executions such as OpenMP or a thread pool don't
> need to access the accelerator handles anyway, as they operate in main
> memory provided by *array.
>
> The projected definition of PetscAcceleratorData will be something similar
> to
>  struct PetscAcceleratorData{
> #if defined(PETSC_HAVE_CUDA)
>    PetscCUDAHandleDescriptor    *cuda_handles;
>    PetscInt                     cuda_handle_num;
> #endif
> #if defined(PETSC_HAVE_OPENCL)
>    PetscOpenCLHandleDescriptor  *opencl_handles;
>    PetscInt                     opencl_handle_num;
> #endif
>    /* etc. */
>  };
>
> Here, the PetscXYZHandleDescriptor holds
>  - the memory handle,
>  - the device ID the handles are valid for, and
>  - a flag whether the data is valid
>    (cf. valid_GPU_array, but with a much finer granularity).
> Additional meta-information such as index ranges can be added as needed,
> cf. Vec_Seq vs Vec_MPI. Different types of Petsc*HandleDescriptors are
> expected to be required because the various memory handle types are not
> guaranteed to have a particular maximum size among different accelerator
> platforms.
>
> At this point I have to admit that a few more implementation details might
> show up, yet the proposed model is able to cover the case of multiple
> accelerators from different vendors and provides fine-grained
> meta-information for each buffer.
>
> Similar modifications would be applied to Mat, where data ultimately needs
> to be mapped to linear pieces of memory again for the use in accelerators.
>
>
> -- 4. Concluding remarks --
>
> Even though the mere question of how to hold memory handles is certainly
> less complex than a full unification of actual operations at runtime, this
> first step needs to be done right in order to have a solid foundation to
> build on. Thus, if you guys spot any weaknesses in the proposed
> modifications, please let me know. I tried to align everything such that
> it integrates nicely into PETSc, yet I don't know many of the implementation
> details yet...
>

I can't tell from the above how we would synchronize memory. Perhaps it
would be easiest to show an example of how this would work, as opposed to
the current system.
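
One way such an example might look is sketched below. It is purely
illustrative: device memory is simulated with ordinary host buffers, memcpy
stands in for the actual host/device transfers, and every name (SyncVec,
SyncHandleDescriptor, the sync_* functions) is hypothetical rather than part
of the proposal or of PETSc. The only idea taken from the proposal is that
each handle descriptor carries its own device ID and validity flag.

```c
#include <assert.h>
#include <string.h>

typedef double Scalar;              /* stand-in for PetscScalar */

typedef struct {
  Scalar *handle;     /* stand-in for a device memory handle (host memory here) */
  int     device_id;  /* device the handle is valid for */
  int     valid;      /* nonzero if this copy holds the most recent data */
} SyncHandleDescriptor;

typedef struct {
  int                   n;
  Scalar               *array;       /* host copy, cf. VECHEADER */
  int                   host_valid;  /* nonzero if the host copy is current */
  SyncHandleDescriptor *handles;
  int                   handle_num;
} SyncVec;

/* Bring the host copy up to date from any device copy that is valid. */
int sync_to_host(SyncVec *v)
{
  int i;
  if (v->host_valid) return 0;
  for (i = 0; i < v->handle_num; i++) {
    if (v->handles[i].valid) {
      /* In a real backend this would be a device-to-host transfer. */
      memcpy(v->array, v->handles[i].handle, (size_t)v->n * sizeof(Scalar));
      v->host_valid = 1;
      return 0;
    }
  }
  return 1;  /* no valid copy anywhere */
}

/* Bring device copy i up to date, going through the host if necessary. */
int sync_to_device(SyncVec *v, int i)
{
  if (i < 0 || i >= v->handle_num) return 1;
  if (v->handles[i].valid) return 0;
  if (sync_to_host(v)) return 1;
  /* In a real backend this would be a host-to-device transfer. */
  memcpy(v->handles[i].handle, v->array, (size_t)v->n * sizeof(Scalar));
  v->handles[i].valid = 1;
  return 0;
}

/* After a write on the host, every device copy is stale. */
void sync_mark_host_written(SyncVec *v)
{
  int i;
  v->host_valid = 1;
  for (i = 0; i < v->handle_num; i++) v->handles[i].valid = 0;
}

/* After a write on device i, the host and all other devices are stale. */
void sync_mark_device_written(SyncVec *v, int i)
{
  int j;
  assert(i >= 0 && i < v->handle_num);
  v->host_valid = 0;
  for (j = 0; j < v->handle_num; j++) v->handles[j].valid = (j == i);
}
```

In this sketch a kernel on device i would call sync_to_device(v, i) before
reading and sync_mark_device_written(v, i) after writing; host code would use
sync_to_host and sync_mark_host_written in the same way. Compared with a
single per-vector flag, the per-handle flags can express which of several
devices is current, at the cost of a loop over the descriptors.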

   Matt

> Thanks and best regards,
> Karli
>
>
> PS: The reverse-lookup of the vector initialization routines revealed a
> remarkably sophisticated initialization system... Chapeau!
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener