[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Fri Oct 5 17:50:16 CDT 2012
Dear petsc-dev'ers,
I'll start my undertaking of a common infrastructure for linear algebra
operations with a first look at managing memory. Even though this is
presumably the part with smaller complexity compared to the actual
execution model, there are still a number of subtleties involved. Some
introductory information is also given in order to provide the necessary
context (and to make sure I haven't misinterpreted something).
-- 1. Introduction --
Let's begin with the current data structure of a Vec (some comments
shortened to make everything fit on one line):
struct _p_Vec {
  PETSCHEADER(struct _VecOps);
  PetscLayout    map;
  void           *data;            /* implementation-specific data */
  PetscBool      array_gotten;
  VecStash       stash,bstash;     /* storing off-proc values */
  PetscBool      petscnative;      /* ->data: VecGetArrayFast() */
#if defined(PETSC_HAVE_CUSP)
  PetscCUSPFlag  valid_GPU_array;  /* GPU data up-to-date? */
  void           *spptr;           /* GPU data handler */
#endif
};
In a purely CPU-driven execution, there is a pointer to the data
(*data), which is assumed to reside in a single linear piece of memory
(please correct me if I'm wrong), yet may be managed by some external
routines (VecOps).
As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the
concept of a vector having one pointer to its data is undermined. Now, a
Vec can have data in CPU RAM and on one CUDA accelerator (or on multiple
accelerators with txpetscgpu). 'valid_GPU_array' indicates which of the
two memory domains holds the most recent data, possibly both.
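For reference, that flag is an enum along the following lines (quoted
from memory, so please double-check against the actual headers):

typedef enum {PETSC_CUSP_UNALLOCATED,
              PETSC_CUSP_GPU,
              PETSC_CUSP_CPU,
              PETSC_CUSP_BOTH} PetscCUSPFlag;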
-- 2. Shortcomings of the Current Model --
First, the additional preprocessor directive for supporting a dual
memory domain is a clear sign that this is an add-on to a
single-memory-domain model rather than a well-designed
multi-memory-domain model. If OpenCL support is to be added, one would
end up infiltrating 'valid_GPU_array' and 'spptr' for that purpose as
well, and thus with a model supporting either OpenCL or CUDA, but not both.
The second subtlety involves the physical location of data. OpenCL and
CUDA provide options for CPU-mapped memory, i.e. the synchronization
logic can be deferred to the respective drivers. Still, one would have
to manage a pair of {CPU pointer; GPU handle} rather than a single
pointer. Also, the current welding of CPU and GPU in AMD's Llano and
Trinity ultimately leads to a single pointer to main RAM for both parts
of the device. Either way, splitting memory handle storage between
*data and *spptr clearly prevents any such unified handling of memory
locations.
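To make the {CPU pointer; GPU handle} pair concrete, here is a minimal
CUDA sketch of zero-copy ('mapped') memory; the helper name is made up
and error checking is omitted:

#include <cuda_runtime.h>

/* Allocate pinned, mapped host memory and obtain the matching device
 * pointer. Both pointers refer to the same physical memory, yet both
 * must be stored: the CPU dereferences *host_ptr, while kernels have
 * to be launched with *device_ptr. */
static void AllocateMappedMemory(size_t n, double **host_ptr, double **device_ptr)
{
  cudaHostAlloc((void**)host_ptr, n*sizeof(double), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)device_ptr, *host_ptr, 0);
}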
Third, *spptr does not actually refer to a GPU memory handle, but
instead points to a full memory handler (GPUarray in the single-GPU
case, GPUvector with txpetscgpu). Such functionality is better placed
in VecOps than out-sourced via *spptr, particularly as VecOps is
intended to accomplish exactly that.
-- 3. Proposed Modifications --
I'm proposing to drop the lines
#if defined(PETSC_HAVE_CUSP)
  PetscCUSPFlag valid_GPU_array; /* GPU data up-to-date? */
#endif
from the definition of Vec, and similarly for Mat. As *spptr seems
to be in use for other purposes in Mat, one may keep *spptr in Vec for
reasons of uniformity, yet leave it unused for accelerator purposes.
As for the handling of data, I suggest an extension of the current data
container, currently defined by
#define VECHEADER                  \
  PetscScalar *array;              \
  PetscScalar *array_allocated;    \
  PetscScalar *unplacedarray;
The first option is to use preprocessor magic to inject pointers to
accelerator handles (and appropriate use-flags) one after another
directly into VECHEADER.
However, as nested preprocessor magic is detrimental to code legibility,
I prefer the second option, which is to add a generic pointer to a
struct PetscAcceleratorData.
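For illustration, the second option could look roughly as follows (a
minimal sketch; the field name 'acceleratordata' is just a placeholder
and PetscAcceleratorData would be declared elsewhere):

#define VECHEADER                              \
  PetscScalar           *array;                \
  PetscScalar           *array_allocated;      \
  PetscScalar           *unplacedarray;        \
  PetscAcceleratorData  *acceleratordata;  /* accelerator memory handles */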
One is then free to handle all meta-information for accelerators in
PetscAcceleratorData and place suitable enabler-#defines therein. The
additional indirection from *data into PetscAcceleratorData is not
problematic for accelerators, because kernel launch overheads are on
the order of 10 microseconds anyway. Host-based execution such as
OpenMP or a thread pool does not need to access the accelerator handles
at all, as it operates on the main memory provided by *array.
The projected definition of PetscAcceleratorData will be something
similar to
struct PetscAcceleratorData {
#if defined(PETSC_HAVE_CUDA)
  PetscCUDAHandleDescriptor    *cuda_handles;
  PetscInt                     cuda_handle_num;
#endif
#if defined(PETSC_HAVE_OPENCL)
  PetscOpenCLHandleDescriptor  *opencl_handles;
  PetscInt                     opencl_handle_num;
#endif
  /* etc. */
};
Here, the PetscXYZHandleDescriptor holds
- the memory handle,
- the device ID the handles are valid for, and
- a flag indicating whether the data is valid
(cf. valid_GPU_array, but with much finer granularity).
Additional meta-information such as index ranges can be added as
needed, cf. Vec_Seq vs. Vec_MPI. Separate Petsc*HandleDescriptor types
are expected to be required because the memory handle types of the
various accelerator platforms are not guaranteed to share a common
maximum size.
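As a rough sketch (all type names and members here are tentative, with
cl_mem from CL/cl.h and CUdeviceptr from cuda.h), the descriptors could
look like:

typedef struct {
  CUdeviceptr  handle;      /* raw CUDA memory handle */
  PetscInt     device_id;   /* device the handle is valid for */
  PetscBool    valid;       /* buffer holds up-to-date data? */
} PetscCUDAHandleDescriptor;

typedef struct {
  cl_mem       handle;      /* OpenCL buffer object */
  PetscInt     device_id;   /* device the handle is valid for */
  PetscBool    valid;       /* buffer holds up-to-date data? */
} PetscOpenCLHandleDescriptor;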
At this point I have to admit that a few more implementation details
might show up, yet the proposed model is able to cover the case of
multiple accelerators from different vendors and provides fine-grained
meta-information for each buffer.
Similar modifications would be applied to Mat, where the data ultimately
needs to be mapped to linear pieces of memory again for use on
accelerators.
-- 4. Concluding remarks --
Even though the mere question of how to hold memory handles is certainly
less complex than a full unification of actual operations at runtime,
this first step needs to be done right in order to have a solid
foundation to build on. Thus, if you guys spot any weaknesses in the
proposed modifications, please let me know. I tried to align everything
such that it integrates nicely into PETSc, yet I don't know many of the
implementation details yet...
Thanks and best regards,
Karli
PS: The reverse-lookup of the vector initialization routines revealed a
remarkably sophisticated initialization system... Chapeau!