[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Sat Oct 6 08:51:27 CDT 2012
Hi Jed,
>
> In a purely CPU-driven execution, there is a pointer to the data
> (*data), which is assumed to reside in a single linear piece of
> memory (please correct me if I'm wrong), yet may be managed by some
> external routines (VecOps).
>
>
> I think you're familiar with this, but I'll comment anyway:
>
> All major PETSc classes use a Delegator design. In Vec, *data points to
> the implementation-specific structure. The "native" types start with
> VECHEADER, of which the first member is PetscScalar *array.
> Distinguishing "native" from other types is simply an optimization
> because some people like to call VecGetArray in inner loops. The normal
> PETSc model is to do an indirect call (vec->ops->getarray) for anything
> that may be implemented differently by different implementations. The
> "native" flag allows the full implementation to be inlined, that's all.
>
> Arguably, people should write their code to hoist the VecGetArray out of
> the inner-most loop so that the handful of cycles for the indirect call
> would not be noticeable.
Yes, I've noticed that in VecGetArray. Thanks for the explicit
confirmation that it indeed works this way. :-)
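Just to spell out the hoisting for the archive: a minimal sketch using only
the public VecGetArray()/VecRestoreArray() interface (the function name
ScaleEntries is made up for illustration):

  #include <petscvec.h>

  /* Fetch the raw array once, outside the inner loop, so the (possibly
     indirect) vec->ops->getarray call is paid only once per kernel. */
  PetscErrorCode ScaleEntries(Vec x, PetscScalar alpha)
  {
    PetscErrorCode ierr;
    PetscScalar    *a;
    PetscInt       i, n;

    PetscFunctionBegin;
    ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
    ierr = VecGetArray(x, &a);CHKERRQ(ierr);      /* hoisted out of the loop */
    for (i = 0; i < n; ++i) a[i] *= alpha;
    ierr = VecRestoreArray(x, &a);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }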
>
> As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the
> concept of a vector having one pointer to its data is undermined.
> Now, a Vec can have data in CPU RAM and on one (or, with txpetscgpu,
> several) CUDA accelerators. 'valid_GPU_array' indicates which
> of the two memory domains holds the most recent data, possibly both.
>
>
> We are just implementing lazy update. Note that spptr only points to the
> derived type's structure. The array itself is held inside that structure
> and only accessed through that type-specialized interface. (Recall that
> VECSEQCUSP inherits from VECSEQ.)
Hmm, I thought that spptr is some 'special pointer' as commented in Mat,
not a generic pointer to a derived class's data structure (spptr is only
injected with #define PETSC_HAVE_CUSP).
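For reference, my rough mental model of the current CUSP path is the
following sketch (CUSPARRAY is only an opaque placeholder here, not the
real CUSP device array type):

  typedef struct CUSPArrayPlaceholder CUSPARRAY;  /* placeholder only */

  typedef struct {
    CUSPARRAY *GPUarray;   /* device-side copy of the vector entries */
  } Vec_CUSP;

  /* vec->spptr points at a Vec_CUSP for VECSEQCUSP, while valid_GPU_array
     in _p_Vec records whether host, GPU, or both copies are current. */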
>
> With different devices, we could simply have a valid flag for each
> device. When someone does VecDevice1GetArrayWrite(), the flag for all
> other devices is marked invalid. When VecDevice2GetArrayRead() is
> called, the implementation copies from any valid device to device2.
> Packing all those flags as bits in a single int is perhaps convenient,
> but not necessary.
I think the most common way of handling GPUs will be an overlapping
decomposition of the host array, similar to how a vector is distributed
via MPI (locally owned, writable values vs. read-only ghost values).
Assigning the full vector exclusively to a single device is more of a
single-GPU scenario than a multi-GPU use case.
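(For completeness, the per-device flags you describe would amount to
bookkeeping roughly along these lines; every name below is made up for the
sketch, nothing of this exists in PETSc:

  #include <petscsys.h>

  #define SKETCH_MAX_DEVICES 4   /* arbitrary bound for the sketch */

  typedef struct {
    PetscBool host_valid;                        /* CPU copy current?   */
    PetscBool device_valid[SKETCH_MAX_DEVICES];  /* per-device validity */
  } SketchValidFlags;

  /* GetArrayWrite on device 'dev': every other copy becomes stale. */
  static void SketchMarkWrite(SketchValidFlags *f, PetscInt dev)
  {
    PetscInt d;
    for (d = 0; d < SKETCH_MAX_DEVICES; ++d)
      f->device_valid[d] = (d == dev) ? PETSC_TRUE : PETSC_FALSE;
    f->host_valid = PETSC_FALSE;
  }

  /* GetArrayRead on device 'dev': copy from any valid location first. */
  static void SketchMarkRead(SketchValidFlags *f, PetscInt dev)
  {
    if (!f->device_valid[dev]) {
      /* ...copy from the host or from any device d with device_valid[d]... */
      f->device_valid[dev] = PETSC_TRUE;
    }
  }
)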
>
> (...)
> The projected definition of PetscAcceleratorData will be something
> similar to
> struct PetscAcceleratorData {
> #if defined(PETSC_HAVE_CUDA)
>   PetscCUDAHandleDescriptor    *cuda_handles;
>   PetscInt                      cuda_handle_num;
> #endif
> #if defined(PETSC_HAVE_OPENCL)
>   PetscOpenCLHandleDescriptor  *opencl_handles;
>   PetscInt                      opencl_handle_num;
> #endif
>   /* etc. */
> };
>
>
> I think this stuff (which allows for segmenting the array on the device)
> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
If spptr is intended to be a generic pointer to data of the derived
class, then this is also a possibility. However, this would lead to
Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of
implementations growing rapidly as other frameworks are eventually
added. The PetscAcceleratorData would essentially allow for a
unification of Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding such
code duplication.
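A routine behind the unified spptr would then handle whatever back ends
are compiled in through a single code path; a sketch (the helper name is
made up, and it assumes the PetscAcceleratorData definition quoted above):

  /* One derived struct regardless of which back ends are enabled,
     instead of Vec_CUDA / Vec_OpenCL / Vec_CUDA_OpenCL variants. */
  static PetscInt AcceleratorHandleCount(const PetscAcceleratorData *ad)
  {
    PetscInt n = 0;
  #if defined(PETSC_HAVE_CUDA)
    n += ad->cuda_handle_num;
  #endif
  #if defined(PETSC_HAVE_OPENCL)
    n += ad->opencl_handle_num;
  #endif
    return n;
  }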
>
> Here, the PetscXYZHandleDescriptor holds
> - the memory handle,
> - the device ID the handle is valid for, and
> - a flag whether the data is valid
> (cf. valid_GPU_array, but with a much finer granularity).
> Additional meta-information such as index ranges can be added as
> needed, cf. Vec_Seq vs. Vec_MPI. Different Petsc*HandleDescriptor
> types are expected to be required because the various memory handle
> types are not guaranteed to share a common maximum size across
> accelerator platforms.
>
>
> It sounds like you want to support marking only part of an array as
> stale. We could keep one top-level (_p_Vec) flag indicating
> whether the CPU part was current, then in the specific implementation
> (Vec_OpenCL), you can hold finer granularity. Then when
> vec->ops->UpdateCPUArray() is called, you can look at the finer
> granularity flags to copy only what needs to be copied.
>
Yes, I also thought of such a top-level flag. This is, however, rather
an optimization (similar to what is done in VecGetArray for
petscnative), so I refrained from a separate discussion.
Aside from that, yes, I want to support marking parts of an array as
stale, as the best multi-GPU use I've seen so far is for block-based
preconditioners (cf. Block-ILU variants, parallel AMG flavors, etc.). A
multi-GPU sparse matrix-vector product is handled similarly.
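To make the descriptor fields concrete for this block-based setting, they
could look roughly as follows; these are sketches, not actual PETSc types,
and all field names are illustrative (the index range is what lets a
single block be marked stale):

  #include <petscsys.h>

  #if defined(PETSC_HAVE_CUDA)
  typedef struct {
    void      *cuda_mem;    /* raw device pointer, e.g. from cudaMalloc()     */
    PetscInt   device_id;   /* which CUDA device the handle is valid on       */
    PetscBool  valid;       /* cf. valid_GPU_array, but per handle            */
    PetscInt   start, end;  /* host-array index range mirrored by this handle */
  } PetscCUDAHandleDescriptor;
  #endif

  #if defined(PETSC_HAVE_OPENCL)
  #include <CL/cl.h>
  typedef struct {
    cl_mem     opencl_mem;  /* OpenCL buffer: a different (opaque) handle type */
    PetscInt   device_id;
    PetscBool  valid;
    PetscInt   start, end;
  } PetscOpenCLHandleDescriptor;
  #endif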
Thanks for the input and best regards,
Karli