[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Sat Oct 6 08:51:27 CDT 2012
Hi Jed,
>
> In a purely CPU-driven execution, there is a pointer to the data
> (*data), which is assumed to reside in a single linear piece of
> memory (please correct me if I'm wrong), yet may be managed by some
> external routines (VecOps).
>
>
> I think you're familiar with this, but I'll comment anyway:
>
> All major PETSc classes use a Delegator design. In Vec, *data points to
> the implementation-specific structure. The "native" types start with
> VECHEADER, of which the first member is PetscScalar *array.
> Distinguishing "native" from other types is simply an optimization
> because some people like to call VecGetArray in inner loops. The normal
> PETSc model is to do an indirect call (vec->ops->getarray) for anything
> that may be implemented differently by different implementations. The
> "native" flag allows the full implementation to be inlined, that's all.
>
> Arguably, people should write their code to hoist the VecGetArray out of
> the inner-most loop so that the handful of cycles for the indirect call
> would not be noticeable.
Yes, I've noticed that in VecGetArray. Thanks for the explicit
confirmation that it indeed works this way. :-)
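Just to spell out the hoisting for the archive: a minimal sketch using only
the public VecGetArray()/VecRestoreArray() interface (the function name
ScaleEntries is made up for illustration):

  #include <petscvec.h>

  /* Fetch the raw array once, outside the inner loop, so the (possibly
     indirect) vec->ops->getarray call is paid only once per kernel. */
  PetscErrorCode ScaleEntries(Vec x, PetscScalar alpha)
  {
    PetscErrorCode ierr;
    PetscScalar    *a;
    PetscInt       i, n;

    PetscFunctionBegin;
    ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
    ierr = VecGetArray(x, &a);CHKERRQ(ierr);      /* hoisted out of the loop */
    for (i = 0; i < n; ++i) a[i] *= alpha;
    ierr = VecRestoreArray(x, &a);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }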
>
> As accelerators enter the game (indicated by PETSC_HAVE_CUSP), the
> concept of a vector having one pointer to its data is undermined.
> Now, a Vec can have data in CPU RAM and on one (or, with txpetscgpu,
> several) CUDA accelerators. 'valid_GPU_array' indicates which
> of the two memory domains holds the most recent data, possibly both.
>
>
> We are just implementing lazy update. Note that spptr only points to the
> derived type's structure. The array itself is held inside that structure
> and only accessed through that type-specialized interface. (Recall that
> VECSEQCUSP inherits from VECSEQ.)
Hmm, I thought that spptr is some 'special pointer' as commented in Mat,
not a generic pointer to a derived class's data structure (spptr is only
injected with #define PETSC_HAVE_CUSP).
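For reference, my rough mental model of the current CUSP path is the
following sketch (CUSPARRAY is only an opaque placeholder here, not the
real CUSP device array type):

  typedef struct CUSPArrayPlaceholder CUSPARRAY;  /* placeholder only */

  typedef struct {
    CUSPARRAY *GPUarray;   /* device-side copy of the vector entries */
  } Vec_CUSP;

  /* vec->spptr points at a Vec_CUSP for VECSEQCUSP, while valid_GPU_array
     in _p_Vec records whether host, GPU, or both copies are current. */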
>
> With different devices, we could simply have a valid flag for each
> device. When someone does VecDevice1GetArrayWrite(), the flag for all
> other devices is marked invalid. When VecDevice2GetArrayRead() is
> called, the implementation copies from any valid device to device2.
> Packing all those flags as bits in a single int is perhaps convenient,
> but not necessary.
I think the most common way of handling GPUs will be an overlapping
decomposition of the host array, similar to how a vector is distributed
via MPI (locally owned, writable values vs. read-only ghost values).
Assigning the full vector exclusively to a single device is more of a
single-GPU scenario than a multi-GPU use case.
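(For completeness, the per-device flags you describe would amount to
bookkeeping roughly along these lines; every name below is made up for the
sketch, nothing of this exists in PETSc:

  #include <petscsys.h>

  #define SKETCH_MAX_DEVICES 4   /* arbitrary bound for the sketch */

  typedef struct {
    PetscBool host_valid;                        /* CPU copy current?   */
    PetscBool device_valid[SKETCH_MAX_DEVICES];  /* per-device validity */
  } SketchValidFlags;

  /* GetArrayWrite on device 'dev': every other copy becomes stale. */
  static void SketchMarkWrite(SketchValidFlags *f, PetscInt dev)
  {
    PetscInt d;
    for (d = 0; d < SKETCH_MAX_DEVICES; ++d)
      f->device_valid[d] = (d == dev) ? PETSC_TRUE : PETSC_FALSE;
    f->host_valid = PETSC_FALSE;
  }

  /* GetArrayRead on device 'dev': copy from any valid location first. */
  static void SketchMarkRead(SketchValidFlags *f, PetscInt dev)
  {
    if (!f->device_valid[dev]) {
      /* ...copy from the host or from any device d with device_valid[d]... */
      f->device_valid[dev] = PETSC_TRUE;
    }
  }
)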
>
> (...)
> The projected definition of PetscAcceleratorData will be something
> similar to
> struct PetscAcceleratorData {
> #if defined(PETSC_HAVE_CUDA)
>   PetscCUDAHandleDescriptor    *cuda_handles;
>   PetscInt                      cuda_handle_num;
> #endif
> #if defined(PETSC_HAVE_OPENCL)
>   PetscOpenCLHandleDescriptor  *opencl_handles;
>   PetscInt                      opencl_handle_num;
> #endif
>   /* etc. */
> };
>
>
> I think this stuff (which allows for segmenting the array on the device)
> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
If spptr is intended to be a generic pointer to data of the derived
class, then this is also a possibility. However, this would lead to
Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of
implementations growing rapidly as other frameworks are eventually
added. The PetscAcceleratorData would essentially allow for a
unification of Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding such
code duplication.
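A routine behind the unified spptr would then handle whatever back ends
are compiled in through a single code path; a sketch (the helper name is
made up, and it assumes the PetscAcceleratorData definition quoted above):

  /* One derived struct regardless of which back ends are enabled,
     instead of Vec_CUDA / Vec_OpenCL / Vec_CUDA_OpenCL variants. */
  static PetscInt AcceleratorHandleCount(const PetscAcceleratorData *ad)
  {
    PetscInt n = 0;
  #if defined(PETSC_HAVE_CUDA)
    n += ad->cuda_handle_num;
  #endif
  #if defined(PETSC_HAVE_OPENCL)
    n += ad->opencl_handle_num;
  #endif
    return n;
  }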
>
> Here, the PetscXYZHandleDescriptor holds
> - the memory handle,
> - the device ID the handle is valid for, and
> - a flag whether the data is valid
> (cf. valid_GPU_array, but with a much finer granularity).
> Additional meta-information such as index ranges can be added as
> needed, cf. Vec_Seq vs. Vec_MPI. Different Petsc*HandleDescriptor
> types are expected to be required because the various memory handle
> types are not guaranteed to share a common maximum size across
> accelerator platforms.
>
>
> It sounds like you want to support marking only part of an array as
> stale. We could keep one top-level (_p_Vec) flag indicating
> whether the CPU part was current, then in the specific implementation
> (Vec_OpenCL), you can hold finer granularity. Then when
> vec->ops->UpdateCPUArray() is called, you can look at the finer
> granularity flags to copy only what needs to be copied.
>
Yes, I also thought of such a top-level flag. This is, however, rather
an optimization (similar to what is done in VecGetArray for
petscnative), so I refrained from a separate discussion.
Aside from that, yes, I want to support marking parts of an array as
stale, as the best multi-GPU use I've seen so far is for block-based
preconditioners (cf. Block-ILU variants, parallel AMG flavors, etc.). A
multi-GPU sparse matrix-vector product is handled similarly.
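To make the descriptor fields concrete for this block-based setting, they
could look roughly as follows; these are sketches, not actual PETSc types,
and all field names are illustrative (the index range is what lets a
single block be marked stale):

  #include <petscsys.h>

  #if defined(PETSC_HAVE_CUDA)
  typedef struct {
    void      *cuda_mem;    /* raw device pointer, e.g. from cudaMalloc()     */
    PetscInt   device_id;   /* which CUDA device the handle is valid on       */
    PetscBool  valid;       /* cf. valid_GPU_array, but per handle            */
    PetscInt   start, end;  /* host-array index range mirrored by this handle */
  } PetscCUDAHandleDescriptor;
  #endif

  #if defined(PETSC_HAVE_OPENCL)
  #include <CL/cl.h>
  typedef struct {
    cl_mem     opencl_mem;  /* OpenCL buffer: a different (opaque) handle type */
    PetscInt   device_id;
    PetscBool  valid;
    PetscInt   start, end;
  } PetscOpenCLHandleDescriptor;
  #endif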
Thanks for the input and best regards,
Karli