[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Sat Oct 6 09:09:07 CDT 2012

On Sat, Oct 6, 2012 at 8:51 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hmm, I thought that spptr is some 'special pointer' as commented in Mat,
> but not supposed to be a generic pointer to a derived class' datastructure
> (spptr is only injected with #define PETSC_HAVE_CUSP).
>

Look at cuspvecimpl.h, for example.

#undef __FUNCT__
#define __FUNCT__ "VecCUSPGetArrayReadWrite"
PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v, CUSPARRAY** a)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  *a   = 0;
  ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
  *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
  PetscFunctionReturn(0);
}

Vec is following the convention from Mat where spptr points to the
Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra
information needed for that derived class (note that "derivation" for Mat
is typically done at run time, after the object has been created, due to a
call to MatGetFactor).
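
To make the spptr convention concrete, here is a minimal sketch in plain C. It is not PETSc code: DemoVec, DemoVecGPU, and the DemoVec* functions are hypothetical stand-ins, and the "device" array is ordinary host memory. It only illustrates the pattern of a base object carrying an opaque pointer that a derived implementation attaches at run time and later casts back.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real PETSc types. */
typedef struct {
  int     n;      /* local length */
  double *array;  /* host data (unused in this sketch) */
  void   *spptr;  /* implementation-specific data, opaque to the base class */
} DemoVec;

typedef struct {
  double *device_array; /* stand-in for a GPU-side handle (host memory here) */
} DemoVecGPU;

/* "Derive" at run time: attach the GPU-specific struct to an existing object. */
static int DemoVecConvertToGPU(DemoVec *v)
{
  DemoVecGPU *gpu;

  assert(v && !v->spptr);
  gpu = malloc(sizeof(*gpu));
  if (!gpu) return 1;
  gpu->device_array = calloc((size_t)v->n, sizeof(double));
  if (!gpu->device_array) { free(gpu); return 1; }
  v->spptr = gpu;
  return 0;
}

/* Accessor mirroring the cast through spptr in VecCUSPGetArrayReadWrite. */
static int DemoVecGPUGetArray(DemoVec *v, double **a)
{
  assert(v && a);
  *a = NULL;
  if (!v->spptr) return 1; /* object has not been converted yet */
  *a = ((DemoVecGPU *)v->spptr)->device_array;
  return 0;
}

/* Release the derived-class data and return the object to its base state. */
static void DemoVecDestroyGPU(DemoVec *v)
{
  assert(v);
  if (v->spptr) {
    free(((DemoVecGPU *)v->spptr)->device_array);
    free(v->spptr);
    v->spptr = NULL;
  }
}
```

The base struct never needs to know the layout behind spptr; only the functions belonging to the derived implementation cast it.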

>
>> With different devices, we could simply have a valid flag for each
>> device. When someone does VecDevice1GetArrayWrite(), the flag for all
>> other devices is marked invalid. When VecDevice2GetArrayRead() is
>> called, the implementation copies from any valid device to device2.
>> Packing all those flags as bits in a single int is perhaps convenient,
>> but not necessary.
>>
>
> I think that the most common way of handling GPUs will be an overlapping
> decomposition of the host array, similar to how a vector is distributed via
> MPI (locally owned, writeable, vs ghost values with read-only). Assigning
> the full vector exclusively to just one device is more of a single-GPU
> scenario than a multi-GPU use case.
>

Okay, the matrix will have to partition itself. What is the advantage of
having a single CPU process addressing multiple GPUs? Why not use different
MPI processes? (We can have the MPI processes sharing a node create a
subcomm so they can decide which process is driving which device.)

>> I think this stuff (which allows for segmenting the array on the device)
>> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
>> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
>>
>
> If spptr is intended to be a generic pointer to data of the derived class,
> then this is also a possibility. However, this would lead to
> Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of
> implementations rapidly increasing as one may eventually add other
> frameworks. The PetscAcceleratorData would essentially allow for a
> unification of Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding code
> duplication problems.
>

How would the user decide which device they wanted computation to run on?
(Also, is OpenCL really the right name in an environment where there may be
multiple devices using OpenCL?) Currently, the type indicates where native
operations should "prefer" to compute, copying data there when necessary.
The Vec operations have different implementations for CUDA and OpenCL so I
don't see the problem with making them different derived classes. If we
wanted a hybrid CUDA/OpenCL class, it would contain the logic for deciding
where to do things followed by dispatch into the device-specific
implementation, thus it doesn't seem like duplication to me.
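
A rough sketch of what I mean by a hybrid class that only decides and dispatches, in plain C. Everything here is hypothetical: the Demo* names are invented for illustration, both "backends" run on the host, and the size threshold is just a placeholder policy. The point is that the hybrid layer holds no numerical code of its own.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_DEVICE_CUDA = 0, DEMO_DEVICE_OPENCL = 1 } DemoDevice;

typedef void (*DemoScaleFn)(double *x, size_t n, double alpha);

/* Records which backend ran last; exists only so the sketch is observable. */
static int demo_last_backend = -1;

/* Stand-in for the CUDA implementation (runs on the host here). */
static void DemoScale_CUDA(double *x, size_t n, double alpha)
{
  size_t i;
  for (i = 0; i < n; i++) x[i] *= alpha;
  demo_last_backend = DEMO_DEVICE_CUDA;
}

/* Stand-in for the OpenCL implementation (runs on the host here). */
static void DemoScale_OpenCL(double *x, size_t n, double alpha)
{
  size_t i;
  for (i = 0; i < n; i++) x[i] *= alpha;
  demo_last_backend = DEMO_DEVICE_OPENCL;
}

/* The hybrid class: a table of device-specific ops plus a placement policy. */
typedef struct {
  DemoScaleFn scale[2]; /* indexed by DemoDevice */
  size_t      threshold; /* hypothetical policy knob */
} DemoHybridVecOps;

/* Placeholder policy: large vectors go to CUDA, small ones to OpenCL. */
static DemoDevice DemoHybridChoose(const DemoHybridVecOps *ops, size_t n)
{
  return n >= ops->threshold ? DEMO_DEVICE_CUDA : DEMO_DEVICE_OPENCL;
}

/* Decide where to run, then dispatch into the device-specific implementation. */
static void DemoHybridScale(const DemoHybridVecOps *ops, double *x, size_t n, double alpha)
{
  DemoDevice d;

  assert(ops && x);
  d = DemoHybridChoose(ops, n);
  ops->scale[d](x, n, alpha);
}
```

Adding another framework would mean adding one more entry to the table and extending the policy, not rewriting the existing device classes.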

>
>>     Here, the PetscXYZHandleDescriptor holds
>>       - the memory handle,
>>       - the device ID the handles are valid for, and
>>       - a flag whether the data is valid
>>         (cf. valid_GPU_array, but with a much finer granularity).
>>     Additional metainformation such as index ranges can be extended as
>>     needed, cf. Vec_Seq vs Vec_MPI. Different types of
>>     Petsc*HandleDescriptors are expected to be required because the
>>     various memory handle types are not guaranteed to have a particular
>>     maximum size among different accelerator platforms.
>>
>>
>> It sounds like you want to support marking only part of an array as
>> stale. We could keep one top-level (_p_Vec) flag indicating
>> whether the CPU part was current, then in the specific implementation
>> (Vec_OpenCL), you can hold finer granularity. Then when
>> vec->ops->UpdateCPUArray() is called, you can look at the finer
>> granularity flags to copy only what needs to be copied.
>>
>>
> Yes, I also thought of such a top-level-flag. This is, however, rather an
> optimization flag (similar to what is done in VecGetArray for petscnative),
> so I refrained from a separate discussion.
>
> Aside from that, yes, I want to support marking parts of an array as stale, as the
> best multi-GPU use I've experienced so far is for block-based
> preconditioners (cf. Block-ILU-variants, parallel AMG flavors, etc.). A
> multi-GPU sparse matrix-vector product is handled similarly.
>
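
For reference, the per-device valid flag scheme quoted earlier (a write on one device invalidates all others, a read copies from any valid location) could look roughly like the sketch below. It is plain C with hypothetical names: DemoMultiVec and the DemoGetArray* functions are not PETSc API, slot 0 plays the host, and every "device" buffer is ordinary host memory. It tracks validity for the whole array only; per-segment staleness would replace each bit with a set of per-range flags.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NDEV 3 /* slot 0 plays the host, slots 1 and 2 play devices */
#define DEMO_LEN  4

typedef struct {
  double   data[DEMO_NDEV][DEMO_LEN]; /* one buffer per location (all host memory here) */
  unsigned valid;                     /* bit d set => data[d] holds current values */
} DemoMultiVec;

/* Write access on location d: every other copy becomes stale. */
static double *DemoGetArrayWrite(DemoMultiVec *v, int d)
{
  assert(v && d >= 0 && d < DEMO_NDEV);
  v->valid = 1u << d;
  return v->data[d];
}

/* Read access on location d: if stale, copy from any location that is valid. */
static const double *DemoGetArrayRead(DemoMultiVec *v, int d)
{
  int src;

  assert(v && d >= 0 && d < DEMO_NDEV);
  if (!(v->valid & (1u << d))) {
    for (src = 0; src < DEMO_NDEV; src++) {
      if (v->valid & (1u << src)) break;
    }
    if (src == DEMO_NDEV) return NULL; /* no valid copy exists anywhere */
    memcpy(v->data[d], v->data[src], sizeof v->data[d]);
    v->valid |= 1u << d; /* source and destination are now both current */
  }
  return v->data[d];
}
```

Packing the flags into one unsigned is only a convenience here; an array of per-location flags would behave the same.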