[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Karl Rupp rupp at mcs.anl.gov
Sat Oct 6 20:15:34 CDT 2012


Hi again,


>     Hmm, I thought that spptr is some 'special pointer' as commented in
>     Mat, but not supposed to be a generic pointer to a derived class'
>     datastructure (spptr is only injected with #define PETSC_HAVE_CUSP).
>
>
> Look at cuspvecimpl.h, for example.
>
> #undef __FUNCT__
> #define __FUNCT__ "VecCUSPGetArrayReadWrite"
> PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v,
> CUSPARRAY** a)
> {
>    PetscErrorCode ierr;
>
>    PetscFunctionBegin;
>    *a   = 0;
>    ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
>    *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
>    PetscFunctionReturn(0);
> }
>
> Vec is following the convention from Mat where spptr points to the
> Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra
> information needed for that derived class (note that "derivation" for
> Mat is typically done at run time, after the object has been created,
> due to a call to MatGetFactor).

Yes, I had looked at this code prior to my opening email. I doubt that 
it is a good idea to *enforce* that handles are hidden somewhere behind 
spptr, because the memory is then bound to a particular library. Compare 
this with the CPU memory PETSc allocates at *data: one is in principle 
free to pass that raw pointer to external libraries and let them operate 
on it. Library-specific data can still be hidden behind spptr if 
required, but I'd prefer to have the raw memory for the representation 
of the vector itself available at *data.
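
To sketch what I have in mind (a rough illustration only - the struct 
and accessor below are hypothetical, not existing PETSc API):

/* hypothetical generic Vec data layout */
typedef struct {
  PetscScalar *host;     /* raw host memory, as currently held at *data  */
  void        *device;   /* raw device buffer (CUdeviceptr, cl_mem, ...) */
  PetscInt     valid;    /* bookkeeping: which copies are up to date     */
} Vec_Generic;

PETSC_STATIC_INLINE PetscErrorCode VecGetDeviceArray_Generic(Vec v, void **a)
{
  PetscFunctionBegin;
  *a = ((Vec_Generic*)v->data)->device;  /* raw handle, usable by any library */
  PetscFunctionReturn(0);
}

Library-specific wrappers (e.g. a cusp::array1d view of the buffer) 
could still be cached behind spptr, but they would merely wrap this raw 
handle rather than hold the only reference to it.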

>
>         With different devices, we could simply have a valid flag for each
>         device. When someone does VecDevice1GetArrayWrite(), the flag
>         for all
>         other devices is marked invalid. When VecDevice2GetArrayRead() is
>         called, the implementation copies from any valid device to device2.
>         Packing all those flags as bits in a single int is perhaps
>         convenient,
>         but not necessary.
>
>
>     I think that the most common way of handling GPUs will be an
>     overlapping decomposition of the host array, similar to how a vector
>     is distributed via MPI (locally owned, writeable, vs ghost values
>     with read-only). Assigning the full vector exclusively to just one
>     device is more a single-GPU scenario rather than a multi-GPU use case.
>
>
> Okay, the matrix will have to partition itself. What is the advantage of
> having a single CPU process addressing multiple GPUs? Why not use
> different MPI processes? (We can have the MPI processes sharing a node
> create a subcomm so they can decide which process is driving which device.)

Making MPI a prerequisite for multi-GPU usage would be an unnecessary 
restriction, wouldn't it?


>
> How would the user decide which device they wanted computation to run
> on? (Also, Is OpenCL really the right name in an environment where there
> may be multiple devices using OpenCL?)

That probably goes beyond the 'memory' scope - I'd like to avoid 
hard-wiring accelerator buffers to particular devices a priori. OpenCL, 
for example, defines memory buffers per context, not per device as is 
the case in CUDA. Vec_OpenCL thus refers to the runtime rather than to 
particular devices, so the name should be okay...
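
To make that concrete, a plain OpenCL snippet (error checking omitted; 
'context', 'devices', 'host' and 'n' are assumed to be set up already):

  cl_int err;
  cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                              n * sizeof(double), NULL, &err);

  /* the same cl_mem can be used from any device in 'context': */
  cl_command_queue q0 = clCreateCommandQueue(context, devices[0], 0, &err);
  cl_command_queue q1 = clCreateCommandQueue(context, devices[1], 0, &err);
  clEnqueueWriteBuffer(q0, buf, CL_TRUE, 0, n * sizeof(double),
                       host, 0, NULL, NULL);
  /* ... a kernel using 'buf' could just as well be enqueued on q1 ... */

The association of a buffer with a particular device is thus made only 
when work is enqueued, not at allocation time.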



> Currently, the type indicates
> where native operations should "prefer" to compute, copying data there
> when necessary. The Vec operations have different implementations for
> CUDA and OpenCL so I don't see the problem with making them different
> derived classes. If we wanted a hybrid CUDA/OpenCL class, it would
> contain the logic for deciding where to do things followed by dispatch
> into the device-specific implementation, thus it doesn't seem like
> duplication to me.

Vec_CUDA vs. Vec_OpenCL is probably indeed okay. What I'd like to avoid 
is ending up with Vec_CUSP, Vec_CUDALib2, Vec_CUDALib3, etc., which all 
operate on CUDA handles, yet one cannot launch a kernel for, e.g., the 
addition of a Vec_CUSP and a Vec_CUDALib2 without creating a temporary.
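
To illustrate (VecCUDAGetRawArray() is hypothetical, not an existing 
interface, and I'm assuming PetscScalar == double): if every CUDA-based 
class exposed its raw device pointer through a common accessor, 
something like

  PetscScalar *x_d, *y_d;
  ierr = VecCUDAGetRawArray(x, &x_d);CHKERRQ(ierr);  /* x is a Vec_CUSP     */
  ierr = VecCUDAGetRawArray(y, &y_d);CHKERRQ(ierr);  /* y is a Vec_CUDALib2 */
  cublasDaxpy(n, alpha, x_d, 1, y_d, 1);             /* y += alpha*x        */

would work regardless of which library manages the two buffers, without 
any intermediate copies.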

Best regards,
Karli



