[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Sat Oct 6 20:15:34 CDT 2012
Hi again,
> Hmm, I thought that spptr is some 'special pointer' as commented in
> Mat, but not supposed to be a generic pointer to a derived class'
> datastructure (spptr is only injected with #define PETSC_HAVE_CUSP).
>
>
> Look at cuspvecimpl.h, for example.
>
> #undef __FUNCT__
> #define __FUNCT__ "VecCUSPGetArrayReadWrite"
> PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v,
> CUSPARRAY** a)
> {
> PetscErrorCode ierr;
>
> PetscFunctionBegin;
> *a = 0;
> ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
> *a = ((Vec_CUSP *)v->spptr)->GPUarray;
> PetscFunctionReturn(0);
> }
>
> Vec is following the convention from Mat where spptr points to the
> Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra
> information needed for that derived class (note that "derivation" for
> Mat is typically done at run time, after the object has been created,
> due to a call to MatGetFactor).
Yes, I looked at this code before my opening email. I doubt that
it is a good idea to *enforce* that handles are hidden somewhere behind
spptr, because the memory is then bound to a particular library. Compare
this with the CPU memory allocated by PETSc in *data: one is in
principle free to pass the raw memory pointer to external libraries and
let them operate on it. Library-specific data can still be hidden behind
*spptr if required, but I'd prefer to have the raw memory for the
representation of the vector itself available at *data.
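To sketch what I have in mind (all names below are made up and only
meant as an illustration, not a proposal for the actual struct layout):

  /* Sketch only: the raw device buffer sits next to the usual vector
     data, so it can be handed to any library operating on that memory. */
  typedef struct {
    double *cpuarray;   /* host memory, as today */
    void   *gpuarray;   /* raw device buffer, e.g. a CUdeviceptr or cl_mem */
  } Vec_DeviceData;     /* what I'd like to find behind v->data */

  /* Only truly library-specific state would remain behind v->spptr: */
  typedef struct {
    void *cusp_wrapper; /* e.g. a CUSP array object wrapping gpuarray */
  } Vec_LibSpecific;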
>
> With different devices, we could simply have a valid flag for each
> device. When someone does VecDevice1GetArrayWrite(), the flag for all
> other devices is marked invalid. When VecDevice2GetArrayRead() is
> called, the implementation copies from any valid device to device2.
> Packing all those flags as bits in a single int is perhaps convenient,
> but not necessary.
>
>
> I think that the most common way of handling GPUs will be an
> overlapping decomposition of the host array, similar to how a vector
> is distributed via MPI (locally owned, writeable, vs ghost values
> with read-only). Assigning the full vector exclusively to just one
> device is more a single-GPU scenario rather than a multi-GPU use case.
>
>
> Okay, the matrix will have to partition itself. What is the advantage of
> having a single CPU process addressing multiple GPUs? Why not use
> different MPI processes? (We can have the MPI processes sharing a node
> create a subcomm so they can decide which process is driving which device.)
Making MPI a prerequisite for multi-GPU usage would be an unnecessary
restriction, wouldn't it?
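(Just to make sure we mean the same thing: the per-device valid flags
quoted further up could be as simple as the following sketch; all names
are made up and error handling is omitted.)

  #define N_DEVICES 4

  typedef struct {
    int valid[N_DEVICES];  /* nonzero: device i holds an up-to-date copy */
  } VecDeviceFlags;

  /* GetArrayWrite on device d: device d becomes the only valid copy */
  static void flag_write(VecDeviceFlags *f, int d)
  {
    int i;
    for (i = 0; i < N_DEVICES; ++i) f->valid[i] = (i == d);
  }

  /* GetArrayRead on device d: copy from some valid device first */
  static void flag_read(VecDeviceFlags *f, int d)
  {
    if (!f->valid[d]) {
      /* ... find a valid device and copy its buffer to device d ... */
      f->valid[d] = 1;
    }
  }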
>
> How would the user decide which device they wanted computation to run
> on? (Also, Is OpenCL really the right name in an environment where there
> may be multiple devices using OpenCL?)
That probably leaves the 'memory' scope - I'd like to avoid hard-wiring
accelerator buffers to devices a priori. OpenCL, for example, defines
memory buffers per context, not per device as is the case in CUDA.
Vec_OpenCL thus refers to the runtime rather than to particular devices,
so the name should be okay...
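For illustration, in plain OpenCL (error handling omitted; 'ctx' is
assumed to be a context created over several devices):

  #include <CL/cl.h>

  /* The buffer belongs to the context, not to any single device; it can
     later be used from command queues on any device within ctx. */
  static cl_mem create_context_buffer(cl_context ctx, size_t n)
  {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(double), NULL, &err);
    return buf;
  }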
> Currently, the type indicates
> where native operations should "prefer" to compute, copying data there
> when necessary. The Vec operations have different implementations for
> CUDA and OpenCL so I don't see the problem with making them different
> derived classes. If we wanted a hybrid CUDA/OpenCL class, it would
> contain the logic for deciding where to do things followed by dispatch
> into the device-specific implementation, thus it doesn't seem like
> duplication to me.
Vec_CUDA vs. Vec_OpenCL is probably indeed okay. What I'd like to avoid
is ending up with Vec_CUSP, Vec_CUDALib2, Vec_CUDALib3, etc., which all
operate on CUDA handles, yet one cannot launch a kernel for, e.g., the
addition of a Vec_CUSP and a Vec_CUDALib2 without creating a temporary.
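To make the point concrete (a hypothetical sketch, not actual code from
either library): if every CUDA-based Vec class exposes the same raw
device pointer, one routine can combine vectors from different libraries
directly, without any temporary.

  typedef struct { double *d_array; /* + CUSP-specific members   */ } Vec_CUSP_sketch;
  typedef struct { double *d_array; /* + other library's members */ } Vec_CUDALib2_sketch;

  /* placeholder for whatever CUDA kernel performs y += alpha*x */
  void axpy_raw(double alpha, const double *d_x, double *d_y, int n);

  static void add_mixed(double alpha, const Vec_CUSP_sketch *x,
                        Vec_CUDALib2_sketch *y, int n)
  {
    axpy_raw(alpha, x->d_array, y->d_array, n);
  }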
Best regards,
Karli