[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Matthew Knepley knepley at gmail.com
Sat Oct 6 21:32:24 CDT 2012


On Sat, Oct 6, 2012 at 9:15 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi again,
>
>
>
>>     Hmm, I thought that spptr is some 'special pointer' as commented in
>>     Mat, but not supposed to be a generic pointer to a derived class's
>>     data structure (spptr is only injected with #define PETSC_HAVE_CUSP).
>>
>>
>> Look at cuspvecimpl.h, for example.
>>
>> #undef __FUNCT__
>> #define __FUNCT__ "VecCUSPGetArrayReadWrite"
>> PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v, CUSPARRAY **a)
>> {
>>    PetscErrorCode ierr;
>>
>>    PetscFunctionBegin;
>>    *a   = 0;
>>    ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
>>    *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
>>    PetscFunctionReturn(0);
>> }
>>
>> Vec is following the convention from Mat where spptr points to the
>> Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra
>> information needed for that derived class (note that "derivation" for
>> Mat is typically done at run time, after the object has been created,
>> due to a call to MatGetFactor).
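>>
(For concreteness, a minimal sketch of the run-time "derivation" pattern
described above, assuming the private Mat header is included; Mat_Foo and its
field are hypothetical placeholders, not actual PETSc structs:)

typedef struct {
  void *factor_handle;    /* state owned by the external factorization package (hypothetical) */
} Mat_Foo;                /* hypothetical stand-in for Mat_UMFPACK, Mat_SuperLU, ...           */

static PetscErrorCode MatSolve_Foo(Mat A, Vec b, Vec x)
{
  Mat_Foo *foo;

  PetscFunctionBegin;
  foo = (Mat_Foo*)A->spptr;   /* extra state attached to the Mat after MatGetFactor() */
  /* ... solve with the external library via foo->factor_handle, writing into x ... */
  PetscFunctionReturn(0);
}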
>>
>
> Yes, I've looked at this code prior to my opening email. I doubt that it
> is a good idea to *enforce* that handles are hidden somewhere behind spptr,
> because the memory is then bound to a particular library. Compare this with
> the CPU memory allocated by PETSc in *data: One is in principle free to
> pass the raw memory pointer to external libraries and let them operate on
> it. There can be library-specific data hidden behind *spptr if required,
> but I'd prefer to have the raw memory for the representation of the vector
> itself available at *data.
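>
(To make the contrast concrete: a rough sketch of the layout argued for here;
Vec_MyGPU and its member are purely illustrative, not existing PETSc types:)

/* Host side today: the raw array lives behind v->data (Vec_Seq) and can be
   handed to any external library directly.                                  */

/* Device side as proposed (illustrative only):                              */
typedef struct {
  void *gpu_buffer;   /* raw device memory (CUdeviceptr, cl_mem, ...), not
                         owned by any particular library                     */
} Vec_MyGPU;

/* v->spptr would then carry only optional, library-specific wrappers around
   gpu_buffer (e.g. a cusp::array1d view), not the memory itself.            */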
>
>
>
>>         With different devices, we could simply have a valid flag for each
>>         device. When someone does VecDevice1GetArrayWrite(), the flag
>>         for all
>>         other devices is marked invalid. When VecDevice2GetArrayRead() is
>>         called, the implementation copies from any valid device to
>> device2.
>>         Packing all those flags as bits in a single int is perhaps
>>         convenient,
>>         but not necessary.
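>>
(A small sketch of the per-device validity bookkeeping just described; all
names below are hypothetical and independent of PETSc's actual flag handling:)

#include <petscsys.h>

#define MAX_SPACES 4                    /* host plus up to three devices, say */

typedef struct {
  PetscBool valid[MAX_SPACES];          /* is the copy in space i up to date? */
  void     *buffer[MAX_SPACES];         /* memory in each space               */
} VecMultiSpaceCache;                   /* hypothetical                       */

/* Write access in space d: every other copy becomes stale. */
static void CacheMarkWrite(VecMultiSpaceCache *c, PetscInt d)
{
  PetscInt i;
  for (i = 0; i < MAX_SPACES; i++) c->valid[i] = (i == d) ? PETSC_TRUE : PETSC_FALSE;
}

/* Read access in space d: copy from any space that still holds valid data. */
static void CacheMarkRead(VecMultiSpaceCache *c, PetscInt d)
{
  PetscInt i;
  if (c->valid[d]) return;
  for (i = 0; i < MAX_SPACES; i++) {
    if (c->valid[i]) {
      /* device-specific copy of c->buffer[i] into c->buffer[d] would go here */
      c->valid[d] = PETSC_TRUE;
      break;
    }
  }
}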
>>
>>
>>     I think that the most common way of handling GPUs will be an
>>     overlapping decomposition of the host array, similar to how a vector
>>     is distributed via MPI (locally owned, writable values vs. read-only
>>     ghost values). Assigning the full vector exclusively to just one
>>     device is more a single-GPU scenario rather than a multi-GPU use case.
>>
>>
>> Okay, the matrix will have to partition itself. What is the advantage of
>> having a single CPU process addressing multiple GPUs? Why not use
>> different MPI processes? (We can have the MPI processes sharing a node
>> create a subcomm so they can decide which process is driving which
>> device.)
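>>
(A sketch of that node-local subcommunicator idea, using MPI-3's
MPI_Comm_split_type together with the CUDA runtime; the round-robin mapping
of local rank to device is just one possible policy:)

#include <mpi.h>
#include <cuda_runtime.h>

/* Each rank on a shared-memory node picks a GPU based on its local rank. */
int AssignDeviceToRank(MPI_Comm comm)
{
  MPI_Comm nodecomm;
  int      localrank, ndevices, device;

  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &localrank);
  cudaGetDeviceCount(&ndevices);
  device = localrank % ndevices;
  cudaSetDevice(device);                /* this process now drives 'device' */
  MPI_Comm_free(&nodecomm);
  return device;
}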
>>
>
> Making MPI a prerequisite for multi-GPU usage would be an unnecessary
> restriction, wouldn't it?


Small point: I don't believe this; in fact, I believe the opposite. There are
many equivalent ways of doing these things, and we should use the simplest and
most structured one that can accomplish our goal. We have already bought into
MPI and should never fall into the trap of trying to support another paradigm
at the expense of that simplicity.

   Matt


>
>> How would the user decide which device they wanted computation to run
>> on? (Also, is OpenCL really the right name in an environment where there
>> may be multiple devices using OpenCL?)
>>
>
> That probably leaves the 'memory' scope - I'd like to avoid hard-wiring
> accelerator buffers to devices a priori. OpenCL, for example, defines
> memory buffers per context, not per device as is the case in CUDA. So
> Vec_OpenCL refers to the runtime rather than to particular devices, and
> the name should be okay...
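>
(Karl's point about OpenCL in a nutshell; a buffer is created against a
context, which may contain several devices, so it is not bound to any one
device at allocation time:)

#include <CL/cl.h>

cl_mem CreateContextBuffer(cl_context ctx, size_t bytes)
{
  cl_int err;
  /* The buffer belongs to ctx; it is only associated with a specific device
     once a command queue on that device enqueues work touching it.          */
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  return (err == CL_SUCCESS) ? buf : NULL;
}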
>
>
>
>
>> Currently, the type indicates
>> where native operations should "prefer" to compute, copying data there
>> when necessary. The Vec operations have different implementations for
>> CUDA and OpenCL, so I don't see the problem with making them different
>> derived classes. If we wanted a hybrid CUDA/OpenCL class, it would
>> contain the logic for deciding where to do things followed by dispatch
>> into the device-specific implementation, so it doesn't seem like
>> duplication to me.
>>
>
> Vec_CUDA vs. Vec_OpenCL is probably indeed okay. What I'd like to avoid is
> ending up with Vec_CUSP, Vec_CUDALib2, Vec_CUDALib3, etc., which all
> operate on CUDA handles, yet one cannot launch kernels for, e.g., the
> addition of a Vec_CUSP and a Vec_CUDALib2 without creating a temporary.
>
> Best regards,
> Karli
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener