[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Sat Oct 6 17:26:52 CDT 2012

   Let's see if we can lift this discussion up another level and "treat" multi-core threading more specifically in the discussion (though Karl's subject name is Unification approach for OpenMP/Threads/... he largely ignores the multi-core/multi-socket aspect). 

    Abstractly a node has  

1)  a bunch of memories (some may be "nested", as caches "standing in" for parts of larger caches, which in turn "stand in" for parts of "main memory").  In general, even without GPUs, there are multiple memory sockets (though the OS generally presents them as a single unified address space),

2) a bunch of compute "thingies". In general, even without GPUs there are multiple CPUs, and each one of those likely has "regular" floating point units plus SIMD units.

A) Shri has started coding up a runtime dispatch system for computations on multiple cores (which hides differences between PThreads and OpenMP) that (currently) assumes Vecs are stored in a single array (each thread accesses the array pointer via VecGetArray() and then "its" part of the array by an offset). (BTW: what if each of these VecGetArray() calls triggered a copy up from a GPU? Probably a mess.)  When using PThreads, Shri's model allows (to some degree) the asynchronous launching of computational tasks.

B) We have a different dispatch system for using a single GPU accelerator via CUDA that "automagically" handles copying data back and forth between memories via VecXXXGetArray(). It is synchronous on the GetArray() in that it always blocks on the GetArray() until the data is there and then moves on to the computation.

C) We are considering options for using OpenCL kernels. 

D) We have not seriously considered using both GPUs and core processors for floating-point-intensive computations at the same time, either on the "same" object computation or on completely different object computations. (Note that DOE bought this huge machine at ORNL that seems to require this.)

  Ideally we'd have a "single" high performing programming model for utilizing the resources of (1-2) regardless of details.

   Now, let's go to Karl's "Part 1: Memory", which is a good place to start.   In PETSc we basically have two data types: a Vec, which is relatively easy to abstract, and a Mat, which is not.  Let's focus just on the Vec for now because Mats are hard.

   We need to "divide up" the computation on a Vec (or several Vecs and Mats) so that the different compute "thingies" can each work on their "piece". This division of the computation is naturally associated with a "division" of the data. (The division may be only abstract, as with pthreads, or it may be concrete, as with two GPUs when "half" of the vector is copied to each GPU's memory; sorry Jed, I agree with Karl that we likely shouldn't hide this issue behind MPI.)  The "division" is non-overlapping in simple cases (like axpy()) or may require "ghosting" for sparse matrix-vector products (again, the division may only be abstract).  With multi-memory-socket multi-core we actually divide the vector data across physical memories but access it via virtual memory as if it were not divided up (for ghost points etc.).  I think the "special cases" like virtual memory make it harder for us to think about this abstractly than it should be.
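
   To make the "ghosting" idea concrete, here is a minimal sketch in plain C (not PETSc code). It assumes a 1D three-point stencil standing in for a sparse matrix-vector product; IndexRange, ghosted_range() and stencil_on_piece() are hypothetical names made up for this illustration. Each piece owns a non-overlapping range of y but must read one extra entry of x on each side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical contiguous piece of a vector: entries [start, end). */
typedef struct {
  size_t start;
  size_t end;
} IndexRange;

/* Entries of x that the owner of `owned` must be able to read for a
   three-point stencil: the owned entries plus one ghost on each side,
   clipped to the global length n. */
static IndexRange ghosted_range(IndexRange owned, size_t n)
{
  IndexRange g;

  assert(owned.start <= owned.end && owned.end <= n);
  g.start = owned.start > 0 ? owned.start - 1 : 0;
  g.end   = owned.end < n ? owned.end + 1 : n;
  return g;
}

/* y[i] = -x[i-1] + 2 x[i] - x[i+1] for the owned i only.
   xg is a local copy of x restricted to g, so global index i lives at
   xg[i - g.start]; values outside [0, n) are treated as zero. */
static void stencil_on_piece(double *y, const double *xg, IndexRange owned,
                             IndexRange g, size_t n)
{
  size_t i;

  for (i = owned.start; i < owned.end; i++) {
    double left  = i > 0     ? xg[i - 1 - g.start] : 0.0;
    double right = i + 1 < n ? xg[i + 1 - g.start] : 0.0;
    y[i] = 2.0 * xg[i - g.start] - left - right;
  }
}
```

   With shared memory the "local copy" xg could simply be a pointer into the full array (the abstract division); with a separate device memory it would be a real copy of the ghosted range (the concrete division).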

   In PETSc we use the abstract object IS to indicate parts of Vecs (see footnote).  Thus if a computation requires part of a vector, it is natural to pass into the function the Vec AND THE IS indicating the part of the Vec needed. Note that Shri's use of code such as i=trstarts[thread_id] is actually a particular type of IS (hardwired for performance).
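
   As a rough illustration of why a per-thread start offset is just a special (contiguous) index set, here is a sketch in plain C. It does not use the actual PETSc IS or Shri's trstarts array; IndexRange, thread_range() and axpy_on_range() are hypothetical names, and the even split is only one possible assignment policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a contiguous index set: entries [start, end). */
typedef struct {
  size_t start;
  size_t end;
} IndexRange;

/* Split n entries among nthreads as evenly as possible; the first
   (n % nthreads) threads receive one extra entry. */
static IndexRange thread_range(size_t n, size_t nthreads, size_t thread_id)
{
  IndexRange r;
  size_t     base, extra;

  assert(nthreads > 0 && thread_id < nthreads);
  base    = n / nthreads;
  extra   = n % nthreads;
  r.start = thread_id * base + (thread_id < extra ? thread_id : extra);
  r.end   = r.start + base + (thread_id < extra ? 1 : 0);
  return r;
}

/* y[i] += alpha * x[i], restricted to the entries named by r.
   The kernel receives the vectors AND the part it should touch. */
static void axpy_on_range(double *y, double alpha, const double *x,
                          IndexRange r)
{
  size_t i;

  for (i = r.start; i < r.end; i++) y[i] += alpha * x[i];
}
```

   Each thread would call axpy_on_range() with thread_range(n, nthreads, its_id); since the ranges do not overlap, no synchronization is needed inside the kernel.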

   So, could we use a single kernel launcher for multi-core, CUDA, OpenCL based on this principle? Then VecCUDAGetArray() type things would keep track of parts of Vecs based on IS instead of all entries in the Vec.  Similarly there would be a VecMultiCoreGetArray(). Whenever possible the VecXXXGetArray() would not require copies.    As part of this model I'd also like to separate the "moving needed data" part of the kernel from the "computation on the data" so that everything doesn't block when data is being moved around. 
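
   One way such a single launcher might look, sketched in plain C under heavy assumptions: Backend, launch(), shared_backend and staged_backend are hypothetical names, and the "staged" backend only imitates a device by copying into a scratch buffer. The point is just that the data-movement steps (acquire/release) are separate from the compute step, and that the shared-memory backend needs no copy at all. Nothing here is asynchronous; a real version would start acquire for several pieces before computing on any of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t start, end; } IndexRange;

/* Hypothetical description of how one kind of compute resource gets at
   a piece of the vector and how results become visible on the host. */
typedef struct {
  double *(*acquire)(double *host, IndexRange r, double *scratch);
  void    (*release)(double *host, IndexRange r, double *piece);
} Backend;

typedef void (*Kernel)(double *piece, size_t len, void *ctx);

/* Shared-memory backend: the piece is just an offset into the array. */
static double *shared_acquire(double *host, IndexRange r, double *scratch)
{
  (void)scratch;
  return host + r.start;
}
static void shared_release(double *host, IndexRange r, double *piece)
{
  (void)host; (void)r; (void)piece; /* nothing to copy back */
}

/* Staged backend: imitates a separate memory by copying the piece. */
static double *staged_acquire(double *host, IndexRange r, double *scratch)
{
  memcpy(scratch, host + r.start, (r.end - r.start) * sizeof(double));
  return scratch;
}
static void staged_release(double *host, IndexRange r, double *piece)
{
  memcpy(host + r.start, piece, (r.end - r.start) * sizeof(double));
}

static const Backend shared_backend = { shared_acquire, shared_release };
static const Backend staged_backend = { staged_acquire, staged_release };

/* Single launcher: move what is needed, compute, publish the result. */
static void launch(const Backend *b, double *host, IndexRange r,
                   double *scratch, Kernel k, void *ctx)
{
  double *piece;

  assert(r.start <= r.end);
  piece = b->acquire(host, r, scratch); /* data movement */
  k(piece, r.end - r.start, ctx);       /* computation   */
  b->release(host, r, piece);           /* data movement */
}

/* Example kernel: scale every entry of the piece by *(double *)ctx. */
static void scale_kernel(double *piece, size_t len, void *ctx)
{
  double alpha = *(const double *)ctx;
  size_t i;

  for (i = 0; i < len; i++) piece[i] *= alpha;
}
```

   The same scale_kernel can then be launched on one half of an array through shared_backend and on the other half through staged_backend, with the kernel itself unaware of which memory it is running against.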

   Ok, how about moving this same model up to the MPI level? We already do this with IS converted to VecScatter (for performance) for updating ghost points (for matrix-vector products, for PDE ghost points etc) (note we can hide the VecScatter inside the IS and have it created as needed). 

   Note I intend this to continue the conversation, not end it. Thoughts?

  Barry

Footnote: Except when some people forget and make other unneeded complicated constructs that reproduce the functionality of IS. 

On Oct 6, 2012, at 9:09 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Sat, Oct 6, 2012 at 8:51 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hmm, I thought that spptr is some 'special pointer' as commented in Mat, but not supposed to be a generic pointer to a derived class' datastructure (spptr is only injected with #define PETSC_HAVE_CUSP).
> 
> Look at cuspvecimpl.h, for example.
> 
> #undef __FUNCT__
> #define __FUNCT__ "VecCUSPGetArrayReadWrite"
> PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v, CUSPARRAY** a)
> {
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   *a   = 0;
>   ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
>   *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
>   PetscFunctionReturn(0);
> }
> 
> Vec is following the convention from Mat where spptr points to the Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra information needed for that derived class (note that "derivation" for Mat is typically done at run time, after the object has been created, due to a call to MatGetFactor).
>  
> 
> With different devices, we could simply have a valid flag for each
> device. When someone does VecDevice1GetArrayWrite(), the flag for all
> other devices is marked invalid. When VecDevice2GetArrayRead() is
> called, the implementation copies from any valid device to device2.
> Packing all those flags as bits in a single int is perhaps convenient,
> but not necessary.
> 
> I think that the most common way of handling GPUs will be an overlapping decomposition of the host array, similar to how a vector is distributed via MPI (locally owned, writeable, vs ghost values with read-only). Assigning the full vector exclusively to just one device is more a single-GPU scenario rather than a multi-GPU use case.
> 
> Okay, the matrix will have to partition itself. What is the advantage of having a single CPU process addressing multiple GPUs? Why not use different MPI processes? (We can have the MPI processes sharing a node create a subcomm so they can decide which process is driving which device.)
>  
>  
> I think this stuff (which allows for segmenting the array on the device)
> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
> 
> If spptr is intended to be a generic pointer to data of the derived class, then this is also a possibility. However, this would lead to Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of implementations rapidly increasing as one may eventually add other frameworks. The PetscAcceleratorData would essentially allow for a unification of Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding code duplication problems.
> 
> How would the user decide which device they wanted computation to run on? (Also, is OpenCL really the right name in an environment where there may be multiple devices using OpenCL?) Currently, the type indicates where native operations should "prefer" to compute, copying data there when necessary. The Vec operations have different implementations for CUDA and OpenCL so I don't see the problem with making them different derived classes. If we wanted a hybrid CUDA/OpenCL class, it would contain the logic for deciding where to do things followed by dispatch into the device-specific implementation, thus it doesn't seem like duplication to me.
>  
>     Here, the PetscXYZHandleDescriptor holds
>       - the memory handle,
>       - the device ID the handles are valid for, and
>       - a flag whether the data is valid
>         (cf. valid_GPU_array, but with a much finer granularity).
>     Additional metainformation such as index ranges can be extended as
>     needed, cf. Vec_Seq vs Vec_MPI. Different types of
>     Petsc*HandleDescriptors are expected to be required because the
>     various memory handle types are not guaranteed to have a particular
>     maximum size among different accelerator platforms.
> 
> 
> It sounds like you want to support marking only part of an array as
> stale. We could keep one top-level (_p_Vec) flag indicating
> whether the CPU part was current, then in the specific implementation
> (Vec_OpenCL), you can hold finer granularity. Then when
> vec->ops->UpdateCPUArray() is called, you can look at the finer
> granularity flags to copy only what needs to be copied.
> 
> 
> Yes, I also thought of such a top-level-flag. This is, however, rather an optimization flag (similar to what is done in VecGetArray for petscnative), so I refrained from a separate discussion.
> 
> Aside from that, yes, I want to support parts of an array as stale, as the best multi-GPU use I've experienced so far is for block-based preconditioners (cf. Block-ILU-variants, parallel AMG flavors, etc.). A multi-GPU sparse matrix-vector product is handled similarly.
>
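
The per-device valid-flag scheme described in the quoted text (a write on one device invalidates all others, a read copies from any valid device) could be sketched roughly as follows in plain C. This is not PETSc code: MultiVec, get_array_write() and get_array_read() are hypothetical names, the "devices" are just separate host arrays, and the flags are packed as bits in one unsigned int, which the quoted text notes is convenient but not necessary.

```c
#include <assert.h>
#include <string.h>

enum { NUM_DEVICES = 3, VEC_LEN = 4 };

/* Hypothetical vector with one copy per memory; index 0 plays the host. */
typedef struct {
  double   copy[NUM_DEVICES][VEC_LEN];
  unsigned valid; /* bit d set: copy[d] holds current data */
} MultiVec;

/* Write-only access on device dev: no copy is needed, and afterwards
   dev holds the only valid copy. */
static double *get_array_write(MultiVec *v, int dev)
{
  assert(dev >= 0 && dev < NUM_DEVICES);
  v->valid = 1u << dev;
  return v->copy[dev];
}

/* Read access on device dev: if dev is stale, copy from any valid
   device first, then mark dev valid as well. */
static const double *get_array_read(MultiVec *v, int dev)
{
  int src;

  assert(dev >= 0 && dev < NUM_DEVICES);
  assert(v->valid != 0); /* someone must have written the data */
  if (!(v->valid & (1u << dev))) {
    for (src = 0; src < NUM_DEVICES; src++) {
      if (v->valid & (1u << src)) break;
    }
    memcpy(v->copy[dev], v->copy[src], sizeof v->copy[dev]);
    v->valid |= 1u << dev;
  }
  return v->copy[dev];
}
```

Marking only part of an array as stale would replace the single bit per device with per-range flags, but the read/write protocol would stay the same.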