[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Sat Oct 6 21:52:57 CDT 2012

On Oct 6, 2012, at 9:16 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi Barry,
> 
>>    Let's see if we can lift this discussion up another level and "treat" multi-core threading more specifically in the discussion (though Karl's subject name is Unification approach for OpenMP/Threads/... he largely ignores the multi-core/multi-socket aspect).
> 
> Probably I should have called the discussion 'Part 1: Memory Handles', yet I'm fine with considering multi-core issues as well.
> 
> 
>>     Abstractly a node has
>> 
>> 1)  a bunch of memories (some may be "nested" as caches "standing in" for parts of larger caches which "stand in" for parts of "main memory". )  In general, even without GPUs there are multiple memory sockets (though generally handled by the OS as a single unified address space),
>> 
>> 2) a bunch of compute "thingies". In general, even without GPUs there are multiple CPUs, and each one of those likely has "regular" floating point units plus SIMD units.
>> 
>> 
>> A) Shri has started coding up a runtime dispatch system for computations on multiple cores (which hides differences between PThreads and OpenMP) that (currently) assumes Vecs are stored in a single array (each thread accesses the array pointer via VecGetArray() and then "its" part of the array by an offset.) (BTW: what if each of these VecGetArray() triggered a copy up from a GPU, probably a mess).  When using PThreads Shir's model allows (to some degree) the asynchronous launching of computational tasks.
>> 
> I've discussed multi-core related topics (NUMA, first-touch) a little with Shri. As the operating system performs the allocation 'automagically' (per default), a single virtually linear piece of memory pretty much performs reasonably. I didn't dare to start a discussion of handling buffers in main memory similarly to std::deque, i.e. as a collection of individually allocated pages, and try to keep track of locality informations. This, however, would completely break any XYZGetArray() code, as the function enforces a large chunk of linear memory again.
> 
> Still, one may 'fake' a std::deque by holding meta-information about the physical memory pages nevertheless, yet allocate a (virtually) linear piece of memory to keep compatibility with XYZGetArray(). This would allow some nice optimizations for the threading scheduler, as threads may operate on 'nearer' pages first.
> 
> 
>> B) We have a different dispatch system for using a single GPU accelerator via CUDA that "automagically" handles copying data back and forth from memories via VecXXXGetArray(). It is synchronous on the GetArray() in  that is always blocks on the GetArray() until the data is there and then moves on to the computation.
> 
> I'm afraid that the return type of VecXXXGetArray(), i.e. a pointer to the data, is such a strong requirement that one cannot relax the blocking transfer here.
> 
> However, we could use the thread communicator, schedule an asynchronous memory transfer via a non-blocking VecXXXGetArrayRequest() returning an event object, possibly perform some other operations in the meanwhile, and finally sync to the event object at the time we actually start modifying the array.
> 
> 
>> C) We are considering options for using OpenCL kernels.
> 
> Btw: Shri's threading communicator is almost identical to the OpenCL model (with the latter having a few additional capabilities).
> 
> 
>> D) We have not seriously considering utilizing both GPUs and core processors for floating point intensive computations at the same time, either on the "same" object computation or completely different object computations. (note that DOE bought this huge machine at ORNL that seems to require this).
>> 
>>   Ideally we'd have a "single" high performing programming model for utilizing the resources of (1-2) regardless of details.
> 
> I'm pretty confident that feeding operations (including GPU operations) into the task queue of the thread communicator will give good results.

    Problem already solved http://dl.acm.org/citation.cfm?id=2145863   :-)

> 
> 
>> 
>>    Now, lets go to Karl's "Part 1: Memory" which is a good place to start.   In PETSc we basically have two data types, a Vec which is relatively easy to abstract about and a Mat which is not.  Let's focus just on the Vec now because Mat's are hard.
>> 
>>    We need to "divide up" the computation on a Vec (or several Vecs and Mats) so that the different compute "thingies" can work on their "piece", this division of the computation naturally is associated with a "division" of the data  (the division may actually be only abstract with pthreads or it may be concrete with two GPUs when "half" of the vector is copied to each GPU's memory (sorry Jed, I agree with Karl that we likely shouldn't hide this issue behind MPI)).  The "division" is non-overlapping in simple cases (like axpy()) or may require "ghosting" for  sparse matrix-vector products (again the division my only be abstract).  With multi-memory-socket multi-core we actually divide the vector data across physical memories but access it via virtual memory as not divided up for ghost points etc.  I think the "special cases" like virtual memory make it harder for us to think about this abstractly then it should be.
>> 
>>    In PETSc we use the abstract object IS to indicate parts of Vecs\footnote.  Thus if a computation requires part of a vector it is natural to pass into the function the Vec AND THE IS indicating that part of the Vec needed. Note that Shri's use of code such as i=trstarts[thread_id] is actually a particular type of IS (hardwired for performance).
> 
> A bit of a spoiler for the actual job runtime (more brainstorming than complete suggestions):
> I can imagine submitting the Vec, the IS, and the type of job to the scheduler, possibly including some hints on the type of operations to follow. One may even enforce a certain type of device here, even though this requires the scheduler to move the data in place first. In this way one can perform smaller tasks on the respective CPU core (if we keep track of affinity information), and offload larger tasks to an available accelerator if possible. (Note that this is the main reason why I don't want to hide buffers in library-specific derived classes of Vec). The scheduler can use simple heuristics on where to perform operations based on typical latencies (e.g. ~20us for a GPU kernel)
> 
> 
>>    So, could we use a single kernel launcher for multi-core, CUDA, OpenCL based on this principle? Then VecCUDAGetArray() type things would keep track of parts of Vecs based on IS instead of all entries in the Vec.  Similarly there would be a VecMultiCoreGetArray(). Whenever possible the VecXXXGetArray() would not require copies.    As part of this model I'd also like to separate the "moving needed data" part of the kernel from the "computation on the data" so that everything doesn't block when data is being moved around.
> 
> Yes, we could/can. A single kernel launcher also allows for fusing kernels, e.g. matrix-vector-product followed by an inner product of the result vector with some other vector. As outlined above, asynchronous data movement could even be the default rather than the exception, except for cases where one gives control over the data to the outside by e.g. returning a pointer to the array. In such cases one would have first wait for all operations on all data to finish.
> 
> The main concern in all that is the readiness of the user. Awareness for asynchronous operations keeps rising, yet I can imagine user code like
> 
> PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably
> data[0] = VecDot(v2, v3);                // enqueues VecDot
> PetscScalar s = data[0];                 // VecDot may not be finished!
> 
> where a pointer given away once undermines everything.
> 
> 
>>    Ok, how about moving this same model up to the MPI level? We already do this with IS converted to VecScatter (for performance) for updating ghost points (for matrix-vector products, for PDE ghost points etc) (note we can hide the VecScatter inside the IS and have it created as needed).
> 
> I'm afraid I can't contribute to the MPI discussion yet, as I don't know enough about things are handled now...
> 
> Best regards,
> Karli
> 
>