[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Matthew Knepley knepley at gmail.com
Sat Oct 6 22:01:56 CDT 2012


On Sat, Oct 6, 2012 at 10:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> On Oct 6, 2012, at 9:16 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
>
> > Hi Barry,
> >
> >>    Let's see if we can lift this discussion up another level and
> "treat" multi-core threading more specifically (though Karl's subject is
> Unification approach for OpenMP/Threads/..., he largely ignores the
> multi-core/multi-socket aspect).
> >
> > Probably I should have called the discussion 'Part 1: Memory Handles',
> yet I'm fine with considering multi-core issues as well.
> >
> >
> >>     Abstractly a node has
> >>
> >> 1)  a bunch of memories (some may be "nested", as caches "standing in"
> for parts of larger caches, which "stand in" for parts of "main memory").
>  In general, even without GPUs there are multiple memory sockets (though
> generally handled by the OS as a single unified address space),
> >>
> >> 2) a bunch of compute "thingies". In general, even without GPUs there
> are multiple CPUs, and each one of those likely has "regular" floating
> point units plus SIMD units.
> >>
> >>
> >> A) Shri has started coding up a runtime dispatch system for
> computations on multiple cores (which hides differences between PThreads
> and OpenMP) that (currently) assumes Vecs are stored in a single array
> (each thread accesses the array pointer via VecGetArray() and then "its"
> part of the array by an offset.) (BTW: what if each of these VecGetArray()
> calls triggered a copy up from a GPU? Probably a mess.)  When using PThreads
> Shri's model allows (to some degree) the asynchronous launching of
> computational tasks.
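> >>
> >> For illustration, the per-thread access pattern is roughly the following
> >> (a sketch only; the kernel and the names other than VecGetArray() and
> >> trstarts are hypothetical):
> >>
> >>   PetscScalar *x;
> >>   PetscInt    start, end, i;
> >>   VecGetArray(xvec, &x);            /* every thread sees the same pointer */
> >>   start = trstarts[thread_id];      /* this thread's first entry          */
> >>   end   = trstarts[thread_id+1];    /* one past this thread's last entry  */
> >>   for (i = start; i < end; i++) x[i] *= alpha;   /* e.g. a scale kernel   */
> >>   VecRestoreArray(xvec, &x);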
> >>
> > I've discussed multi-core related topics (NUMA, first-touch) a little
> with Shri. As the operating system performs the allocation 'automagically'
> (by default), a single virtually linear piece of memory performs
> reasonably well. I didn't dare to start a discussion of handling
> buffers in main memory similarly to std::deque, i.e. as a collection of
> individually allocated pages with locality information tracked per page.
> This, however, would completely break any XYZGetArray() code,
> as that function again enforces one large chunk of linear memory.
> >
> > Still, one may 'fake' a std::deque by holding meta-information about the
> physical memory pages, yet allocating a (virtually) linear piece of memory
> to keep compatibility with XYZGetArray(). This would allow some nice
> optimizations for the threading scheduler, as threads may operate on
> 'nearer' pages first.
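> >
> > A rough sketch of what such meta-information could look like (purely
> > hypothetical, names invented for illustration):
> >
> >   typedef struct {
> >     PetscScalar *data;      /* one virtually linear allocation, so XYZGetArray() still works */
> >     PetscInt     npages;    /* number of OS pages backing the buffer                         */
> >     PetscInt    *numa_node; /* numa_node[p]: socket that first-touched page p                */
> >   } VecPageInfo;
> >
> > The scheduler could then prefer handing page p to a thread pinned on
> > numa_node[p].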
> >
> >
> >> B) We have a different dispatch system for using a single GPU
> accelerator via CUDA that "automagically" handles copying data back and
> forth from memories via VecXXXGetArray(). It is synchronous in that it
> always blocks on the GetArray() until the data is there and then moves on
> to the computation.
> >
> > I'm afraid that the return type of VecXXXGetArray(), i.e. a pointer to
> the data, is such a strong requirement that one cannot relax the blocking
> transfer here.
> >
> > However, we could use the thread communicator, schedule an asynchronous
> memory transfer via a non-blocking VecXXXGetArrayRequest() returning an
> event object, possibly perform some other operations in the meanwhile, and
> finally sync to the event object at the time we actually start modifying
> the array.
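> >
> > In code this could look roughly like the following (an entirely
> > hypothetical interface, just to illustrate the flow):
> >
> >   PetscScalar *data;
> >   VecEvent     ev;
> >   VecXXXGetArrayRequest(v, &data, &ev);  /* enqueue the transfer, return immediately */
> >   /* ... other operations that do not touch data[] ... */
> >   VecEventWait(&ev);                     /* sync: from here on data[] is valid       */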
> >
> >
> >> C) We are considering options for using OpenCL kernels.
> >
> > Btw: Shri's threading communicator is almost identical to the OpenCL
> model (with the latter having a few additional capabilities).
> >
> >
> >> D) We have not seriously considered utilizing both GPUs and core
> processors for floating point intensive computations at the same time,
> either on the "same" object computation or completely different object
> computations. (Note that DOE bought this huge machine at ORNL that seems to
> require this.)
> >>
> >>   Ideally we'd have a "single" high performing programming model for
> utilizing the resources of (1-2) regardless of details.
> >
> > I'm pretty confident that feeding operations (including GPU operations)
> into the task queue of the thread communicator will give good results.
>
>     Problem already solved http://dl.acm.org/citation.cfm?id=2145863   :-)


Great, already cracked open my growler.

For me, a lot of this discussion is somewhat premature. When programming
these things, the hard part is
always associating the right data with the kernel. This is where all the
crufty code, packing, peeking into the
queue and so forth happens. All the current articles (including Carter's
completely vacuous work queue stuff)
assume that this is easy and go from there. This is completely backwards,
and I think trying to design the
header first is also backwards.

I like Barry's brutal reduction of the interface to Vec+IS, which is good.
However, it does ignore that some poor
sap has to make that IS, and that is where these other constructs (like
PetscSection) come in. I would like to
hear (or see) some examples that show how we are going to control this
complexity. For instance, in my
experience, writing kernels for the fields you are solving for is easy (I
have lots of stuff for this already in
PETSc). However, where the ship founders is in writing kernels that need
auxiliary data, which is always the
case for real problems. Take as an example variable-viscosity Stokes where
the viscosity field is input.
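
Concretely, the pointwise kernel ends up needing both the solution and the
auxiliary field, something like (names invented for the sake of argument):

  /* residual at one quadrature point: u carries velocity and pressure,
     mu is the auxiliary viscosity evaluated at the same point */
  void stokes_residual(const PetscScalar u[], const PetscScalar gradU[],
                       const PetscScalar mu[], PetscScalar f[]);

and the crufty part is gathering mu next to u for every cell, on whatever
device the kernel runs on.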

   Matt


> >
> >
> >>
> >>    Now, lets go to Karl's "Part 1: Memory" which is a good place to
> start.   In PETSc we basically have two data types, a Vec which is
> relatively easy to abstract about and a Mat which is not.  Let's focus just
> on the Vec now because Mat's are hard.
> >>
> >>    We need to "divide up" the computation on a Vec (or several Vecs and
> Mats) so that the different compute "thingies" can work on their "piece";
> this division of the computation is naturally associated with a "division"
> of the data  (the division may actually be only abstract with pthreads or
> it may be concrete with two GPUs when "half" of the vector is copied to
> each GPU's memory (sorry Jed, I agree with Karl that we likely shouldn't
> hide this issue behind MPI)).  The "division" is non-overlapping in simple
> cases (like axpy()) or may require "ghosting" for sparse matrix-vector
> products (again the division may only be abstract).  With
> multi-memory-socket multi-core we actually divide the vector data across
> physical memories but access it via virtual memory as not divided up for
> ghost points etc.  I think the "special cases" like virtual memory make it
> harder for us to think about this abstractly than it should be.
> >>
> >>    In PETSc we use the abstract object IS to indicate parts of
> Vecs\footnote.  Thus if a computation requires part of a vector it is
> natural to pass into the function the Vec AND THE IS indicating that part
> of the Vec needed. Note that Shri's use of code such as
> i=trstarts[thread_id] is actually a particular type of IS (hardwired for
> performance).
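> >>
> >> In IS terms, Shri's hardwired ranges correspond to something like the
> >> following (sketch, using the existing stride IS):
> >>
> >>   IS part;
> >>   /* entries trstarts[t] .. trstarts[t+1]-1 belong to thread t */
> >>   ISCreateStride(PETSC_COMM_SELF, trstarts[t+1]-trstarts[t],
> >>                  trstarts[t], 1, &part);
> >>   /* pass (vec, part) to the kernel instead of a bare offset */
> >>   ISDestroy(&part);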
> >
> > A bit of a spoiler for the actual job runtime (more brainstorming than
> complete suggestions):
> > I can imagine submitting the Vec, the IS, and the type of job to the
> scheduler, possibly including some hints on the type of operations to
> follow. One may even enforce a certain type of device here, even though
> this requires the scheduler to move the data into place first. In this way
> one can perform smaller tasks on the respective CPU core (if we keep track
> of affinity information), and offload larger tasks to an available
> accelerator if possible. (Note that this is the main reason why I don't
> want to hide buffers in library-specific derived classes of Vec). The
> scheduler can use simple heuristics on where to perform operations based on
> typical latencies (e.g. ~20us for a GPU kernel).
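> >
> > As a rough sketch of such a submission (all names hypothetical):
> >
> >   typedef struct {
> >     Vec      v;      /* data the task operates on              */
> >     IS       part;   /* which entries of v the task touches    */
> >     PetscInt job;    /* operation id, e.g. JOB_AXPY            */
> >     PetscInt hint;   /* preferred device: CPU core, GPU, "any" */
> >   } SchedTask;
> >
> >   SchedulerSubmit(tcomm, &task);  /* picks a device from size, affinity, latency */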
> >
> >
> >>    So, could we use a single kernel launcher for multi-core, CUDA,
> OpenCL based on this principle? Then VecCUDAGetArray() type things would
> keep track of parts of Vecs based on IS instead of all entries in the Vec.
>  Similarly there would be a VecMultiCoreGetArray(). Whenever possible the
> VecXXXGetArray() would not require copies.    As part of this model I'd
> also like to separate the "moving needed data" part of the kernel from the
> "computation on the data" so that everything doesn't block when data is
> being moved around.
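> >>
> >> One possible shape for such a launcher (hypothetical names, just to fix
> >> ideas; note the explicit split between moving and computing):
> >>
> >>   KernelPrefetch(v, is, DEVICE_GPU, &req); /* start moving only the is-part      */
> >>   /* ... launch or queue unrelated work here ... */
> >>   KernelWait(req);                         /* the is-part is now resident on GPU */
> >>   KernelRun(v, is, mykernel);              /* compute on exactly that part       */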
> >
> > Yes, we could/can. A single kernel launcher also allows for fusing
> kernels, e.g. matrix-vector-product followed by an inner product of the
> result vector with some other vector. As outlined above, asynchronous data
> movement could even be the default rather than the exception, except for
> cases where one gives control over the data to the outside by e.g.
> returning a pointer to the array. In such cases one would first have to
> wait for all operations on all data to finish.
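> >
> > For instance, the pair
> >
> >   MatMult(A, p, w);       /* w = A*p        */
> >   VecDot(p, w, &alpha);   /* alpha = (p, w) */
> >
> > could be issued as one fused task, so w never has to leave the device
> > between the two calls (a sketch of the intent, not an existing interface).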
> >
> > The main concern in all that is the readiness of the user. Awareness of
> asynchronous operations keeps rising, yet I can imagine user code like
> >
> > PetscScalar *data;
> > VecXYZGetArray(v1, &data);   // flushes the queue suitably
> > VecDot(v2, v3, &data[0]);    // enqueues VecDot, writing into data[0]
> > PetscScalar s = data[0];     // VecDot may not be finished!
> >
> > where a pointer given away once undermines everything.
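> >
> > The safest convention is probably to require the usual restore before the
> > next queued operation that touches the result, e.g. (sketch):
> >
> >   VecXYZRestoreArray(v1, &data);  // give control back; the queue may run asynchronously again
> >   VecDot(v2, v3, &s);             // the library now knows when it has to synchronize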
> >
> >
> >>    Ok, how about moving this same model up to the MPI level? We already
> do this with IS converted to VecScatter (for performance) for updating
> ghost points (for matrix-vector products, for PDE ghost points etc) (note
> we can hide the VecScatter inside the IS and have it created as needed).
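> >>
> >> The move/compute split is already there in the existing interface, e.g.
> >>
> >>   VecScatterBegin(ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);
> >>   /* ... overlap: work on entries that need no ghost values ... */
> >>   VecScatterEnd(ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);
> >>   /* ... now work on the entries that use the ghost values ... */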
> >
> > I'm afraid I can't contribute to the MPI discussion yet, as I don't know
> enough about how things are handled now...
> >
> > Best regards,
> > Karli
> >
> >
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener