[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Jed Brown jedbrown at mcs.anl.gov
Sat Oct 6 22:01:12 CDT 2012


On Sat, Oct 6, 2012 at 9:16 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Still, one may 'fake' a std::deque by holding meta-information about the
> physical memory pages nevertheless, yet allocate a (virtually) linear piece
> of memory to keep compatibility with XYZGetArray(). This would allow some
> nice optimizations for the threading scheduler, as threads may operate on
> 'nearer' pages first.
>

So we sort of have this in the affinity associated with the thread
communicator. We expect first-touch semantics (I wish we could use Linux's
libnuma to be more explicit) so the page mapping can be computed using the
page size and affinity.
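
For concreteness, a minimal sketch of that computation (the helper and its
arguments are made up for illustration, not part of threadcomm; it assumes a
contiguous split of n entries over nthreads and first-touch placement):

  #include <stddef.h>
  #include <unistd.h>

  /* Sketch only: the thread expected to own the page containing entry i,
   * given that the thread touching the first entry on a page places it. */
  static int PageOwner(size_t i, size_t n, int nthreads, size_t elem_size)
  {
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t first    = ((i * elem_size) / pagesize) * pagesize / elem_size; /* first entry on that page */
    for (int t = 0; t < nthreads; t++) {               /* contiguous split: [t*n/P, (t+1)*n/P) */
      if (first < ((size_t)(t + 1) * n) / (size_t)nthreads) return t;
    }
    return nthreads - 1;
  }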

In my opinion, the critical issue of separate versus unified memory arises
in kernels like MatMult. If we partition the rows of the matrix among
threads (I don't want to assume a contiguous partition because I want to
share cache, but this argument works anyway), we generally need access to
parts of the vector that are not in our local segment. In general, we may
need to access entries from all segments of the vector. If we are not going
to expose vectors via contiguous storage, we will be obligated to partition
our matrices into explicit column segments, add a level of indirection to
vector access, or explicitly copy the ghost parts into thread-local buffers.
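
To make the access pattern concrete, here is a rough CSR sketch (not
MatMult_SeqAIJ itself; the row-range arguments and array names are
illustrative):

  #include <petscsys.h>

  /* Sketch only: a thread multiplies its row range [rstart,rend), but the
   * column indices aj[] can point into any segment of x, so x has to be
   * addressable as one contiguous shared array. */
  static void MatMultRows(PetscInt rstart, PetscInt rend,
                          const PetscInt *ai, const PetscInt *aj,
                          const PetscScalar *aa,
                          const PetscScalar *x, PetscScalar *y)
  {
    for (PetscInt i = rstart; i < rend; i++) {
      PetscScalar sum = 0.0;
      for (PetscInt k = ai[i]; k < ai[i+1]; k++) sum += aa[k] * x[aj[k]]; /* may read any segment */
      y[i] = sum;
    }
  }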

I don't like any of these options because of implementation complexity and
because the ghost regions are much larger than the interiors as we strong
scale. Note that the principal reason for using threads in the first place
was to avoid needing these copies. If we were going to always copy into
local buffers, we could use an MPI-type model all the way down. The
defining feature of threads is that they can use a common address space.

MatMultTranspose is somewhat more complicated since the output space is
overlapping: with a row partition of the matrix, every thread scatters
contributions into the shared result vector, so it needs either atomics or
per-thread accumulation buffers that are summed afterward.


> Btw: Shri's threading communicator is almost identical to the OpenCL model
> (with the latter having a few additional capabilities).
>
>
Can you elaborate on this? It is still a good time to refine our threadcomm
model.


> A bit of a spoiler for the actual job runtime (more brainstorming than
> complete suggestions):
> I can imagine submitting the Vec, the IS, and the type of job to the
> scheduler, possibly including some hints on the type of operations to
> follow. One may even enforce a certain type of device here, even though
> this requires the scheduler to move the data in place first. In this way
> one can perform smaller tasks on the respective CPU core (if we keep track
> of affinity information), and offload larger tasks to an available
> accelerator if possible. (Note that this is the main reason why I don't
> want to hide buffers in library-specific derived classes of Vec). The
> scheduler can use simple heuristics on where to perform operations based on
> typical latencies (e.g. ~20us for a GPU kernel)
>

Where does the scheduler go in this approach? Some users have a TS with a
small number of degrees of freedom (e.g. reduced chemistry), so we want to
keep "nothread" kernel launch down in the range similar to an indirect
function call.

I think it would be good for the GPU subtype to have a threshold and fall
back to a CPU implementation for small enough sizes, but I'm reluctant to
have a scheduler looking at all Vec implementations.
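
Sketched, the threshold idea would look something like this (VecDot_GPU and
VEC_GPU_MIN_N are hypothetical names, not the existing CUDA subtype):

  #include <petscvec.h>

  #define VEC_GPU_MIN_N 4096  /* assumed tunable cutoff */

  static PetscErrorCode VecDot_GPU(Vec x, Vec y, PetscScalar *z)
  {
    PetscErrorCode ierr;
    PetscInt       n;

    PetscFunctionBegin;
    ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
    if (n < VEC_GPU_MIN_N) {
      /* too small to amortize a kernel launch: dispatch to the host (Seq) kernel */
    } else {
      /* launch the device kernel and reduce on the GPU */
    }
    PetscFunctionReturn(0);
  }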


> Yes, we could/can. A single kernel launcher also allows for fusing
> kernels, e.g. matrix-vector-product followed by an inner product of the
> result vector with some other vector. As outlined above, asynchronous data
> movement could even be the default rather than the exception, except for
> cases where one gives control over the data to the outside by e.g.
> returning a pointer to the array. In such cases one would have to first wait
> for all operations on all data to finish.
>
> The main concern in all that is the readiness of the user. Awareness for
> asynchronous operations keeps rising, yet I can imagine user code like
>
>  PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably
>  data[0] = VecDot(v2, v3);                // enqueues VecDot
>  PetscScalar s = data[0];                 // VecDot may not be finished!
>
> where a pointer given away once undermines everything.
>

This is why the VecDot kernel does not return a raw result. You can call
VecDot_kernel() collectively from another kernel, but you pass in a
PetscThreadReduction and only reduce it when you need the result.
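
Roughly, the pattern is the following (made-up names and layout, just to
illustrate the deferred reduction; these are not the threadcomm data
structures):

  typedef struct {
    int     nthreads;
    double *partial;    /* one slot per thread, filled inside the kernel */
  } ThreadReduction;

  /* called collectively by each thread from inside another kernel */
  static void VecDot_kernel_sketch(int trank, int n0, int n1,
                                   const double *x, const double *y,
                                   ThreadReduction *red)
  {
    double sum = 0.0;
    for (int i = n0; i < n1; i++) sum += x[i] * y[i];
    red->partial[trank] = sum;         /* post the partial result, do not combine yet */
  }

  /* reduce only when the caller actually needs the scalar */
  static double ThreadReductionEnd_sketch(ThreadReduction *red)
  {
    double sum = 0.0;
    for (int t = 0; t < red->nthreads; t++) sum += red->partial[t];
    return sum;
  }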

If a non-kernel caller wants an asynchronous interface, they would use
VecDotBegin() and VecDotEnd(). Note that several reductions (dot products,
norms, etc) can be queued up with VecXXBegin() calls, yet only one
reduction (optionally also asynchronous) is performed.
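
For a caller outside the kernels, the split-phase usage is roughly (x and y
assumed to be existing Vecs):

  PetscScalar    dot;
  PetscReal      norm;
  PetscErrorCode ierr;

  ierr = VecDotBegin(x, y, &dot);CHKERRQ(ierr);        /* queue the dot product */
  ierr = VecNormBegin(x, NORM_2, &norm);CHKERRQ(ierr); /* queue the norm */
  /* ... other work can overlap here ... */
  ierr = VecDotEnd(x, y, &dot);CHKERRQ(ierr);          /* single combined reduction */
  ierr = VecNormEnd(x, NORM_2, &norm);CHKERRQ(ierr);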