[petsc-dev] Current status: GPUs for PETSc

Matthew Knepley knepley at gmail.com
Sun Nov 4 20:21:10 CST 2012


On Sun, Nov 4, 2012 at 8:51 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi guys,
>
> I've made quite some progress with my unification approach for GPUs and
> threading. Here's how my toy framework is set up:
>

Very cool. I did not expect the large AMD overheads, but the results below
make the current CUDA strategy look
pretty good, and an OpenCL strategy look fine for large problems.

I have a more basic question. I saw the previous implementation in very
simple terms:

  1) Replication of CPU vectors on the GPU

  2) A coherence policy

The nice thing here is how robust it is. We really do not have to commit to
any implementation because
the CPU part can always pick up the slack. From what is written below, I
cannot understand how the
"coherence policy" works.

Let's use an example to explain it to me. Say you have a CUDA vector and you
want to execute VecPointwiseMult() with another Vec, but that operation is
not part of your CUDA implementation. What happens?
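
For concreteness, here is the kind of fallback I imagine, written against your
toy framework (the copy helper below is invented, just to make the question
precise):

  void GenericPointwiseMult(Vector &v1, Vector &v2, Vector &v3)
  {
    /* suppose there is no CUDA kernel for this operation */
    if (v2.my_handle.active_handle == CUDA_MEMORY)
      copy_cuda_to_main_memory(v2.my_handle);    /* invented sync helper */
    if (v3.my_handle.active_handle == CUDA_MEMORY)
      copy_cuda_to_main_memory(v3.my_handle);

    perform_pointwise_mult_in_main_memory();     /* CPU picks up the slack */
    v1.my_handle.active_handle = MAIN_MEMORY;    /* device copy of v1 is stale */
  }

Is that roughly what happens, i.e. does every operation check and trigger the
transfer itself, or is there a central coherence layer doing this?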

   Thanks,

      Matt


> Each (linear algebra) object is equipped with a memory handle, e.g.
>
> struct Vector {
>    /* other stuff here */
>
>    memory_handle my_handle;
> };
>
> The handle itself, however, is not a priori bound to a particular domain,
> but rather holds a collection of raw handles. Ignoring implementation
> details, this looks as follows:
>
> struct memory_handle {
>    void *      cpu_data;
>    cl_mem   opencl_data;
>    void *     cuda_data;
>    size_t active_handle;
> };
>
> Right now, active_handle is used to indicate which of the handles holds
> the current data. This could be further refined to holding a valid-flag for
> each of the raw pointers. Adding all the management logic, kernels, etc.,
> one then obtains user code comparable to
>
>   Vector v1, v2, v3;
>   // ...
>   GenericVectorAdd(v1, v2, v3); //C++: v1 = v2 + v3;
>
> Now, GenericVectorAdd() can dispatch into the respective memory region
> (maybe using a delegator here):
>
>   switch(v1.my_handle.active_handle){
>     case   MAIN_MEMORY:  perform_add_in_main_memory();   break;
>     case OPENCL_MEMORY:  perform_add_in_opencl_memory(); break;
>     case   CUDA_MEMORY:  perform_add_in_cuda_memory();   break;
>     default:             error_handling();               break;
>   }
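>
> For illustration, the valid-flag refinement mentioned above could simply
> replace active_handle by one flag per raw pointer:
>
>   struct memory_handle {
>      void *      cpu_data;  int    cpu_valid;
>      cl_mem   opencl_data;  int opencl_valid;
>      void *     cuda_data;  int   cuda_valid;
>   };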
>
> For the cases where multiple handles hold current data, appropriate
> prioritizations can be applied (a sketch follows below). Also, for each
> memory region one can further dispatch into other libraries:
>
> void perform_add_in_main_memory(...) {
>    if (with_thread_pool) { ... }
>    if (with_blas_lib1)   { ... }
>    ...
> }
>
> and similarly for OpenCL and CUDA. Perhaps most important for PETSc,
> bindings to CUSparse can be added in a straightforward manner without
> duplicating code from the CUSP bindings, provided that memory handles are
> not entirely hidden in each of the libraries.
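>
> As for the prioritization mentioned earlier: with the per-pointer
> valid-flags, GenericVectorAdd() could, for instance, prefer whichever memory
> region already holds current data for both operands (just a sketch):
>
>   if      (v2.my_handle.cuda_valid   && v3.my_handle.cuda_valid)
>     perform_add_in_cuda_memory();
>   else if (v2.my_handle.opencl_valid && v3.my_handle.opencl_valid)
>     perform_add_in_opencl_memory();
>   else
>     perform_add_in_main_memory();   /* may first require a copy to the host */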
>
> ----------
>
> And now for some benchmarks I've run within the multi-backend framework
> using double precision. An AMD Radeon HD 7970 is now operational, competing
> with an NVIDIA GTX 285 and an Intel Xeon X5550 using a single thread. The
> NVIDIA GPU is run both with OpenCL and CUDA, leading to quite interesting
> differences.
>
> The first benchmark (vector.png) is a simple vector addition of the form x
> = ay + bz for various vector sizes. The AMD Radeon is expected to
> outperform NVIDIA GPUs for large problem sizes due to higher memory
> bandwidth (264 GB/sec vs. 159 GB/sec). However, it suffers from large
> overhead at smaller vector sizes (the kink at about 10k is due to the
> higher number of repetitions at smaller sizes). CUDA code at small vector
> sizes is about a factor of 2 faster than an OpenCL implementation on the same
> hardware. The CPU is the best choice for anything below ~10k entries.
>
> Next, sparse matrix-vector multiplications are compared in spmv.png. The
> matrix is obtained from a finite difference discretization of the unit
> square in 2d, and the unit cube in 3d, respectively, using lexicographical
> ordering (no bandwidth reductions such as Cuthill-McKee). Timings for the
> sparse formats (CSR, vectorized CSR, COO, ELL, HYB) were recorded, and the
> fastest among these is shown in the graph for each case. The AMD GPU
> ultimately gives the best performance at large problem sizes, but shows
> some weird performance characteristics in 2d. CUDA again has a lower
> overhead at smaller problem sizes, but otherwise shows the same performance
> as OpenCL. The 2d OpenCL performance on the NVIDIA GPU is qualitatively the
> same as in 3d, I just forgot to transfer the data from the lab. The CPU
> implementation suffers from heavy cache misses in 3d at larger problem
> sizes.
>
> Finally, something more practical, yet still a bit synthetic: Timings for
> 50 CG iterations with standard BLAS kernels (no fusing of vector
> operations). Surprisingly, the overhead of OpenCL on the AMD GPU now even
> becomes an order of magnitude at small problem sizes. This is mostly due to
> the transfer of vector norms for error checking, so the AMD driver seems to
> have quite some problems with that. Also, the difference between OpenCL and
> CUDA is now more pronounced at smaller systems, even though asymptotically
> the performance is the same. The CUDA implementation failed at large
> problem sizes; I still need to investigate that. Still, the trend of
> matching with OpenCL at larger problem sizes is readily visible.
>
> That's it for now; after some more refining I'll start with a careful
> migration of the code/concepts into PETSc. Comments are, of course, always
> welcome.
>
> Best regards,
> Karli
>
>
> PS: Am I the only one who finds the explicit requirement of nvcc for
> CUDA code awful? Sure, it is convenient to write CUDA device code inline,
> but I wasn't able to get around nvcc by using precompiled PTX code with any
> reasonable effort. Is there anybody with more luck/skill on petsc-dev?
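>
> (To be concrete, by "precompiled PTX" I mean compiling the kernels to PTX
> offline and then loading them through the CUDA driver API, roughly
>
>    /* after the usual cuInit()/context setup: */
>    CUmodule   module;
>    CUfunction kernel;
>    void      *args[] = { &d_x, &d_y, &d_z, &n };  /* device buffers + size,
>                                                      allocated beforehand   */
>
>    cuModuleLoad(&module, "kernels.ptx");          /* PTX compiled offline   */
>    cuModuleGetFunction(&kernel, module, "vector_add");
>    cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1,
>                   0 /*shared mem*/, NULL /*stream*/, args, NULL);
>
> where "kernels.ptx", "vector_add" and the launch configuration are only
> placeholders. It is this route that I could not get to work with reasonable
> effort.)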
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener