[petsc-dev] Current status: GPUs for PETSc
Karl Rupp
rupp at mcs.anl.gov
Sun Nov 4 19:51:26 CST 2012
Hi guys,
I've made quite some progress with my unification approach for GPUs and
threading. Here's how my toy framework is set up:
Each (linear algebra) object is equipped with a memory handle, e.g.
struct Vector {
  /* other stuff here */
  memory_handle my_handle;
};
The handle itself, however, is not a priori bound to a particular
memory domain, but rather holds a collection of raw handles. Ignoring
implementation details, this looks as follows:
struct memory_handle {
  void   *cpu_data;
  cl_mem  opencl_data;
  void   *cuda_data;
  size_t  active_handle;
};
Right now, active_handle indicates which of the raw handles holds the
current data. This could be further refined to a valid flag for each of
the raw pointers (a sketch follows after the dispatch example below).
Adding all the management logic, kernels, etc., one then obtains user
code comparable to
Vector v1, v2, v3;
// ...
GenericVectorAdd(v1, v2, v3); //C++: v1 = v2 + v3;
Now, GenericVectorAdd() can dispatch into the respective memory region
(maybe using a delegator here):
switch (v1.my_handle.active_handle) {
  case MAIN_MEMORY:   perform_add_in_main_memory();   break;
  case OPENCL_MEMORY: perform_add_in_opencl_memory(); break;
  case CUDA_MEMORY:   perform_add_in_cuda_memory();   break;
  default:            error_handling();               break;
}
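As an aside, a minimal sketch of the valid-flag refinement mentioned
above could look like the following (all names are hypothetical, not
actual framework API):

#include <CL/cl.h>

enum memory_region { MAIN_MEMORY, OPENCL_MEMORY, CUDA_MEMORY, INVALID_MEMORY };

struct memory_handle {
  void  *cpu_data;
  cl_mem opencl_data;
  void  *cuda_data;
  bool   cpu_valid;     /* true if cpu_data holds the current values */
  bool   opencl_valid;
  bool   cuda_valid;
};

/* pick a region among those holding current data, preferring the device */
memory_region active_region(const memory_handle &h) {
  if (h.cuda_valid)   return CUDA_MEMORY;
  if (h.opencl_valid) return OPENCL_MEMORY;
  if (h.cpu_valid)    return MAIN_MEMORY;
  return INVALID_MEMORY;
}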
For the cases where multiple handles hold current data, appropriate
prioritizations can be applied (e.g., preferring the device copy, as in
the sketch above). Also, for each memory region one can further
dispatch into other libraries:
void perform_add_in_main_memory(...) {
  if (with_thread_pool) { ... }
  if (with_blas_lib1)   { ... }
  ...
}
and similarly for OpenCL and CUDA. Perhaps most important for PETSc:
bindings to CUSparse can be added in a straightforward manner, without
duplicating code from the CUSP bindings, provided the memory handles
are not entirely hidden inside each of the libraries.
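To illustrate: with the raw cuda_data pointers accessible, a
CUSparse-based spmv reduces to a thin wrapper around the library call.
A hedged sketch, assuming the matrix already sits on the device in CSR
format (the csr_matrix_handle type and its field names are
hypothetical; the cusparseDcsrmv signature is the one from the
cusparse_v2 API):

#include <cusparse_v2.h>

/* hypothetical device-side CSR matrix built from raw cuda_data pointers */
struct csr_matrix_handle {
  int     rows, cols, nnz;
  double *values;       /* cuda_data of the nonzero values array */
  int    *row_offsets;  /* cuda_data of the CSR row pointer array */
  int    *col_indices;  /* cuda_data of the column index array */
};

/* y = A * x, with x and y taken from the cuda_data fields of vector handles */
void perform_spmv_in_cuda_memory(cusparseHandle_t ctx,
                                 const csr_matrix_handle &A,
                                 const double *x, double *y) {
  const double one = 1.0, zero = 0.0;
  cusparseMatDescr_t descr;
  cusparseCreateMatDescr(&descr);  /* defaults: general matrix, zero-based indices */
  cusparseDcsrmv(ctx, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 A.rows, A.cols, A.nnz, &one, descr,
                 A.values, A.row_offsets, A.col_indices,
                 x, &zero, y);
  cusparseDestroyMatDescr(descr);
}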
----------
And now for some benchmarks I've run within the multi-backend framework
using double precision. An AMD Radeon HD 7970 is now operational,
competing with an NVIDIA GTX 285 and an Intel Xeon X5550 running a
single thread. The NVIDIA GPU is run with both OpenCL and CUDA, leading
to quite interesting differences.
The first benchmark (vector.png) is a simple vector addition of the form
x = ay + bz for various vector sizes. The AMD Radeon is expected to
outperform the NVIDIA GPU at large problem sizes due to its higher
memory bandwidth (264 GB/sec vs. 159 GB/sec). However, it suffers from
large overhead at smaller vector sizes (the kink at about 10k entries is
due to the higher number of repetitions used at smaller sizes). At small
vector sizes, CUDA code is about a factor of 2 faster than an OpenCL
implementation on the same hardware. The CPU is the best choice for
anything below ~10k entries.
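For reference, the timing loop amortizes launch overhead by running more
repetitions for smaller vectors, roughly like the sketch below (the
repetition counts and the run_vector_add driver are illustrative only;
the change in repetition count is what produces the kink):

#include <chrono>
#include <cstddef>

void run_vector_add(std::size_t n);  /* hypothetical: one x = ay + bz launch */

/* average over repetitions; smaller vectors get more repetitions */
double time_vector_add(std::size_t n) {
  const std::size_t reps = (n < 10000) ? 1000 : 100;  /* illustrative values */
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t r = 0; r < reps; ++r)
    run_vector_add(n);
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / reps;  /* average seconds per repetition */
}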
Next, sparse matrix-vector multiplications are compared in spmv.png. The
matrix is obtained from a finite difference discretization of the unit
square in 2d and the unit cube in 3d, respectively, using
lexicographical ordering (no bandwidth reduction such as
Cuthill-McKee). Timings for the sparse formats (CSR, vectorized CSR,
COO, ELL, HYB) were recorded, and for each case the fastest among these
is shown in the graph. The AMD GPU ultimately gives the best
performance at large problem sizes, but shows some weird performance
characteristics in 2d. CUDA again has lower overhead at smaller
problem sizes, but otherwise shows the same performance as OpenCL. The
2d OpenCL performance on the NVIDIA GPU is qualitatively the same as in
3d; I just forgot to transfer the data from the lab. The CPU
implementation suffers from heavy cache misses in 3d at larger problem
sizes.
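For context, the CPU baseline is essentially the textbook CSR kernel
below (a sketch, not the exact benchmark code). With lexicographical
ordering, the 7-point stencil in 3d places column indices roughly n^2
apart on an n x n x n grid, so the gather x[col_ind[j]] sweeps a window
of about 2*n*n entries of x, which no longer fits in cache at larger
sizes:

/* y = A*x for a CSR matrix; the indirect access x[col_ind[j]] is what
   misses the cache for wide 3d stencils */
void csr_spmv(int rows, const int *row_ptr, const int *col_ind,
              const double *values, const double *x, double *y) {
  for (int row = 0; row < rows; ++row) {
    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
      sum += values[j] * x[col_ind[j]];
    y[row] = sum;
  }
}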
Finally, something more practical, yet still a bit synthetic: timings
for 50 CG iterations with standard BLAS kernels (no fusing of vector
operations). Surprisingly, the overhead of OpenCL on the AMD GPU now
even grows to an order of magnitude at small problem sizes. This is
mostly due to the transfer of the vector norms needed for the
convergence check, with which the AMD driver seems to have quite some
problems. Also, the difference between OpenCL and CUDA is now more
pronounced for smaller systems, even though the performance is
asymptotically the same. The CUDA implementation failed at large
problem sizes; I still need to investigate that. Still, the trend of
matching the OpenCL performance at larger problem sizes is readily
visible.
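To make the norm-transfer issue concrete, here is a sketch of one
unfused CG iteration in terms of hypothetical BLAS-level helpers
(declarations only, names invented for illustration): each dot() forces
a synchronizing device-to-host transfer, which is what dominates at
small sizes, while the axpy()/scal() kernels stay on the device.

struct Vector;  /* device vector built around a memory_handle */
struct Matrix;  /* device sparse matrix */

double dot (const Vector &x, const Vector &y);  /* blocks: result goes to host */
void   axpy(double a, const Vector &x, Vector &y);  /* y += a*x, on device */
void   scal(double a, Vector &x);                   /* x *= a,  on device */
void   spmv(const Matrix &A, const Vector &p, Vector &Ap);  /* Ap = A*p */

void cg_iteration(const Matrix &A, Vector &x, Vector &r, Vector &p,
                  Vector &Ap, double &rr /* current <r,r> */) {
  spmv(A, p, Ap);
  const double alpha = rr / dot(p, Ap);  /* device->host transfer #1 */
  axpy( alpha, p,  x);                   /* x += alpha*p  */
  axpy(-alpha, Ap, r);                   /* r -= alpha*Ap */
  const double rr_new = dot(r, r);       /* device->host transfer #2 (norm) */
  scal(rr_new / rr, p);                  /* p = r + beta*p, unfused: */
  axpy(1.0, r, p);                       /*   two separate kernels   */
  rr = rr_new;
}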
That's it for now, after some more refining I'll start with a careful
migration of the code/concepts into PETSc. Comments are, of course,
always welcome.
Best regards,
Karli
PS: Am I the only one who finds the explicit requirement of nvcc for
CUDA code awful? Sure, it is convenient to write CUDA device code
inline, but I wasn't able to get around nvcc by using precompiled PTX
code with any reasonable effort. Has anybody on petsc-dev had more
luck/skill with this?
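For reference, the loading side of the PTX route goes through the CUDA
driver API and compiles with a plain host compiler, no nvcc involved; a
minimal sketch follows (error checks omitted, kernel name and signature
hypothetical). Producing the PTX for nontrivial kernels without nvcc is
the part that remains awkward.

#include <cuda.h>

/* launch a precompiled PTX kernel of the form f(x, y, n) via the driver API */
void launch_from_ptx(CUdeviceptr x, CUdeviceptr y, unsigned int n) {
  CUdevice   dev;
  CUcontext  ctx;
  CUmodule   mod;
  CUfunction fun;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);
  cuModuleLoad(&mod, "vector_add.ptx");         /* PTX generated beforehand */
  cuModuleGetFunction(&fun, mod, "vector_add"); /* hypothetical kernel name */

  void *args[] = { &x, &y, &n };
  cuLaunchKernel(fun, (n + 255) / 256, 1, 1,    /* grid dimensions  */
                 256, 1, 1,                     /* block dimensions */
                 0, 0, args, 0);                /* shared mem, stream, params */
  cuCtxSynchronize();
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}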
Attachments:
  vector.png (image/png, 80272 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment.png>
  spmv.png (image/png, 113400 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment-0001.png>
  cg.png (image/png, 156137 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment-0002.png>