[petsc-dev] Current status: GPUs for PETSc
Karl Rupp
rupp at mcs.anl.gov
Sun Nov 4 19:51:26 CST 2012
Hi guys,
I've made quite some progress with my unification approach for GPUs and
threading. Here's how my toy framework is set up:
Each (linear algebra) object is equipped with a memory handle, e.g.
struct Vector {
  /* other stuff here */
  memory_handle my_handle;
};
The handle itself, however, is not a priori bound to a particular
memory domain, but rather holds a collection of raw handles. Ignoring
implementation details, this looks as follows:
struct memory_handle {
  void   *cpu_data;
  cl_mem  opencl_data;
  void   *cuda_data;
  size_t  active_handle;
};
Right now, active_handle indicates which of the raw handles holds the
current data. This could be further refined to a valid flag for each of
the raw pointers (a sketch follows after the dispatch example below).
Adding all the management logic, kernels, etc., one then obtains user
code comparable to
Vector v1, v2, v3;
// ...
GenericVectorAdd(v1, v2, v3); //C++: v1 = v2 + v3;
Now, GenericVectorAdd() can dispatch into the respective memory region
(maybe using a delegator here):
switch (v1.my_handle.active_handle) {
  case MAIN_MEMORY:   perform_add_in_main_memory();   break;
  case OPENCL_MEMORY: perform_add_in_opencl_memory(); break;
  case CUDA_MEMORY:   perform_add_in_cuda_memory();   break;
  default:            error_handling();               break;
}
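As an aside, a minimal sketch of the valid-flag refinement mentioned
above could look like the following (all names are hypothetical, not
actual framework API):

#include <CL/cl.h>

enum memory_region { MAIN_MEMORY, OPENCL_MEMORY, CUDA_MEMORY, INVALID_MEMORY };

struct memory_handle {
  void  *cpu_data;
  cl_mem opencl_data;
  void  *cuda_data;
  bool   cpu_valid;     /* true if cpu_data holds the current values */
  bool   opencl_valid;
  bool   cuda_valid;
};

/* pick a region among those holding current data, preferring the device */
memory_region active_region(const memory_handle &h) {
  if (h.cuda_valid)   return CUDA_MEMORY;
  if (h.opencl_valid) return OPENCL_MEMORY;
  if (h.cpu_valid)    return MAIN_MEMORY;
  return INVALID_MEMORY;
}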
For the cases where multiple handles hold current data, appropriate
prioritizations can be applied (e.g., preferring the device copy, as in
the sketch above). Also, for each memory region one can further
dispatch into other libraries:
void perform_add_in_main_memory(...) {
  if (with_thread_pool) { ... }
  if (with_blas_lib1)   { ... }
  ...
}
and similarly for OpenCL and CUDA. Perhaps most important for PETSc:
bindings to CUSparse can be added in a straightforward manner, without
duplicating code from the CUSP bindings, provided the memory handles
are not entirely hidden inside each of the libraries.
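To illustrate: with the raw cuda_data pointers accessible, a
CUSparse-based spmv reduces to a thin wrapper around the library call.
A hedged sketch, assuming the matrix already sits on the device in CSR
format (the csr_matrix_handle type and its field names are
hypothetical; the cusparseDcsrmv signature is the one from the
cusparse_v2 API):

#include <cusparse_v2.h>

/* hypothetical device-side CSR matrix built from raw cuda_data pointers */
struct csr_matrix_handle {
  int     rows, cols, nnz;
  double *values;       /* cuda_data of the nonzero values array */
  int    *row_offsets;  /* cuda_data of the CSR row pointer array */
  int    *col_indices;  /* cuda_data of the column index array */
};

/* y = A * x, with x and y taken from the cuda_data fields of vector handles */
void perform_spmv_in_cuda_memory(cusparseHandle_t ctx,
                                 const csr_matrix_handle &A,
                                 const double *x, double *y) {
  const double one = 1.0, zero = 0.0;
  cusparseMatDescr_t descr;
  cusparseCreateMatDescr(&descr);  /* defaults: general matrix, zero-based indices */
  cusparseDcsrmv(ctx, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 A.rows, A.cols, A.nnz, &one, descr,
                 A.values, A.row_offsets, A.col_indices,
                 x, &zero, y);
  cusparseDestroyMatDescr(descr);
}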
----------
And now for some benchmarks I've run within the multi-backend framework
using double precision. An AMD Radeon HD 7970 is now operational,
competing with an NVIDIA GTX 285 and an Intel Xeon X5550 running a
single thread. The NVIDIA GPU is run with both OpenCL and CUDA, leading
to quite interesting differences.
The first benchmark (vector.png) is a simple vector addition of the form
x = ay + bz for various vector sizes. The AMD Radeon is expected to
outperform the NVIDIA GPU at large problem sizes due to its higher
memory bandwidth (264 GB/sec vs. 159 GB/sec). However, it suffers from
large overhead at smaller vector sizes (the kink at about 10k entries is
due to the higher number of repetitions used at smaller sizes). At small
vector sizes, CUDA code is about a factor of 2 faster than an OpenCL
implementation on the same hardware. The CPU is the best choice for
anything below ~10k entries.
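For reference, the timing loop amortizes launch overhead by running more
repetitions for smaller vectors, roughly like the sketch below (the
repetition counts and the run_vector_add driver are illustrative only;
the change in repetition count is what produces the kink):

#include <chrono>
#include <cstddef>

void run_vector_add(std::size_t n);  /* hypothetical: one x = ay + bz launch */

/* average over repetitions; smaller vectors get more repetitions */
double time_vector_add(std::size_t n) {
  const std::size_t reps = (n < 10000) ? 1000 : 100;  /* illustrative values */
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t r = 0; r < reps; ++r)
    run_vector_add(n);
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / reps;  /* average seconds per repetition */
}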
Next, sparse matrix-vector multiplications are compared in spmv.png. The
matrix is obtained from a finite difference discretization of the unit
square in 2d and the unit cube in 3d, respectively, using
lexicographical ordering (no bandwidth reduction such as
Cuthill-McKee). Timings for the sparse formats (CSR, vectorized CSR,
COO, ELL, HYB) were recorded, and for each case the fastest among these
is shown in the graph. The AMD GPU ultimately gives the best
performance at large problem sizes, but shows some weird performance
characteristics in 2d. CUDA again has lower overhead at smaller
problem sizes, but otherwise shows the same performance as OpenCL. The
2d OpenCL performance on the NVIDIA GPU is qualitatively the same as in
3d; I just forgot to transfer the data from the lab. The CPU
implementation suffers from heavy cache misses in 3d at larger problem
sizes.
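For context, the CPU baseline is essentially the textbook CSR kernel
below (a sketch, not the exact benchmark code). With lexicographical
ordering, the 7-point stencil in 3d places column indices roughly n^2
apart on an n x n x n grid, so the gather x[col_ind[j]] sweeps a window
of about 2*n*n entries of x, which no longer fits in cache at larger
sizes:

/* y = A*x for a CSR matrix; the indirect access x[col_ind[j]] is what
   misses the cache for wide 3d stencils */
void csr_spmv(int rows, const int *row_ptr, const int *col_ind,
              const double *values, const double *x, double *y) {
  for (int row = 0; row < rows; ++row) {
    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
      sum += values[j] * x[col_ind[j]];
    y[row] = sum;
  }
}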
Finally, something more practical, yet still a bit synthetic: timings
for 50 CG iterations with standard BLAS kernels (no fusing of vector
operations). Surprisingly, the overhead of OpenCL on the AMD GPU now
even grows to an order of magnitude at small problem sizes. This is
mostly due to the transfer of the vector norms needed for the
convergence check, with which the AMD driver seems to have quite some
problems. Also, the difference between OpenCL and CUDA is now more
pronounced for smaller systems, even though the performance is
asymptotically the same. The CUDA implementation failed at large
problem sizes; I still need to investigate that. Still, the trend of
matching the OpenCL performance at larger problem sizes is readily
visible.
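To make the norm-transfer issue concrete, here is a sketch of one
unfused CG iteration in terms of hypothetical BLAS-level helpers
(declarations only, names invented for illustration): each dot() forces
a synchronizing device-to-host transfer, which is what dominates at
small sizes, while the axpy()/scal() kernels stay on the device.

struct Vector;  /* device vector built around a memory_handle */
struct Matrix;  /* device sparse matrix */

double dot (const Vector &x, const Vector &y);  /* blocks: result goes to host */
void   axpy(double a, const Vector &x, Vector &y);  /* y += a*x, on device */
void   scal(double a, Vector &x);                   /* x *= a,  on device */
void   spmv(const Matrix &A, const Vector &p, Vector &Ap);  /* Ap = A*p */

void cg_iteration(const Matrix &A, Vector &x, Vector &r, Vector &p,
                  Vector &Ap, double &rr /* current <r,r> */) {
  spmv(A, p, Ap);
  const double alpha = rr / dot(p, Ap);  /* device->host transfer #1 */
  axpy( alpha, p,  x);                   /* x += alpha*p  */
  axpy(-alpha, Ap, r);                   /* r -= alpha*Ap */
  const double rr_new = dot(r, r);       /* device->host transfer #2 (norm) */
  scal(rr_new / rr, p);                  /* p = r + beta*p, unfused: */
  axpy(1.0, r, p);                       /*   two separate kernels   */
  rr = rr_new;
}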
That's it for now, after some more refining I'll start with a careful
migration of the code/concepts into PETSc. Comments are, of course,
always welcome.
Best regards,
Karli
PS: Am I the only one who finds the explicit requirement of nvcc for
CUDA code awful? Sure, it is convenient to write CUDA device code
inline, but I wasn't able to get around nvcc by using precompiled PTX
code with any reasonable effort. Has anybody on petsc-dev had more
luck/skill with this?
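For reference, the loading side of the PTX route goes through the CUDA
driver API and compiles with a plain host compiler, no nvcc involved; a
minimal sketch follows (error checks omitted, kernel name and signature
hypothetical). Producing the PTX for nontrivial kernels without nvcc is
the part that remains awkward.

#include <cuda.h>

/* launch a precompiled PTX kernel of the form f(x, y, n) via the driver API */
void launch_from_ptx(CUdeviceptr x, CUdeviceptr y, unsigned int n) {
  CUdevice   dev;
  CUcontext  ctx;
  CUmodule   mod;
  CUfunction fun;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);
  cuModuleLoad(&mod, "vector_add.ptx");         /* PTX generated beforehand */
  cuModuleGetFunction(&fun, mod, "vector_add"); /* hypothetical kernel name */

  void *args[] = { &x, &y, &n };
  cuLaunchKernel(fun, (n + 255) / 256, 1, 1,    /* grid dimensions  */
                 256, 1, 1,                     /* block dimensions */
                 0, 0, args, 0);                /* shared mem, stream, params */
  cuCtxSynchronize();
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}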
Attachments:
  vector.png (image/png, 80272 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment.png>
  spmv.png (image/png, 113400 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment-0001.png>
  cg.png (image/png, 156137 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121104/4456525e/attachment-0002.png>