[petsc-dev] Slow ViennaCL performance on KSP ex12

Karl Rupp rupp at iue.tuwien.ac.at
Mon Oct 12 16:34:11 CDT 2015


Hi,

 >      > My motivation was to avoid duplicating code for the CPU and the
>     GPU. This is important considering that it takes a long time to test
>     and make sure the code produces the right results.
>      >
>      > I guess, I can add a switch in my code with something like:
>      >
>      > if (usingCPU) use VecGetArray()
>      >
>      > else if (usingGPU) use VecViennaCLGetArray()
>      >
>      > and then wrap the pointers that the above functions return with
>     OpenCL buffers with the appropriate memory flags (CL_USE_HOST_PTR
>     for CPU and CL_ALLOC_.. for GPU)
>      >
>      > Hopefully, this will avoid unnecessary data transfers.
>      >
>      > I do not understand this comment at all. This looks crazy to me.
>     The whole point of having Vec
>      > is so that no one ever ever ever ever does anything like this. I
>     saw nothing in the thread that would
>      > compel you to do this. What are you trying to accomplish with
>     this switch?
>
>        Matt,
>
>           The current OpenCL code in PETSc is hardwired for GPU usage.
>     So the correct fix, I believe, is to add to the VecViennaCL wrappers
>     support for either using the GPU or the CPU.
>
>
> Yes, that is an option.
>
> I thought the upshot of Karl's mail was that while this is possible,
> OpenCL CPU performance is woeful and unlikely to improve, and
> a better option is to use the current code with multiple MPI processes
> and the PETSc type mechanism.


Yes, from all the time I've spent with OpenCL on the CPU I could only 
conclude that it is better to use the native PETSc Vec for CPU-based 
execution. An incomplete list of reasons:
  - Vectorization with OpenCL is not much easier than with plain C. Yes, 
OpenCL has vector datatypes, but the compiler might do different kinds 
of dirty witchcraft to try to vectorize over work items and ultimately 
fail utterly. You might just be better off using intrinsics to have 
explicit control over what you want to (and can) vectorize (see the 
sketch after this list).
  - Workgroup sizes are sometimes hard to get right on the CPU. The 
various SDKs (and worse, the different versions of the SDKs) have 
different ways of mapping OpenCL work items to hardware.
  - Performance on Xeon Phis is disastrous.
  - No good notion of data locality (same problems as OpenMP).
  - Code for GPUs and CPUs is different even when using a common OpenCL 
kernel language. You don't get performance portability for free unless 
you only have simple kernels.
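To illustrate the first point, here is a minimal sketch (my addition, 
not part of the original mail) of the kind of explicit control 
intrinsics give you: a double-precision AXPY with AVX. It assumes AVX 
hardware, a length divisible by 4, and 32-byte-aligned arrays:

#include <immintrin.h>

/* y <- alpha*x + y; n must be a multiple of 4 and x, y 32-byte aligned
   (otherwise switch to the unaligned _mm256_loadu_pd/_mm256_storeu_pd) */
void axpy_avx(long n, double alpha, const double *x, double *y)
{
  __m256d va = _mm256_set1_pd(alpha);              /* broadcast alpha to all 4 lanes */
  for (long i = 0; i < n; i += 4) {
    __m256d vx = _mm256_load_pd(x + i);            /* load 4 doubles of x */
    __m256d vy = _mm256_load_pd(y + i);            /* load 4 doubles of y */
    vy = _mm256_add_pd(vy, _mm256_mul_pd(va, vx)); /* per-lane alpha*x + y */
    _mm256_store_pd(y + i, vy);                    /* write back 4 doubles */
  }
}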

Mani, the best way to deal with CPU vs. GPU (imho) is indeed to have two 
different implementations. Whether you dispatch based on if/else in the 
residual evaluation or whether you register different function pointers 
for SNES is up to you. :-)
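
A rough sketch of the second option (my addition; FormFunctionCPU and 
FormFunctionGPU are hypothetical callback names, and for brevity only 
the sequential ViennaCL Vec type is checked):

#include <petscsnes.h>

extern PetscErrorCode FormFunctionCPU(SNES, Vec, Vec, void *);
extern PetscErrorCode FormFunctionGPU(SNES, Vec, Vec, void *);

PetscErrorCode RegisterResidual(SNES snes, Vec x, void *ctx)
{
  PetscErrorCode ierr;
  PetscBool      isviennacl;

  PetscFunctionBeginUser;
  /* In parallel one would also check the MPI ViennaCL type (VECMPIVIENNACL) */
  ierr = PetscObjectTypeCompare((PetscObject)x, VECSEQVIENNACL, &isviennacl);CHKERRQ(ierr);
  if (isviennacl) {
    /* GPU path: the callback works on ViennaCL buffers, e.g. via
       VecViennaCLGetArray() as discussed above */
    ierr = SNESSetFunction(snes, NULL, FormFunctionGPU, ctx);CHKERRQ(ierr);
  } else {
    /* CPU path: the callback uses plain VecGetArray()/VecGetArrayRead() */
    ierr = SNESSetFunction(snes, NULL, FormFunctionCPU, ctx);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}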

Best regards,
Karli



