[petsc-dev] Slow ViennaCL performance on KSP ex12

Mani Chandra mc0710 at gmail.com
Mon Oct 12 16:56:04 CDT 2015


Hi Karl,

Thanks for your input.

I've actually had no trouble getting OpenCL code to vectorize on Intel
platforms. Their support (I only tested on CPUs) seems to be pretty good. In
contrast, I had to try quite hard to get plain C code to vectorize, and even
then the result was quite fragile.

I agree completely with you that a single kernel might not give the best
performance on both CPUs and GPUs. I'm only following general guidelines,
though: tile blocking using local memory (which may not be ideal for CPUs),
ensuring contiguous memory accesses, and making sure pointer starting
addresses are aligned to 64 bytes, etc.

Cheers,
Mani


On Mon, Oct 12, 2015 at 2:34 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:

> Hi,
>
>
>>      > My motivation was to avoid duplicating code for the CPU and the
>>     GPU. This is important considering that it takes a long time to test
>>     and make sure the code produces the right results.
>>      >
>>      > I guess, I can add a switch in my code with something like:
>>      >
>>      > if (usingCPU) use VecGetArray()
>>      >
>>      > else if (usingGPU) use VecViennaCLGetArray()
>>      >
>>      > and then wrap the pointers that the above functions return with
>>     OpenCL buffers with the appropriate memory flags (CL_USE_HOST_PTR
>>     for CPU and CL_ALLOC_.. for GPU)
>>      >
>>      > Hopefully, this will avoid unnecessary data transfers.
>>      >
>>      > I do not understand this comment at all. This looks crazy to me.
>>     The whole point of having Vec
>>      > is so that no one ever ever ever ever does anything like this. I
>>     saw nothing in the thread that would
>>      > compel you to do this. What are you trying to accomplish with
>>     this switch?
>>
>>        Matt,
>>
>>           The current OpenCL code in PETSc is hardwired for GPU usage.
>>     So the correct fix, I believe, is to add to the VecViennaCL wrappers
>>     support for either using the GPU or the CPU.
>>
>>
>> Yes, that is an option.
>>
>> I thought the upshot of Karl's mail was that while this is possible,
>> OpenCL CPU performance is woeful and unlikely to improve, and
>> a better option is to use the current code with multiple MPI processes
>> and the PETSc type mechanism.
>>
>
>
> Yes, from all the time I've spent with OpenCL on the CPU I could only
> conclude that it is better to use the native PETSc Vec for CPU-based
> execution. Incomplete list of reasons:
>  - Vectorization with OpenCL is not much easier than with plain C. Yes,
> OpenCL has vector datatypes, but the compiler might do different kinds of
> dirty witchcraft to try to vectorize over work items and still fail
> utterly. You might just be better off using intrinsics to have explicit
> control over what you want to (and can) vectorize.
>  - Workgroup sizes are sometimes hard to get right on the CPU. The various
> SDKs (and worse, the different versions of the SDKs) have different ways of
> mapping OpenCL work items to hardware.
>  - Performance on Xeon Phis is disastrous.
>  - No good notion of data locality (same problems as OpenMP).
>  - Code for GPUs and CPUs is different even when using a common OpenCL
> kernel language. You don't get performance portability for free unless you
> only have simple kernels.
>
> Mani, the best way to deal with CPU vs. GPU (imho) is indeed to have two
> different implementations. Whether you dispatch via if/else in the residual
> evaluation or register different function pointers with SNES is up to
> you. :-)
>
> Best regards,
> Karli
>
>