[petsc-dev] Slow ViennaCL performance on KSP ex12

Mon Oct 12 15:25:31 CDT 2015

Here is the code: http://github.com/afd-illinois/grim, branch:opencl.

Building the kernel with ioc64 reports that it was vectorized:

manic at bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu'

No command specified, using 'build' as default

Using build options: -DOPENCL
Setting target instruction set architecture to: Default (Advanced Vector Extension (AVX))
OpenCL Intel CPU device was found!
Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
Device version: OpenCL 1.2 (Build 44)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <ComputeResidual> was successfully vectorized
Done.
Build succeeded!

Cheers,
Mani

On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>>
>> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>> >
>>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com>
>>> wrote:
>>> > Hi Karl,
>>> >
>>> > My motivation was to avoid duplicating code for the CPU and the GPU.
>>> > This is important considering that it takes a long time to test and make
>>> > sure the code produces the right results.
>>> >
>>> > I guess, I can add a switch in my code with something like:
>>> >
>>> > if (usingCPU) use VecGetArray()
>>> >
>>> > else if (usingGPU) use VecViennaCLGetArray()
>>> >
>>> > and then wrap the pointers that the above functions return with OpenCL
>>> > buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and
>>> > CL_ALLOC_.. for GPU).
>>> >
>>> > Hopefully, this will avoid unnecessary data transfers.
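
A minimal sketch of the switch described above, written in plain C++ so that it runs without PETSc, ViennaCL, or an OpenCL runtime. The names `DeviceKind`, `KernelBuffer`, and `wrap_for_kernel` are hypothetical and only model the idea: on the CPU the kernel-side buffer aliases the host array (the role a host-pointer memory flag would play), while for the GPU a separate allocation is filled by a copy. It is not the PETSc or ViennaCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; none of these are PETSc, ViennaCL, or OpenCL types.
enum class DeviceKind { CPU, GPU };

struct KernelBuffer {
  double*             host;       // the caller's host array
  std::vector<double> staging;    // separate storage, filled only when a copy is made
  bool                zero_copy;  // true when the kernel works directly on host memory

  // Pointer the kernel would read and write.
  double* data() { return zero_copy ? host : staging.data(); }
};

// CPU: alias the host array, so no transfer is needed.
// GPU: copy into separate storage, which stands in for device memory here.
inline KernelBuffer wrap_for_kernel(double* host, std::size_t n, DeviceKind kind) {
  KernelBuffer b{host, {}, true};
  if (kind == DeviceKind::GPU) {
    b.staging.assign(host, host + n);  // models the host-to-device transfer
    b.zero_copy = false;
  }
  return b;
}
```

With this shape, the residual code only ever sees `data()`; whether a transfer happened is decided once, at wrap time, by the device kind.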
>>> >
>>> > I do not understand this comment at all. This looks crazy to me. The
>>> > whole point of having Vec is so that no one ever ever ever ever does
>>> > anything like this. I saw nothing in the thread that would compel you
>>> > to do this. What are you trying to accomplish with this switch?
>>>
>>>
>> I'm trying to assemble the residual needed for SNES using an OpenCL
>> kernel. The kernel operates on OpenCL buffers which can either live on the
>> CPU or the GPU.
>>
>> I think it is useful to use OpenCL on the CPU basically because of
>> vectorization and vector data types. If I had to write usual C code, I'd
>> have to use all sorts of pragmas in icc to get the code to vectorize and
>> even then it's pretty hard.
>>
>
> I would completely agree with you, if I thought the compiler actually
> vectorized that code. I do not think that
> is the case. Is there an example you have where you get vectorized
> assembly?
>
>   Thanks,
>
>     Matt
>
>
>> Mani
>>
>>
>>>   Matt,
>>>
>>>      The current OpenCL code in PETSc is hardwired for GPU usage. So the
>>> correct fix, I believe, is to add to the VecViennaCL wrappers support for
>>> either using the GPU or the CPU.
>>>
>>>    Barry
>>>
>>> >
>>> >   Matt
>>> >
>>> > Cheers,
>>> > Mani
>>> >
>>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at>
>>> wrote:
>>> > Hi Mani,
>>> >
>>> > > Following
>>> http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>> > (page 16), I ran KSP ex12 for two cases:
>>> >
>>> > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>> >
>>> > real    0m0.213s
>>> > user    0m0.206s
>>> > sys     0m0.004s
>>> >
>>> > 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>> > -log_summary > log_summary_with_viennacl
>>> >
>>> > real    0m20.296s
>>> > user    0m46.025s
>>> > sys     0m1.435s
>>> >
>>> > The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from
>>> > AMD-APP-SDK-v3.0.
>>> >
>>> > There are a couple of things to note here:
>>> >
>>> > a) The total execution time contains the OpenCL kernel compilation
>>> > time, which is on the order of one or two seconds. Thus, you need much
>>> > larger problem sizes to get a good comparison.
>>> >
>>> > b) Most of the execution time is spent on VecMDot, which is optimized
>>> > for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
>>> > because one can use just plain C/C++/whatever).
>>> >
>>> > c) My experiences with this AMD APU are quite mixed, as I've never
>>> > found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU
>>> > part. The integrated GPU, however, reached 80% without much effort. This is
>>> > particularly remarkable as both CPU and GPU share the same DDR3 memory
>>> > link. Thus, it is more than unlikely that you will ever beat the
>>> > performance of PETSc's native types.
>>> >
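
Point a) above can be made concrete with a small calculation. The helper below is a hypothetical illustration, not part of PETSc or ViennaCL: it returns the share of a wall-clock measurement that is fixed overhead. Assuming, purely for illustration, a compilation cost of 1 s (the low end of the estimate above) and a solve that takes as long as the entire native run (0.213 s), the overhead would account for more than 80% of the measured time.

```cpp
#include <cassert>
#include <cmath>

// Fraction of a wall-clock measurement that is fixed overhead rather than solve work.
// Both arguments are in seconds and are supplied by the caller; nothing is measured here.
inline double overhead_fraction(double fixed_overhead_s, double work_s) {
  return fixed_overhead_s / (fixed_overhead_s + work_s);
}
```

The fraction shrinks as `work_s` grows, which is why a larger problem size gives a fairer comparison.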
>>> >
>>> >
>>> > Attached are:
>>> > 1) configure.log for the petsc build
>>> > 2) log summary without viennacl
>>> > 3) log summary with viennacl
>>> > 4) OpenCL info for the system on which the runs were performed
>>> >
>>> > Perhaps the reason for the slow performance is superfluous copies
>>> > being performed, which need not occur when running ViennaCL on the CPU.
>>> > Looking at
>>> >
>>> > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx :
>>> >
>>> > /* Copies a vector from the CPU to the GPU unless we already have an
>>> >    up-to-date copy on the GPU */
>>> > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>> > {
>>> >    PetscErrorCode ierr;
>>> >
>>> >    PetscFunctionBegin;
>>> >    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>> >    if (v->map->n > 0) {
>>> >      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>> >        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>> >        try {
>>> >          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>> >          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>>> >          ViennaCLWaitForGPU();
>>> >        } catch(std::exception const & ex) {
>>> >          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>> >        }
>>> >        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>> >        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>> >      }
>>> >    }
>>> >    PetscFunctionReturn(0);
>>> > }
>>> >
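
The validity-flag logic in the function above can be modeled in plain C++ without PETSc or ViennaCL. This is a sketch under stated assumptions: `Valid`, `MirroredVec`, `copy_to_device`, and `write_on_host` are hypothetical names, the "device" array is ordinary host memory, and only the two states that appear in the quoted code (host copy current, both copies current) are modeled. It shows that a transfer happens only when the host copy is the sole up-to-date one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the role played by valid_GPU_array in the quoted code.
enum class Valid { HostOnly, Both };

struct MirroredVec {
  std::vector<double> host;
  std::vector<double> device;        // stand-in for device memory
  Valid valid = Valid::HostOnly;
  int   copies_to_device = 0;        // counts transfers, for illustration only
};

// Copy host -> device only if the device copy is stale.
inline void copy_to_device(MirroredVec& v) {
  if (v.host.empty()) return;              // nothing to transfer (mirrors the n > 0 check)
  if (v.valid != Valid::HostOnly) return;  // device copy is already up to date
  v.device = v.host;                       // models the fast_copy call
  ++v.copies_to_device;
  v.valid = Valid::Both;
}

// A host-side write invalidates the device copy.
inline void write_on_host(MirroredVec& v, std::size_t i, double x) {
  v.host[i] = x;
  v.valid = Valid::HostOnly;
}
```

Calling `copy_to_device` twice in a row transfers once; a transfer is repeated only after a host-side write marks the device copy stale.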
>>> > When running ViennaCL with OpenCL on the CPU, the above function should
>>> > maybe be modified?
>>> >
>>> > Unfortunately that is quite hard: OpenCL manages its own memory
>>> > handles, so 'injecting' memory into an OpenCL kernel that is not allocated
>>> > by the OpenCL runtime is not recommended, fairly tricky, and still involves
>>> > some overhead. As I see no reason to run OpenCL on a CPU, I refrained from
>>> > adding this extra code complexity.
>>> >
>>> > Overall, I recommend rerunning the benchmark on more powerful discrete
>>> > GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>>> > performance benefits.
>>> >
>>> > Hope this sheds some light on things :-)
>>> >
>>> > Best regards,
>>> > Karli
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> > -- Norbert Wiener
>>>
>>>
>>
>
>
>