[petsc-dev] Slow ViennaCL performance on KSP ex12

Mani Chandra mc0710 at gmail.com
Mon Oct 12 18:56:05 CDT 2015


Here is the source file:
https://github.com/AFD-Illinois/grim/blob/opencl/computeresidual.cl

Attached is the assembly code "assembly_code.asm" generated using:

ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu'
-asm=assembly_code

Cheers,
Mani

Caution: The source code in the opencl branch of
https://github.com/AFD-Illinois/grim is not very clean.

On Mon, Oct 12, 2015 at 4:28 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Mon, Oct 12, 2015 at 3:25 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>
>> Here is the code: http://github.com/afd-illinois/grim, branch:opencl.
>>
>> Now using
>>
>> Kernel Builder for OpenCL API - compiler command line, version 1.4.0.134
>> Copyright (C) 2014 Intel Corporation.  All rights reserved.
>>
>>
>> manic at bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl
>> -bo='-DOPENCL'
>> -device='cpu'
>>
>> No command specified, using 'build' as default
>>
>
> Sorry if I am being obtuse, but I cannot find that source file in the repo
> above. Can you give the direct link to the file?
>
>
>> Using build options: -DOPENCL
>> Setting target instruction set architecture to: Default (Advanced Vector
>> Extension (AVX))
>> OpenCL Intel CPU device was found!
>> Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
>> Device version: OpenCL 1.2 (Build 44)
>> Device vendor: Intel(R) Corporation
>> Device profile: FULL_PROFILE
>> Compilation started
>> Compilation done
>> Linking started
>> Linking done
>> Device build started
>> Device build done
>> Kernel <ComputeResidual> was successfully vectorized
>> Done.
>> Build succeeded!
>>
>
> It definitely says it vectorized, but what code did it generate? Can you
> post the object file, since I do not have the compiler? I have seen that
> message with really bad code before.
>
>   Thanks,
>
>     Matt
>
>
>> Cheers,
>> Mani
>>
>> On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>>>>
>>>> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <knepley at gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com>
>>>>> wrote:
>>>>> > Hi Karl,
>>>>> >
>>>>> > My motivation was to avoid duplicating code for the CPU and the GPU.
>>>>> This is important considering that it takes a long time to test and make
>>>>> sure the code produces the right results.
>>>>> >
>>>>> > I guess I can add a switch in my code, something like:
>>>>> >
>>>>> > if (usingCPU) use VecGetArray()
>>>>> >
>>>>> > else if (usingGPU) use VecViennaCLGetArray()
>>>>> >
>>>>> > and then wrap the pointers that the above functions return in OpenCL
>>>>> buffers created with the appropriate memory flags (CL_MEM_USE_HOST_PTR for
>>>>> CPU and CL_ALLOC_.. for GPU)
>>>>> >
>>>>> > Hopefully, this will avoid unnecessary data transfers.
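
(For concreteness, the CPU branch of the switch described above might look
roughly like the sketch below. This is only a sketch: the function name
WrapVecForOpenCLOnCPU is made up, the cl_context is assumed to exist already,
error handling is stripped, and whether CL_MEM_USE_HOST_PTR really avoids a
copy is up to the OpenCL runtime and the pointer's alignment.)

  #include <petscvec.h>
  #include <CL/cl.h>

  /* Sketch only: expose a Vec's host array to an OpenCL kernel by wrapping
     it with CL_MEM_USE_HOST_PTR instead of copying it into a device buffer. */
  PetscErrorCode WrapVecForOpenCLOnCPU(Vec v, cl_context ctx, cl_mem *buf)
  {
    PetscScalar    *array;
    PetscInt       n;
    cl_int         clerr;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = VecGetLocalSize(v,&n);CHKERRQ(ierr);
    ierr = VecGetArray(v,&array);CHKERRQ(ierr);
    /* The spec only permits, and does not guarantee, zero-copy behaviour here. */
    *buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          (size_t)n*sizeof(PetscScalar), array, &clerr);
    /* ... enqueue the residual kernel on *buf, then release the buffer and
       call VecRestoreArray(v,&array) ... */
    PetscFunctionReturn(0);
  }
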
>>>>> >
>>>>> > I do not understand this comment at all. This looks crazy to me. The
>>>>> whole point of having Vec
>>>>> > is so that no one ever ever ever ever does anything like this. I saw
>>>>> nothing in the thread that would
>>>>> > compel you to do this. What are you trying to accomplish with this
>>>>> switch?
>>>>>
>>>>>
>>>> I'm trying to assemble the residual needed for SNES using an OpenCL
>>>> kernel. The kernel operates on OpenCL buffers which can either live on the
>>>> CPU or the GPU.
>>>>
>>>> I think it is useful to use OpenCL on the CPU, basically because of
>>>> vectorization and the vector data types. If I had to write plain C code, I'd
>>>> have to use all sorts of pragmas in icc to get the code to vectorize, and
>>>> even then it's pretty hard.
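
(To illustrate what is meant by vector data types here: in OpenCL C the
4-wide arithmetic can be written out explicitly with double4, so it does not
depend on the autovectorizer. The kernel below is purely illustrative and is
not taken from computeresidual.cl.)

  #pragma OPENCL EXTENSION cl_khr_fp64 : enable

  /* Illustrative only: each work-item updates four residual entries at once;
     the compiler can map the double4 operations directly onto AVX lanes. */
  __kernel void residual4(__global const double4 *x,
                          __global const double4 *b,
                          __global double4       *res)
  {
      size_t i = get_global_id(0);
      res[i] = b[i] - 2.0*x[i];
  }
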
>>>>
>>>
>>> I would completely agree with you, if I thought the compiler actually
>>> vectorized that code. I do not think that
>>> is the case. Is there an example you have where you get vectorized
>>> assembly?
>>>
>>>   Thanks,
>>>
>>>     Matt
>>>
>>>
>>>> Mani
>>>>
>>>>
>>>>>   Matt,
>>>>>
>>>>>      The current OpenCL code in PETSc is hardwired for GPU usage. So
>>>>> the correct fix, I believe, is to add to the VecViennaCL wrappers support
>>>>> for either using the GPU or the CPU.
>>>>>
>>>>>    Barry
>>>>>
>>>>> >
>>>>> >   Matt
>>>>> >
>>>>> > Cheers,
>>>>> > Mani
>>>>> >
>>>>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at>
>>>>> wrote:
>>>>> > Hi Mani,
>>>>> >
>>>>> > > Following
>>>>> http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>>>> > (page 16), I ran KSP ex12 for two cases:
>>>>> >
>>>>> > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>>> >
>>>>> > real    0m0.213s
>>>>> > user    0m0.206s
>>>>> > sys     0m0.004s
>>>>> >
>>>>> > 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>>>> > -log_summary > log_summary_with_viennacl
>>>>> >
>>>>> > real    0m20.296s
>>>>> > user    0m46.025s
>>>>> > sys     0m1.435s
>>>>> >
>>>>> > The runs have been performed on a CPU : AMD A10-5800K, with OpenCL
>>>>> from
>>>>> > AMD-APP-SDK-v3.0.
>>>>> >
>>>>> > there are a couple of things to note here:
>>>>> >
>>>>> > a) The total execution time contains the OpenCL kernel compilation
>>>>> time, which is on the order of one or two seconds. Thus, you need much
>>>>> larger problem sizes to get a good comparison.
>>>>> >
>>>>> > b) Most of the execution time is spent on VecMDot, which is
>>>>> optimized for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL
>>>>> backend because one can use just plain C/C++/whatever).
>>>>> >
>>>>> > c) My experiences with this AMD APU are quite mixed, as I've never
>>>>> found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU
>>>>> part. The integrated GPU, however, reached 80% without much effort. This is
>>>>> particularly remarkable as both CPU and GPU share the same DDR3 memory
>>>>> link. Thus, it is very unlikely that you will ever beat the
>>>>> performance of PETSc's native types.
>>>>> >
>>>>> >
>>>>> >
>>>>> > Attached are:
>>>>> > 1) configure.log for the petsc build
>>>>> > 2) log summary without viennacl
>>>>> > 3) log summary with viennacl
>>>>> > 4) OpenCL info for the system on which the runs were performed
>>>>> >
>>>>> > Perhaps the reason for the slow performance is superfluous copies
>>>>> > being performed, which need not occur when running ViennaCL on the CPU.
>>>>> > Looking at
>>>>> >
>>>>> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx
>>>>> :
>>>>> >
>>>>> > /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
>>>>> > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>>>> > {
>>>>> >   PetscErrorCode ierr;
>>>>> >
>>>>> >   PetscFunctionBegin;
>>>>> >   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>>> >   if (v->map->n > 0) {
>>>>> >     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>>> >       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>> >       try {
>>>>> >         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>>> >         viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>>>>> >         ViennaCLWaitForGPU();
>>>>> >       } catch(std::exception const & ex) {
>>>>> >         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>>>> >       }
>>>>> >       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>> >       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>>> >     }
>>>>> >   }
>>>>> >   PetscFunctionReturn(0);
>>>>> > }
>>>>> >
>>>>> > When running ViennaCL with OpenCL on the CPU, should the above
>>>>> > function perhaps be modified?
>>>>> >
>>>>> > Unfortunately that is quite hard: OpenCL manages its own memory
>>>>> handles, so 'injecting' memory into an OpenCL kernel that is not allocated
>>>>> by the OpenCL runtime is not recommended, fairly tricky, and still involves
>>>>> some overhead. As I see no reason to run OpenCL on a CPU, I refrained from
>>>>> adding this extra code complexity.
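
(The remaining overhead mentioned here is largely a consequence of OpenCL's
coherence rules: even for a buffer created with CL_MEM_USE_HOST_PTR, the host
is only guaranteed to see a kernel's writes after mapping the buffer, so every
host-side access still pays for a map/unmap round trip. A minimal sketch,
assuming 'queue', 'buf', and 'nbytes' already exist and with error handling
omitted:)

  #include <CL/cl.h>

  /* Sketch only: coherent host access to a CL_MEM_USE_HOST_PTR buffer still
     goes through map/unmap, which is per-access overhead even though the
     data never leaves host memory. */
  static void read_back(cl_command_queue queue, cl_mem buf, size_t nbytes)
  {
    cl_int err;
    /* Blocking map: makes the kernel's writes visible to the host. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                 0, nbytes, 0, NULL, NULL, &err);
    /* ... read the residual through p ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
  }
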
>>>>> >
>>>>> > Overall, I recommend rerunning the benchmark on more powerful
>>>>> discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>>>>> performance benefits.
>>>>> >
>>>>> > Hope this sheds some light on things :-)
>>>>> >
>>>>> > Best regards,
>>>>> > Karli
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which their
>>>>> experiments lead.
>>>>> > -- Norbert Wiener
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: assembly_code.asm
Type: application/octet-stream
Size: 3101678 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20151012/cef494e5/attachment.obj>

