[petsc-dev] Slow ViennaCL performance on KSP ex12

Matthew Knepley knepley at gmail.com
Mon Oct 12 19:11:35 CDT 2015


On Mon, Oct 12, 2015 at 6:56 PM, Mani Chandra <mc0710 at gmail.com> wrote:

> Here is the source file:
> https://github.com/AFD-Illinois/grim/blob/opencl/computeresidual.cl
>
> Attached is the assembly code "assembly_code.asm" generated using:
>
> ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu'
> -asm=assembly_code
>

The kernel is beyond enormous and I unfortunately cannot make any sense of
it.

  Thanks,

     Matt


> Cheers,
> Mani
>
> Caution: The source code in the opencl branch of
> https://github.com/AFD-Illinois/grim is not very clean.
>
> On Mon, Oct 12, 2015 at 4:28 PM, Matthew Knepley <knepley at gmail.com>
> wrote:
>
>> On Mon, Oct 12, 2015 at 3:25 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>>
>>> Here is the code: http://github.com/afd-illinois/grim, branch:opencl.
>>>
>>> Now using
>>>
>>> Kernel Builder for OpenCL API - compiler command line, version 1.4.0.134
>>> Copyright (C) 2014 Intel Corporation.  All rights reserved.
>>>
>>>
>>> manic at bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl
>>> -bo='-DOPENCL'
>>> -device='cpu'
>>>
>>> No command specified, using 'build' as default
>>>
>>
>> Sorry if I am being obtuse, but I cannot find that source file in the
>> repo above. Can you give the direct link to the file?
>>
>>
>>> Using build options: -DOPENCL
>>> Setting target instruction set architecture to: Default (Advanced Vector
>>> Extension (AVX))
>>> OpenCL Intel CPU device was found!
>>> Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
>>> Device version: OpenCL 1.2 (Build 44)
>>> Device vendor: Intel(R) Corporation
>>> Device profile: FULL_PROFILE
>>> Compilation started
>>> Compilation done
>>> Linking started
>>> Linking done
>>> Device build started
>>> Device build done
>>> Kernel <ComputeResidual> was successfully vectorized
>>> Done.
>>> Build succeeded!
>>>
>>
>> It definitely says it vectorized, but what code did it generate? Can you
>> post the object file, since I do not have the compiler? I have seen that
>> message with really bad code before.
>>
>>   Thanks,
>>
>>     Matt
>>
>>
>>> Cheers,
>>> Mani
>>>
>>> On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>>>>>
>>>>> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <knepley at gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com>
>>>>>> wrote:
>>>>>> > Hi Karl,
>>>>>> >
>>>>>> > My motivation was to avoid duplicating code for the CPU and the
>>>>>> GPU. This is important considering that it takes a long time to test and
>>>>>> make sure the code produces the right results.
>>>>>> >
>>>>>> > I guess, I can add a switch in my code with something like:
>>>>>> >
>>>>>> > if (usingCPU) use VecGetArray()
>>>>>> >
>>>>>> > else if (usingGPU) use VecViennaCLGetArray()
>>>>>> >
>>>>>> > and then wrap the pointers that the above functions return in OpenCL
>>>>>> > buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU
>>>>>> > and CL_ALLOC_.. for GPU)
>>>>>> >
>>>>>> > Hopefully, this will avoid unnecessary data transfers.
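
[A rough sketch of the switch described above. The function name
WrapVecInCLBuffer and the usingCPU flag are hypothetical, an already-created
cl_context is assumed, and the GPU branch is simplified to a plain
host-to-device copy instead of the VecViennaCLGetArray() route:]

#include <petscvec.h>
#include <CL/cl.h>

PetscErrorCode WrapVecInCLBuffer(Vec v, cl_context ctx, PetscBool usingCPU, cl_mem *buf)
{
  PetscErrorCode ierr;
  PetscScalar    *array;
  PetscInt       n;
  cl_int         clerr;

  PetscFunctionBegin;
  ierr = VecGetLocalSize(v,&n);CHKERRQ(ierr);
  ierr = VecGetArray(v,&array);CHKERRQ(ierr);
  if (usingCPU) {
    /* Zero-copy: the buffer aliases the Vec's own host memory, so the host
       pointer must remain valid for as long as the buffer is in use. */
    *buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          n*sizeof(PetscScalar), array, &clerr);
  } else {
    /* Simplified GPU path: let the OpenCL runtime allocate device memory
       and copy the host data in. */
    *buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          n*sizeof(PetscScalar), array, &clerr);
  }
  ierr = VecRestoreArray(v,&array);CHKERRQ(ierr);
  if (clerr != CL_SUCCESS) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"clCreateBuffer failed");
  PetscFunctionReturn(0);
}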
>>>>>> >
>>>>>> > I do not understand this comment at all. This looks crazy to me.
>>>>>> The whole point of having Vec
>>>>>> > is so that no one ever ever ever ever does anything like this. I
>>>>>> saw nothing in the thread that would
>>>>>> > compel you to do this. What are you trying to accomplish with this
>>>>>> switch?
>>>>>>
>>>>>>
>>>>> I'm trying to assemble the residual needed for SNES using an OpenCL
>>>>> kernel. The kernel operates on OpenCL buffers which can either live on the
>>>>> CPU or the GPU.
>>>>>
>>>>> I think it is useful to use OpenCL on the CPU basically because of
>>>>> vectorization and vector data types. If I had to write usual C code, I'd
>>>>> have to use all sorts of pragmas in icc to get the code to vectorize, and
>>>>> even then it's pretty hard.
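
[As a generic illustration of that point (not taken from computeresidual.cl):
OpenCL's vector data types let the kernel state the SIMD width explicitly,
whereas plain C leaves it to the compiler or to pragmas.]

/* Generic OpenCL C sketch using a vector type; names are illustrative. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void axpy_double4(const double alpha,
                           __global const double4 *x,
                           __global double4 *y)
{
    size_t i = get_global_id(0);   /* each work-item handles four doubles */
    y[i] = alpha * x[i] + y[i];    /* maps directly onto AVX lanes on the CPU */
}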
>>>>>
>>>>
>>>> I would completely agree with you if I thought the compiler actually
>>>> vectorized that code. I do not think that is the case. Do you have an
>>>> example where you get vectorized assembly?
>>>>
>>>>   Thanks,
>>>>
>>>>     Matt
>>>>
>>>>
>>>>> Mani
>>>>>
>>>>>
>>>>>>   Matt,
>>>>>>
>>>>>>      The current OpenCL code in PETSc is hardwired for GPU usage. So the
>>>>>> correct fix, I believe, is to add support to the VecViennaCL wrappers for
>>>>>> using either the GPU or the CPU.
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>> >
>>>>>> >   Matt
>>>>>> >
>>>>>> > Cheers,
>>>>>> > Mani
>>>>>> >
>>>>>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at>
>>>>>> wrote:
>>>>>> > Hi Mani,
>>>>>> >
>>>>>> > > Following
>>>>>> http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>>>>> > (page 16), I ran KSP ex12 for two cases:
>>>>>> >
>>>>>> > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>>>> >
>>>>>> > real    0m0.213s
>>>>>> > user    0m0.206s
>>>>>> > sys     0m0.004s
>>>>>> >
>>>>>> > 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>>>>> > -log_summary > log_summary_with_viennacl
>>>>>> >
>>>>>> > real    0m20.296s
>>>>>> > user    0m46.025s
>>>>>> > sys     0m1.435s
>>>>>> >
>>>>>> > The runs have been performed on a CPU : AMD A10-5800K, with OpenCL
>>>>>> from
>>>>>> > AMD-APP-SDK-v3.0.
>>>>>> >
>>>>>> > there are a couple of things to note here:
>>>>>> >
>>>>>> > a) The total execution time contains the OpenCL kernel compilation
>>>>>> time, which is on the order of one or two seconds. Thus, you need much
>>>>>> larger problem sizes to get a good comparison [see the example runs after
>>>>>> this list].
>>>>>> >
>>>>>> > b) Most of the execution time is spent on VecMDot, which is
>>>>>> optimized for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL
>>>>>> backend because one can use just plain C/C++/whatever).
>>>>>> >
>>>>>> > c) My experiences with this AMD APU are quite mixed, as I've never
>>>>>> found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU
>>>>>> part. The integrated GPU, however, reached 80% without much effort. This is
>>>>>> particularly remarkable as both CPU and GPU share the same DDR3 memory
>>>>>> link. Thus, it is quite unlikely that you will ever beat the performance
>>>>>> of PETSc's native types.
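
[For instance, with sizes chosen arbitrarily here, just large enough to
amortize the one-to-two-second kernel compilation mentioned in point a):]

time ./ex12 -m 2000 -n 2000 -log_summary > log_summary_no_viennacl
time ./ex12 -m 2000 -n 2000 -vec_type viennacl -mat_type aijviennacl -log_summary > log_summary_with_viennacl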
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Attached are:
>>>>>> > 1) configure.log for the petsc build
>>>>>> > 2) log summary without viennacl
>>>>>> > 3) log summary with viennacl
>>>>>> > 4) OpenCL info for the system on which the runs were performed
>>>>>> >
>>>>>> > Perhaps the reason for the slow performance is superfluous copies being
>>>>>> > performed, which need not occur when running ViennaCL on the CPU.
>>>>>> > Looking at
>>>>>> > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>>>>>> >
>>>>>> > /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
>>>>>> > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>>>>> > {
>>>>>> >    PetscErrorCode ierr;
>>>>>> >
>>>>>> >    PetscFunctionBegin;
>>>>>> >    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>>>> >    if (v->map->n > 0) {
>>>>>> >      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>>>> >        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>>> >        try {
>>>>>> >          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>>>> >          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>>>>>> >          ViennaCLWaitForGPU();
>>>>>> >        } catch(std::exception const & ex) {
>>>>>> >          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>>>>> >        }
>>>>>> >        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>>>> >        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>>>> >      }
>>>>>> >    }
>>>>>> >    PetscFunctionReturn(0);
>>>>>> > }
>>>>>> >
>>>>>> > When running ViennaCL with OpenCL on the CPU, should the above function
>>>>>> > perhaps be modified?
>>>>>> >
>>>>>> > Unfortunately that is quite hard: OpenCL manages its own memory handles,
>>>>>> > so 'injecting' memory that was not allocated by the OpenCL runtime into an
>>>>>> > OpenCL kernel is not recommended, fairly tricky, and still involves some
>>>>>> > overhead. As I see no reason to run OpenCL on a CPU, I refrained from
>>>>>> > adding this extra code complexity.
>>>>>> >
>>>>>> > Overall, I recommend rerunning the benchmark on more powerful
>>>>>> discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>>>>>> performance benefits.
>>>>>> >
>>>>>> > Hope this sheds some light on things :-)
>>>>>> >
>>>>>> > Best regards,
>>>>>> > Karli
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>> experiments lead.
>>>>>> > -- Norbert Wiener
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener