[petsc-dev] Slow ViennaCL performance on KSP ex12

Mon Oct 12 14:52:54 CDT 2015

On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>
> On Mon, Oct 12, 2015 at 12:36 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> >
>> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com> wrote:
>> > Hi Karl,
>> >
>> > My motivation was to avoid duplicating code for the CPU and the GPU.
>> This is important considering that it takes a long time to test and make
>> sure the code produces the right results.
>> >
>> > I guess I can add a switch in my code with something like:
>> >
>> > if (usingCPU) use VecGetArray()
>> >
>> > else if (usingGPU) use VecViennaCLGetArray()
>> >
>> > and then wrap the pointers that the above functions return with OpenCL
>> buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and
>> CL_ALLOC_.. for GPU)
>> >
>> > Hopefully, this will avoid unnecessary data transfers.
>> >
>> > I do not understand this comment at all. This looks crazy to me. The
>> whole point of having Vec
>> > is so that no one ever ever ever ever does anything like this. I saw
>> nothing in the thread that would
>> > compel you to do this. What are you trying to accomplish with this
>> switch?
>>
>>
> I'm trying to assemble the residual needed for SNES using an OpenCL
> kernel. The kernel operates on OpenCL buffers which can either live on the
> CPU or the GPU.
>
> I think it is useful to use OpenCL on the CPU basically because of
> vectorization and vector data types. If I had to write usual C code, I'd
> have to use all sorts of pragmas in icc to get the code to vectorize and
> even then it's pretty hard.
>

I would completely agree with you, if I thought the compiler actually
vectorized that code. I do not think that
is the case. Is there an example you have where you get vectorized assembly?
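
For plain C, one way to check this is to ask the compiler to report which
loops it vectorized. A minimal sketch, assuming GCC or Clang; the function
name axpy_residual is hypothetical and is not taken from PETSc or from your
code:

```c
#include <stddef.h>

/* Hypothetical residual-style kernel: r[i] = a*x[i] + y[i].
   'restrict' promises the compiler that the three arrays do not alias,
   which removes one common obstacle to auto-vectorization in plain C. */
void axpy_residual(size_t n, double a,
                   const double *restrict x,
                   const double *restrict y,
                   double *restrict r)
{
  for (size_t i = 0; i < n; ++i) {
    r[i] = a * x[i] + y[i];
  }
}
```

Compiling with "gcc -O3 -fopt-info-vec-optimized -c file.c" or
"clang -O3 -Rpass=loop-vectorize -c file.c" should print a remark for each
loop that was vectorized, which can then be confirmed in the generated
assembly.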

  Thanks,

    Matt

> Mani
>
>
>>   Matt,
>>
>>      The current OpenCL code in PETSc is hardwired for GPU usage. So the
>> correct fix, I believe, is to add to the VecViennaCL wrappers support for
>> either using the GPU or the CPU.
>>
>>    Barry
>>
>> >
>> >   Matt
>> >
>> > Cheers,
>> > Mani
>> >
>> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at>
>> wrote:
>> > Hi Mani,
>> >
>> > > Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>> > (page 16), I ran KSP ex12 for two cases:
>> >
>> > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>> >
>> > real    0m0.213s
>> > user    0m0.206s
>> > sys     0m0.004s
>> >
>> > 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>> > -log_summary > log_summary_with_viennacl
>> >
>> > real    0m20.296s
>> > user    0m46.025s
>> > sys     0m1.435s
>> >
>> > The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from
>> > AMD-APP-SDK-v3.0.
>> >
>> > There are a couple of things to note here:
>> >
>> > a) The total execution time contains the OpenCL kernel compilation
>> time, which is on the order of one or two seconds. Thus, you need much
>> larger problem sizes to get a good comparison.
>> >
>> > b) Most of the execution time is spent on VecMDot, which is optimized
>> for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
>> because one can use just plain C/C++/whatever).
>> >
>> > c) My experiences with this AMD APU are quite mixed, as I've never
>> found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU
>> part. The integrated GPU, however, reached 80% without much effort. This is
>> particularly remarkable as both CPU and GPU share the same DDR3 memory
>> link. Thus, it is highly unlikely that you will ever beat the
>> performance of PETSc's native types.
>> >
>> >
>> >
>> > Attached are:
>> > 1) configure.log for the petsc build
>> > 2) log summary without viennacl
>> > 3) log summary with viennacl
>> > 4) OpenCL info for the system on which the runs were performed
>> >
>> > Perhaps the slow performance is caused by superfluous copies being
>> > performed, which need not occur when running ViennaCL on the CPU.
>> > Looking at
>> > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx :
>> >
>> > /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
>> > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>> > {
>> >    PetscErrorCode ierr;
>> >
>> >    PetscFunctionBegin;
>> >    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>> >    if (v->map->n > 0) {
>> >      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>> >        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>> >        try {
>> >          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>> >          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>> >          ViennaCLWaitForGPU();
>> >        } catch(std::exception const & ex) {
>> >          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>> >        }
>> >        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>> >        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>> >      }
>> >    }
>> >    PetscFunctionReturn(0);
>> > }
>> >
>> > When running ViennaCL with OpenCL on the CPU, the above function should
>> > maybe be modified?
>> >
>> > Unfortunately that is quite hard: OpenCL manages its own memory
>> handles, so 'injecting' memory into an OpenCL kernel that is not allocated
>> by the OpenCL runtime is not recommended, fairly tricky, and still involves
>> some overhead. As I see no reason to run OpenCL on a CPU, I refrained from
>> adding this extra code complexity.
>> >
>> > Overall, I recommend rerunning the benchmark on more powerful discrete
>> GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>> performance benefits.
>> >
>> > Hope this sheds some light on things :-)
>> >
>> > Best regards,
>> > Karli
>> >
>> >
>> >
>> >
>> >
>> > --
>> > What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> > -- Norbert Wiener
>>
>>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener