[petsc-dev] Slow ViennaCL performance on KSP ex12

Mon Oct 12 14:41:34 CDT 2015

On Mon, Oct 12, 2015 at 2:36 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> > On Oct 12, 2015, at 2:29 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >
> > On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com> wrote:
> > Hi Karl,
> >
> > My motivation was to avoid duplicating code for the CPU and the GPU.
> > This is important considering that it takes a long time to test and make
> > sure the code produces the right results.
> >
> > I guess I can add a switch in my code with something like:
> >
> > if (usingCPU) use VecGetArray()
> >
> > else if (usingGPU) use VecViennaCLGetArray()
> >
> > and then wrap the pointers that the above functions return in OpenCL
> > buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and
> > CL_ALLOC_.. for GPU).
> >
> > Hopefully, this will avoid unnecessary data transfers.
> >
> > I do not understand this comment at all. This looks crazy to me. The
> > whole point of having Vec is so that no one ever ever ever ever does
> > anything like this. I saw nothing in the thread that would compel you
> > to do this. What are you trying to accomplish with this switch?
>
>   Matt,
>
>      The current OpenCL code in PETSc is hardwired for GPU usage. So the
> correct fix, I believe, is to add to the VecViennaCL wrappers support for
> either using the GPU or the CPU.
>

Yes, that is an option.

I thought the upshot of Karl's mail was that while this is possible, OpenCL
CPU performance is woeful and unlikely to improve, and a better option is
to use the current code with multiple MPI processes and the PETSc type
mechanism.

  Matt

>    Barry
>
> >
> >   Matt
> >
> > Cheers,
> > Mani
> >
> > On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
> > Hi Mani,
> >
> > > Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
> > (page 16), I ran KSP ex12 for two cases:
> >
> > 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
> >
> > real    0m0.213s
> > user    0m0.206s
> > sys     0m0.004s
> >
> > 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl -log_summary > log_summary_with_viennacl
> >
> > real    0m20.296s
> > user    0m46.025s
> > sys     0m1.435s
> >
> > The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from
> > AMD-APP-SDK-v3.0.
> >
> > There are a couple of things to note here:
> >
> > a) The total execution time contains the OpenCL kernel compilation time,
> > which is on the order of one or two seconds. Thus, you need much larger
> > problem sizes to get a good comparison.
> >
> > b) Most of the execution time is spent on VecMDot, which is optimized
> > for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
> > because one can use just plain C/C++/whatever).
> >
> > c) My experiences with this AMD APU are quite mixed, as I've never found
> > a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part.
> > The integrated GPU, however, reached 80% without much effort. This is
> > particularly remarkable as both CPU and GPU share the same DDR3 memory
> > link. Thus, it is highly unlikely that you will ever beat the
> > performance of PETSc's native types.
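For readers unfamiliar with the "percentage of STREAM bandwidth" figures in point c), the following is a minimal sketch of a triad-style bandwidth measurement in plain C++. It is not the official STREAM benchmark and is not code from this thread; the names `run_triad`, `TriadResult`, and `fraction_of_reference` are hypothetical, and a single untimed-warmup-free pass like this will give only a rough number.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Result of one triad pass (hypothetical helper type for this sketch).
struct TriadResult {
  double seconds;       // wall-clock time of the timed loop
  double gbytes_per_s;  // bytes moved / seconds / 1e9
  double checksum;      // sum of the output array, so the loop cannot be optimized away
};

// Runs a[i] = b[i] + s * c[i] over n doubles and reports the achieved bandwidth.
// The triad reads two arrays and writes one, so it is counted as
// 3 * n * sizeof(double) bytes of memory traffic.
TriadResult run_triad(std::size_t n, double s) {
  assert(n > 0);
  std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + s * c[i];
  const auto t1 = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  double checksum = 0.0;
  for (std::size_t i = 0; i < n; ++i) checksum += a[i];

  const double bytes = 3.0 * static_cast<double>(n) * sizeof(double);
  const double gbs = seconds > 0.0 ? bytes / seconds / 1e9 : 0.0;
  return TriadResult{seconds, gbs, checksum};
}

// Expresses a measured bandwidth as a fraction of a reference bandwidth,
// e.g. an OpenCL result relative to a native STREAM result.
double fraction_of_reference(double measured, double reference) {
  assert(reference > 0.0);
  return measured / reference;
}
```

Calling `run_triad` with an array much larger than the last-level cache and passing its `gbytes_per_s` to `fraction_of_reference` together with a native baseline would give a ratio comparable in spirit to the 45% and 80% figures quoted above.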
> >
> >
> >
> > Attached are:
> > 1) configure.log for the petsc build
> > 2) log summary without viennacl
> > 3) log summary with viennacl
> > 4) OpenCL info for the system on which the runs were performed
> >
> > Perhaps the slow performance is caused by superfluous copies being
> > performed, which need not occur when running ViennaCL on the CPU.
> > Looking at
> >
> > http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx :
> >
> > /* Copies a vector from the CPU to the GPU unless we already have an
> >    up-to-date copy on the GPU */
> > PetscErrorCode VecViennaCLCopyToGPU(Vec v)
> > {
> >    PetscErrorCode ierr;
> >
> >    PetscFunctionBegin;
> >    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
> >    if (v->map->n > 0) {
> >      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
> >        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
> >        try {
> >          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
> >          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
> >          ViennaCLWaitForGPU();
> >        } catch(std::exception const & ex) {
> >          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
> >        }
> >        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
> >        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
> >      }
> >    }
> >    PetscFunctionReturn(0);
> > }
> >
> > When running ViennaCL with OpenCL on the CPU, the above function should
> > maybe be modified?
> >
> > Unfortunately that is quite hard: OpenCL manages its own memory handles,
> > so 'injecting' memory into an OpenCL kernel that is not allocated by the
> > OpenCL runtime is not recommended, fairly tricky, and still involves some
> > overhead. As I see no reason to run OpenCL on a CPU, I refrained from
> > adding this extra code complexity.
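The dirty-flag logic of the quoted VecViennaCLCopyToGPU() can be sketched in isolation, without PETSc or OpenCL. This is only an illustration under stated assumptions: `MirroredVector`, `Valid`, and all member names are hypothetical, and the `device` array is a plain std::vector standing in for a buffer owned by the OpenCL runtime. It shows why repeated calls do not copy again, and why every host-side write still forces one real copy into runtime-owned memory, even when the "device" is the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Which copy of the data is current (hypothetical names, loosely modeled on
// the PETSC_VIENNACL_CPU / PETSC_VIENNACL_BOTH flags in the quoted function).
enum class Valid { Host, Device, Both };

struct MirroredVector {
  std::vector<double> host;    // application-visible array
  std::vector<double> device;  // stand-in for a runtime-owned OpenCL buffer
  Valid valid = Valid::Host;
  std::size_t uploads = 0;     // counts host-to-device copies actually performed

  explicit MirroredVector(std::size_t n) : host(n, 0.0), device(n, 0.0) {}

  // Copies host -> device only if the host holds the sole valid copy.
  void copy_to_device() {
    if (host.empty()) return;
    if (valid == Valid::Host) {
      std::copy(host.begin(), host.end(), device.begin());
      ++uploads;
      valid = Valid::Both;
    }
  }

  // Copies device -> host only if the device holds the sole valid copy.
  void copy_to_host() {
    if (host.empty()) return;
    if (valid == Valid::Device) {
      std::copy(device.begin(), device.end(), host.begin());
      valid = Valid::Both;
    }
  }

  // Host access for writing: afterwards only the host copy is valid.
  double* host_write() {
    copy_to_host();
    valid = Valid::Host;
    return host.data();
  }

  // Device access for reading: triggers an upload if needed.
  const double* device_read() {
    copy_to_device();
    assert(host.empty() || valid != Valid::Host);
    return device.data();
  }
};
```

In this sketch, calling `copy_to_device()` twice in a row performs one copy, while each `host_write()` followed by `device_read()` performs another. Avoiding that copy on a CPU device would require the runtime buffer to alias the host array, which is the "injecting" of memory described above as tricky.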
> >
> > Overall, I recommend rerunning the benchmark on more powerful discrete
> > GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
> > performance benefits.
> >
> > Hope this sheds some light on things :-)
> >
> > Best regards,
> > Karli
> >
> >
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which their
> > experiments lead.
> > -- Norbert Wiener
>
>
