[petsc-dev] Slow ViennaCL performance on KSP ex12

Mon Oct 12 14:29:27 CDT 2015

On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <mc0710 at gmail.com> wrote:

> Hi Karl,
>
> My motivation was to avoid duplicating code for the CPU and the GPU. This
> is important considering that it takes a long time to test and make sure
> the code produces the right results.
>
> I guess I can add a switch in my code with something like:
>
> if (usingCPU) use VecGetArray()
>
> else if (usingGPU) use VecViennaCLGetArray()
>
> and then wrap the pointers that the above functions return with OpenCL
> buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and
> CL_ALLOC_.. for GPU).
>
> Hopefully, this will avoid unnecessary data transfers.
>

I do not understand this comment at all. This looks crazy to me. The whole
point of having Vec is so that no one ever ever ever ever does anything like
this. I saw nothing in the thread that would compel you to do this. What are
you trying to accomplish with this switch?
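
To make the point about Vec concrete, here is a minimal sketch in plain C of
the idea: the backend is picked once when the object is created, and the
application code calls one function without any usingCPU/usingGPU switch.
This is not PETSc code. MiniVec, MiniVecOps, mini_axpy and the two backends
are hypothetical names made up for illustration, and the "device" backend is
only a stand-in that runs on the host. In PETSc itself the choice is made at
runtime with options such as -vec_type viennacl, as in the runs quoted below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a vector object with a per-backend op table. */
typedef struct MiniVec MiniVec;

typedef struct {
  const char *name;
  /* y <- y + alpha * x */
  void (*axpy)(MiniVec *y, double alpha, const MiniVec *x);
} MiniVecOps;

struct MiniVec {
  const MiniVecOps *ops; /* selected once, at creation time */
  size_t            n;
  double           *data;
};

/* Plain host implementation. */
static void axpy_host(MiniVec *y, double alpha, const MiniVec *x)
{
  for (size_t i = 0; i < y->n; ++i) y->data[i] += alpha * x->data[i];
}

/* Stand-in for an accelerator backend. A real one would launch a kernel;
   here it runs the same loop on the host so the sketch stays self-contained. */
static void axpy_device_standin(MiniVec *y, double alpha, const MiniVec *x)
{
  for (size_t i = 0; i < y->n; ++i) y->data[i] += alpha * x->data[i];
}

static const MiniVecOps HOST_OPS   = {"host", axpy_host};
static const MiniVecOps DEVICE_OPS = {"device-standin", axpy_device_standin};

/* Application-facing call: dispatches through the op table of y. */
void mini_axpy(MiniVec *y, double alpha, const MiniVec *x)
{
  assert(y->n == x->n);
  y->ops->axpy(y, alpha, x);
}
```

A caller builds a MiniVec with either &HOST_OPS or &DEVICE_OPS and then only
ever calls mini_axpy(). The numerical code is written and tested once, which
is the duplication concern raised above.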

  Matt

> Cheers,
> Mani
>
> On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>
>> Hi Mani,
>>
>>> Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>>> (page 16), I ran KSP ex12 for two cases:
>>>
>>> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>>
>>> real    0m0.213s
>>> user    0m0.206s
>>> sys     0m0.004s
>>>
>>> 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>>> -log_summary > log_summary_with_viennacl
>>>
>>> real    0m20.296s
>>> user    0m46.025s
>>> sys     0m1.435s
>>>
>>> The runs have been performed on a CPU: AMD A10-5800K, with OpenCL from
>>> AMD-APP-SDK-v3.0.
>>>
>>
>> there are a couple of things to note here:
>>
>> a) The total execution time contains the OpenCL kernel compilation time,
>> which is on the order of one or two seconds. Thus, you need much larger
>> problem sizes to get a good comparison.
>>
>> b) Most of the execution time is spent on VecMDot, which is optimized for
>> GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
>> because one can use just plain C/C++/whatever).
>>
>> c) My experiences with this AMD APU are quite mixed, as I've never found
>> a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part.
>> The integrated GPU, however, reached 80% without much effort. This is
>> particularly remarkable as both CPU and GPU share the same DDR3 memory
>> link. Thus, it is highly unlikely that you will ever beat the
>> performance of PETSc's native types.
>>
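
Points a) and c) above can be put into a rough cost model: a fixed startup
cost (kernel compilation) plus the bytes moved divided by the bandwidth
actually achieved. The sketch below is only that, a model under stated
assumptions. The function names and every number passed to them are
hypothetical; only the 45% and 80% efficiencies and the "one or two seconds"
of compilation come from the message above.

```c
#include <assert.h>

/* Modeled run time of a memory-bound operation:
   fixed startup cost + bytes moved / (peak bandwidth * achieved fraction). */
double modeled_seconds(double startup_s, double bytes,
                       double peak_bytes_per_s, double efficiency)
{
  assert(peak_bytes_per_s > 0.0);
  assert(efficiency > 0.0 && efficiency <= 1.0);
  return startup_s + bytes / (peak_bytes_per_s * efficiency);
}

/* Share of the total run time that is pure startup overhead. */
double startup_fraction(double startup_s, double total_s)
{
  assert(total_s > 0.0);
  assert(startup_s >= 0.0 && startup_s <= total_s);
  return startup_s / total_s;
}
```

For the same bytes and peak bandwidth, modeled_seconds() with efficiency 0.45
is always larger than with 0.80. With a startup cost of 1 to 2 s, a problem
whose native run takes 0.213 s in total is dominated by startup, which is why
much larger problem sizes are needed for a fair comparison. The model does not
try to explain the full 20 s of the ViennaCL run, which is attributed above to
VecMDot.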
>>
>>
>>> Attached are:
>>> 1) configure.log for the petsc build
>>> 2) log summary without viennacl
>>> 3) log summary with viennacl
>>> 4) OpenCL info for the system on which the runs were performed
>>>
>>> Perhaps the slow performance is caused by superfluous copies, which
>>> need not occur when running ViennaCL on the CPU.
>>> Looking at
>>>
>>> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx
>>> :
>>>
>>> /* Copies a vector from the CPU to the GPU unless we already have an
>>> up-to-date copy on the GPU */
>>> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>>> {
>>>    PetscErrorCode ierr;
>>>
>>>    PetscFunctionBegin;
>>>    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>>    if (v->map->n > 0) {
>>>      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>>        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>        try {
>>>          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>>          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>>>          ViennaCLWaitForGPU();
>>>        } catch(std::exception const & ex) {
>>>          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>>        }
>>>        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>>        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>>      }
>>>    }
>>>    PetscFunctionReturn(0);
>>> }
>>>
>>> When running ViennaCL with OpenCL on the CPU, the above function should
>>> maybe be modified?
>>>
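
The pattern in the quoted function is a lazy mirror: copy host data to the
device only when the host copy is the newer one, then mark both as valid. A
stripped-down sketch in plain C, with no PETSc or ViennaCL involved, may make
the flag logic easier to follow. MirroredVec, ValidFlag and both functions are
hypothetical names, the "device" buffer is ordinary memory, and the transfer
counter exists only so the behavior can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which copy of the data is currently up to date. */
typedef enum { VALID_HOST, VALID_DEVICE, VALID_BOTH } ValidFlag;

typedef struct {
  size_t    n;
  double   *host;
  double   *device;           /* stand-in for a device buffer */
  ValidFlag valid;
  int       copies_to_device; /* number of transfers actually performed */
} MirroredVec;

/* Copy host -> device unless the device already holds current data. */
void copy_to_device(MirroredVec *v)
{
  assert(v->host != NULL && v->device != NULL);
  if (v->n > 0 && v->valid == VALID_HOST) {
    memcpy(v->device, v->host, v->n * sizeof(double));
    v->copies_to_device++;
    v->valid = VALID_BOTH;
  }
}

/* Hand out the host array for writing; the device copy becomes stale.
   This sketch has no device -> host copy, so it requires current host data. */
double *host_write_access(MirroredVec *v)
{
  assert(v->valid != VALID_DEVICE);
  v->valid = VALID_HOST;
  return v->host;
}
```

Calling copy_to_device() twice in a row transfers once; a transfer happens
again only after host_write_access() has invalidated the device copy. When the
OpenCL device is the CPU itself, the memcpy here corresponds to the copy that
is suspected above to be superfluous.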
>>
>> Unfortunately that is quite hard: OpenCL manages its own memory handles,
>> so 'injecting' memory into an OpenCL kernel that is not allocated by the
>> OpenCL runtime is not recommended, fairly tricky, and still involves some
>> overhead. As I see no reason to run OpenCL on a CPU, I refrained from
>> adding this extra code complexity.
>>
>> Overall, I recommend rerunning the benchmark on more powerful discrete
>> GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
>> performance benefits.
>>
>> Hope this sheds some light on things :-)
>>
>> Best regards,
>> Karli
>>
>>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener