[petsc-dev] Slow ViennaCL performance on KSP ex12

Mani Chandra mc0710 at gmail.com
Mon Oct 12 14:13:59 CDT 2015


Hi Karl,

My motivation was to avoid duplicating code for the CPU and the GPU. This
is important considering that it takes a long time to test and make sure
the code produces the right results.

I guess I can add a switch in my code, something like:

if (usingCPU) use VecGetArray()

else if (usingGPU) use VecViennaCLGetArray()

and then wrap the pointers that the above functions return in OpenCL
buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and
CL_ALLOC_.. for GPU).

Hopefully, this will avoid unnecessary data transfers.
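
A minimal sketch of that switch, in plain C so it runs without PETSc or
OpenCL installed. VecHandle, Backend, acquire() and scale_vec() are
hypothetical names, and the two pointers only stand in for what
VecGetArray() and the ViennaCL-side accessor would return; the OpenCL
buffer wrapping is left out:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend tag; not a PETSc or ViennaCL type. */
typedef enum { BACKEND_CPU, BACKEND_GPU } Backend;

/* Hypothetical handle holding both storage locations of one vector. */
typedef struct {
  double *host_data;    /* stand-in for the pointer from VecGetArray() */
  double *device_data;  /* stand-in for the ViennaCL-side storage */
  size_t  n;
} VecHandle;

/* The only backend-specific step: pick the storage the kernel works on.
   No data is copied here. */
static double *acquire(VecHandle *v, Backend b) {
  return (b == BACKEND_CPU) ? v->host_data : v->device_data;
}

/* A single kernel shared by both backends. */
static void scale_kernel(double *x, size_t n, double alpha) {
  assert(x != NULL);
  for (size_t i = 0; i < n; ++i) x[i] *= alpha;
}

/* Returns 0 on success, -1 if the selected storage is missing. */
int scale_vec(VecHandle *v, Backend b, double alpha) {
  double *x = acquire(v, b);
  if (x == NULL) return -1;
  scale_kernel(x, v->n, alpha);
  return 0;
}
```

The point of the layout is that only acquire() knows about the backend,
so the kernel itself is written and tested once.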

Cheers,
Mani

On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:

> Hi Mani,
>
>> Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
>> (page 16), I ran KSP ex12 for two cases:
>>
>> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>>
>> real    0m0.213s
>> user    0m0.206s
>> sys     0m0.004s
>>
>> 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
>> -log_summary > log_summary_with_viennacl
>>
>> real    0m20.296s
>> user    0m46.025s
>> sys     0m1.435s
>>
>> The runs were performed on a CPU (AMD A10-5800K) with OpenCL from
>> AMD-APP-SDK-v3.0.
>>
>
> There are a couple of things to note here:
>
> a) The total execution time contains the OpenCL kernel compilation time,
> which is on the order of one or two seconds. Thus, you need much larger
> problem sizes to get a good comparison.
>
> b) Most of the execution time is spent on VecMDot, which is optimized for
> GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
> because one can use just plain C/C++/whatever).
>
> c) My experiences with this AMD APU are quite mixed, as I've never found a
> way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part.
> The integrated GPU, however, reached 80% without much effort. This is
> particularly remarkable as both CPU and GPU share the same DDR3 memory
> link. Thus, it is highly unlikely that you will ever beat the
> performance of PETSc's native types.
>
>
>
>> Attached are:
>> 1) configure.log for the petsc build
>> 2) log summary without viennacl
>> 3) log summary with viennacl
>> 4) OpenCL info for the system on which the runs were performed
>>
>> Perhaps the reason for the slow performance is superfluous copies being
>> performed, which need not occur when running ViennaCL on the CPU.
>> Looking at
>> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx :
>>
>> /* Copies a vector from the CPU to the GPU unless we already have an
>> up-to-date copy on the GPU */
>> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
>> {
>>    PetscErrorCode ierr;
>>
>>    PetscFunctionBegin;
>>    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>>    if (v->map->n > 0) {
>>      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>>        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>        try {
>>          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>>          viennacl::fast_copy(*(PetscScalar**)v->data,
>>                              *(PetscScalar**)v->data + v->map->n, vec->begin());
>>          ViennaCLWaitForGPU();
>>        } catch(std::exception const & ex) {
>>          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>>        }
>>        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>>        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>>      }
>>    }
>>    PetscFunctionReturn(0);
>> }
>>
>> When running ViennaCL with OpenCL on the CPU, perhaps the above function
>> should be modified?
>>
>
> Unfortunately that is quite hard: OpenCL manages its own memory handles,
> so 'injecting' memory into an OpenCL kernel that is not allocated by the
> OpenCL runtime is not recommended, fairly tricky, and still involves some
> overhead. As I see no reason to run OpenCL on a CPU, I refrained from
> adding this extra code complexity.
>
> Overall, I recommend rerunning the benchmark on more powerful discrete
> GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
> performance benefits.
>
> Hope this sheds some light on things :-)
>
> Best regards,
> Karli
>
>
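
As an aside on the copy logic quoted above: VecViennaCLCopyToGPU() only
transfers data when the host copy is the sole valid one, and then marks
both copies as valid. A reduced sketch of that flag handling in plain C,
assuming a hypothetical MirroredVec type whose "device" array is simulated
with ordinary host memory, and modeling only the two states visible in the
quoted function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the two states seen in the quoted code
   (PETSC_VIENNACL_CPU and PETSC_VIENNACL_BOTH). */
typedef enum { VALID_HOST, VALID_BOTH } ValidState;

typedef struct {
  double    *host;
  double    *device;   /* simulated device storage (plain host memory) */
  size_t     n;
  ValidState valid;
  int        uploads;  /* number of host-to-device copies performed */
} MirroredVec;

/* Copy host data to the device only if the device copy is stale. */
void copy_to_device(MirroredVec *v) {
  assert(v != NULL);
  if (v->n > 0 && v->valid == VALID_HOST) {
    memcpy(v->device, v->host, v->n * sizeof *v->host);
    v->uploads++;
    v->valid = VALID_BOTH;
  }
}

/* Any host-side write makes the device copy stale again. */
void host_write(MirroredVec *v, size_t i, double value) {
  assert(v != NULL && i < v->n);
  v->host[i] = value;
  v->valid = VALID_HOST;
}
```

Calling copy_to_device() twice in a row performs one transfer; a second
transfer happens only after host_write() has invalidated the device copy.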