[petsc-dev] Slow ViennaCL performance on KSP ex12

Sun Oct 11 15:14:37 CDT 2015

Hi Mani,

> Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
> (page 16), I ran KSP ex12 for two cases:
>
> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>
> real    0m0.213s
> user    0m0.206s
> sys     0m0.004s
>
> 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
> -log_summary > log_summary_with_viennacl
>
> real    0m20.296s
> user    0m46.025s
> sys     0m1.435s
>
> The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from
> AMD-APP-SDK-v3.0.

There are a couple of things to note here:

a) The total execution time contains the OpenCL kernel compilation time, 
which is on the order of one or two seconds. Thus, you need much larger 
problem sizes to get a good comparison.

b) Most of the execution time is spent on VecMDot, which is optimized 
for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend 
because one can use just plain C/C++/whatever).

c) My experience with this AMD APU is quite mixed: I have never found 
a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU 
part. The integrated GPU, however, reached 80% without much effort. This 
is particularly remarkable as both CPU and GPU share the same DDR3 
memory link. Thus, it is highly unlikely that OpenCL on this CPU will 
ever beat the performance of PETSc's native types.
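
To make points a) and b) concrete, here is a minimal sketch in plain C++ of how one could time a multi-dot-product workload while keeping one-time startup costs out of the measurement. The names multi_dot and seconds_per_call_after_warmup are made up for illustration and are not PETSc or ViennaCL API; multi_dot only mimics the kind of operation VecMDot performs (several dot products against one vector), not its actual implementation.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Plain C++ analogue of a multi-dot product: out[j] = sum_i x[i] * ys[j][i].
// Hypothetical helper for illustration; assumes every ys[j] has x.size() entries.
std::vector<double> multi_dot(const std::vector<double>& x,
                              const std::vector<std::vector<double>>& ys) {
  std::vector<double> out(ys.size(), 0.0);
  for (std::size_t j = 0; j < ys.size(); ++j)
    for (std::size_t i = 0; i < x.size(); ++i)
      out[j] += x[i] * ys[j][i];
  return out;
}

// Calls `work` once without timing it, so that one-time costs (for an OpenCL
// backend, presumably the kernel compilation) are excluded, then returns the
// average wall-clock seconds per call over `reps` timed calls.
template <typename F>
double seconds_per_call_after_warmup(F&& work, int reps) {
  work();  // warm-up call, not timed
  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) work();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count() / reps;
}
```

Passing a lambda that calls multi_dot on vectors of increasing length gives a rough idea of when the steady-state cost starts to dominate a fixed startup cost of a second or two.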

> Attached are:
> 1) configure.log for the petsc build
> 2) log summary without viennacl
> 3) log summary with viennacl
> 4) OpenCL info for the system on which the runs were performed
>
> Perhaps the reason for the slow performance is superfluous copies being
> performed, which need not occur when running ViennaCL on the CPU.
> Looking at
> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>
> /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
> {
>    PetscErrorCode ierr;
>
>    PetscFunctionBegin;
>    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>    if (v->map->n > 0) {
>      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>        try {
>          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>          ViennaCLWaitForGPU();
>        } catch(std::exception const & ex) {
>          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>        }
>        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>        v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>      }
>    }
>    PetscFunctionReturn(0);
> }
>
> When running ViennaCL with OpenCL on the CPU, the above function should
> maybe be modified?

Unfortunately, that is quite hard: OpenCL manages its own memory handles, 
so 'injecting' memory that was not allocated by the OpenCL runtime into 
an OpenCL kernel is not recommended, fairly tricky, and still involves 
some overhead. As I see no reason to run OpenCL on a CPU, I refrained 
from adding this extra code complexity.
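
For reference, the bookkeeping in the quoted VecViennaCLCopyToGPU can be modelled in plain C++ roughly as below. This is only a sketch under stated assumptions: MirroredVector, Where, copy_to_device and set_host are hypothetical names, and the "device" buffer is just a second std::vector standing in for the OpenCL-managed buffer. It shows that the copy is already skipped whenever the device side is up to date, and that the remaining copy cannot simply be dropped as long as host and device data live in separate allocations.

```cpp
#include <cstddef>
#include <vector>

// Which side holds valid data. Loosely modelled on the PETSC_VIENNACL_CPU and
// PETSC_VIENNACL_BOTH states in the quoted code; these names are hypothetical.
enum class Where { Host, Device, Both };

struct MirroredVector {
  std::vector<double> host;    // host-side array
  std::vector<double> device;  // stand-in for the buffer owned by the OpenCL runtime
  Where valid = Where::Host;   // freshly created data lives on the host
  std::size_t copies = 0;      // number of host-to-device copies performed

  // Copy host -> device only if the host holds the newer data.
  void copy_to_device() {
    if (host.empty()) return;
    if (valid == Where::Host) {
      device.assign(host.begin(), host.end());  // stands in for viennacl::fast_copy
      ++copies;
      valid = Where::Both;
    }
  }

  // A host-side write makes the device copy stale.
  void set_host(std::size_t i, double v) {
    host[i] = v;
    valid = Where::Host;
  }
};
```

Calling copy_to_device twice in a row performs only one copy; a second copy happens only after set_host has modified the host data.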

Overall, I recommend rerunning the benchmark on more powerful discrete 
GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any 
performance benefits.

Hope this sheds some light on things :-)

Best regards,
Karli