[petsc-dev] Slow ViennaCL performance on KSP ex12
Karl Rupp
rupp at iue.tuwien.ac.at
Sun Oct 11 15:14:37 CDT 2015
Hi Mani,
> Following http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf
> (page 16), I ran KSP ex12 for two cases:
>
> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl
>
> real 0m0.213s
> user 0m0.206s
> sys 0m0.004s
>
> 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl
> -log_summary > log_summary_with_viennacl
>
> real 0m20.296s
> user 0m46.025s
> sys 0m1.435s
>
> The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from
> AMD-APP-SDK-v3.0.
there are a couple of things to note here:
a) The total execution time contains the OpenCL kernel compilation time,
which is on the order of one or two seconds. Thus, you need much larger
problem sizes to get a good comparison.
b) Most of the execution time is spent on VecMDot, which is optimized
for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend
because one can use just plain C/C++/whatever).
c) My experiences with this AMD APU are quite mixed, as I've never found
a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU
part. The integrated GPU, however, reached 80% without much effort. This
is particularly remarkable as both CPU and GPU share the same DDR3
memory link. Thus, it is highly unlikely that you will ever beat the
performance of PETSc's native types with OpenCL on this CPU.
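To put the STREAM percentages in c) into context, here is a minimal sketch of how a STREAM-like "triad" bandwidth figure can be estimated in plain C++. It is not the official STREAM benchmark and not part of PETSc or ViennaCL; the function name triad_bandwidth_gbs and its parameters are made up for illustration, and the byte count assumes two loads and one store per element (write-allocate traffic is ignored).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs the triad kernel a[i] = b[i] + s * c[i]
// `repeats` times and returns the best observed bandwidth in GB/s
// (1 GB = 1e9 bytes). Assumes 3 * n * sizeof(double) bytes per sweep.
double triad_bandwidth_gbs(std::vector<double>& a,
                           const std::vector<double>& b,
                           const std::vector<double>& c,
                           double s, int repeats)
{
  assert(a.size() == b.size() && b.size() == c.size());
  assert(repeats > 0 && !a.empty());

  const std::size_t n = a.size();
  double best_seconds = 1e300;

  for (int r = 0; r < repeats; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
      a[i] = b[i] + s * c[i];
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double sec = std::chrono::duration<double>(t1 - t0).count();
    if (sec < best_seconds) best_seconds = sec;  // keep the fastest sweep
  }

  const double bytes = 3.0 * static_cast<double>(n) * sizeof(double);
  return bytes / best_seconds / 1e9;
}
```

The arrays should be much larger than the last-level cache, otherwise the number reflects cache rather than memory bandwidth. Comparing such a figure against the bandwidth an OpenCL kernel achieves on the same device gives percentages of the kind quoted above.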
> Attached are:
> 1) configure.log for the petsc build
> 2) log summary without viennacl
> 3) log summary with viennacl
> 4) OpenCL info for the system on which the runs were performed
>
> Perhaps the reason for the slow performance is superfluous copies being
> performed, which need not occur when running ViennaCL on the CPU.
> Looking at
> http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx:
>
> /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */
> PetscErrorCode VecViennaCLCopyToGPU(Vec v)
> {
>   PetscErrorCode ierr;
>
>   PetscFunctionBegin;
>   ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);
>   if (v->map->n > 0) {
>     if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {
>       ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>       try {
>         ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;
>         viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());
>         ViennaCLWaitForGPU();
>       } catch(std::exception const & ex) {
>         SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());
>       }
>       ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);
>       v->valid_GPU_array = PETSC_VIENNACL_BOTH;
>     }
>   }
>   PetscFunctionReturn(0);
> }
>
> When running ViennaCL with OpenCL on the CPU, the above function should
> maybe be modified?
Unfortunately that is quite hard: OpenCL manages its own memory handles,
so 'injecting' memory that was not allocated by the OpenCL runtime into
an OpenCL kernel is not recommended, fairly tricky, and still involves
some overhead. As I see no reason to run OpenCL on a CPU, I refrained
from adding this extra code complexity.
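For readers following the quoted VecViennaCLCopyToGPU(): it copies only when the flag says the CPU holds the newer data, and afterwards marks both copies as valid. Below is a minimal plain C++ sketch of that bookkeeping, with no OpenCL involved. MirroredVec, copy_to_device, set_host and the uploads counter are hypothetical names, and the `device` vector merely stands in for a buffer owned by the OpenCL runtime. A real implementation would presumably also need a state for data modified only on the device, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which side holds current data (mirrors the CPU / BOTH states in the quoted code).
enum class Valid { Host, Both };

// Hypothetical mirrored vector; `device` stands in for runtime-owned memory.
struct MirroredVec {
  std::vector<double> host;
  std::vector<double> device;
  Valid valid = Valid::Host;
  std::size_t uploads = 0;  // number of host-to-device copies actually performed

  explicit MirroredVec(std::size_t n) : host(n, 0.0), device(n, 0.0) {}

  // Copy host -> device only if the device copy is stale.
  void copy_to_device() {
    if (host.empty()) return;
    if (valid == Valid::Host) {
      device = host;        // the (potentially expensive) transfer
      ++uploads;
      valid = Valid::Both;  // both copies now agree
    }
  }

  // Writing on the host invalidates the device copy.
  void set_host(std::size_t i, double x) {
    assert(i < host.size());
    host[i] = x;
    valid = Valid::Host;
  }
};
```

With this pattern, repeated calls to copy_to_device() without an intervening host write trigger only one transfer. The copy itself remains, because host and device storage are separate allocations, which is the overhead discussed above for OpenCL on a CPU.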
Overall, I recommend rerunning the benchmark on more powerful discrete
GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any
performance benefits.
Hope this sheds some light on things :-)
Best regards,
Karli