<div dir="ltr"><div><div><div><div><div><div><div>Hi Karl,<br><br></div>My motivation was to avoid duplicating code for the CPU and the GPU. This is important considering that it takes a long time to test and make sure the code produces the right results. <br><br></div>I guess I can add a switch in my code with something like:<br><br></div>if (usingCPU) use VecGetArray()<br><br></div>else if (usingGPU) use VecViennaCLGetArray()<br><br></div><div>and then wrap the pointers that the above functions return in OpenCL buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and CL_ALLOC_.. for GPU)<br></div><div><br></div>Hopefully, this will avoid unnecessary data transfers.<br><br></div>Cheers,<br></div>Mani<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <span dir="ltr">&lt;<a href="mailto:rupp@iue.tuwien.ac.at" target="_blank">rupp@iue.tuwien.ac.at</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Mani,<span class=""><br>
<br>
> Following <a href="http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf</a><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
(page 16), I ran KSP ex12 for two cases:<br>
<br>
1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl<br>
<br>
real 0m0.213s<br>
user 0m0.206s<br>
sys 0m0.004s<br>
<br>
2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl<br>
-log_summary > log_summary_with_viennacl<br>
<br>
real 0m20.296s<br>
user 0m46.025s<br>
sys 0m1.435s<br>
<br>
The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from<br>
AMD-APP-SDK-v3.0.<br>
</blockquote>
<br></span>
There are a couple of things to note here:<br>
<br>
a) The total execution time contains the OpenCL kernel compilation time, which is on the order of one or two seconds. Thus, you need much larger problem sizes to get a good comparison.<br>
<br>
b) Most of the execution time is spent on VecMDot, which is optimized for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend because one can use just plain C/C++/whatever).<br>
<br>
c) My experiences with this AMD APU are quite mixed, as I've never found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part. The integrated GPU, however, reached 80% without much effort. This is particularly remarkable as both CPU and GPU share the same DDR3 memory link. Thus, it is more than unlikely that you will ever beat the performance of PETSc's native types.<div><div class="h5"><br>
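To illustrate point a), here is a minimal sketch of how a fixed startup cost (such as OpenCL kernel compilation) dominates the wall-clock time of a small run and fades as the problem grows. The function names and the linear cost model are assumptions made for illustration only; they are not part of PETSc or ViennaCL, and no values here are measurements from this thread.<br>

```cpp
#include <cassert>

// Wall-clock time of a run that pays a one-off startup cost (for example
// OpenCL kernel compilation) plus solve work that grows with problem size.
// Hypothetical linear model; all parameters are illustrative.
double total_time(double startup_s, double per_unknown_s, long unknowns) {
  assert(startup_s >= 0.0 && per_unknown_s >= 0.0 && unknowns >= 0);
  return startup_s + per_unknown_s * static_cast<double>(unknowns);
}

// Fraction of the wall-clock time that is startup overhead, not solve work.
double startup_fraction(double startup_s, double per_unknown_s, long unknowns) {
  const double total = total_time(startup_s, per_unknown_s, unknowns);
  return total > 0.0 ? startup_s / total : 0.0;
}
```

Under this model, increasing the number of unknowns shrinks the startup fraction, which is why a larger problem size gives a fairer comparison of the solve phase itself.<br>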
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Attached are:<br>
1) configure.log for the petsc build<br>
2) log summary without viennacl<br>
3) log summary with viennacl<br>
4) OpenCL info for the system on which the runs were performed<br>
<br>
Perhaps the slow performance is caused by superfluous copies being<br>
performed, which need not occur when running ViennaCL on the CPU.<br>
Looking at<br>
<a href="http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx</a>:<br>
<br>
/* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */<br>
PetscErrorCode VecViennaCLCopyToGPU(Vec v)<br>
{<br>
PetscErrorCode ierr;<br>
<br>
PetscFunctionBegin;<br>
ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);<br>
if (v->map->n > 0) {<br>
if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {<br>
ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>
try {<br>
ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;<br>
viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());<br>
ViennaCLWaitForGPU();<br>
} catch(std::exception const & ex) {<br>
SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());<br>
}<br>
ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>
v->valid_GPU_array = PETSC_VIENNACL_BOTH;<br>
}<br>
}<br>
PetscFunctionReturn(0);<br>
}<br>
<br>
When running ViennaCL with OpenCL on the CPU, perhaps the above function<br>
should be modified?<br>
</blockquote>
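The quoted function copies only when the host holds the sole current copy, and then marks both copies as valid. A minimal, self-contained sketch of that flag logic follows. MirroredVector, Valid and the uploads counter are hypothetical names for illustration, and the "device" is just a second std::vector, so this does not use the PETSc or ViennaCL API.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which copy of the data is current; plays the role of valid_GPU_array.
enum class Valid { Host, Device, Both };

// Toy stand-in for a vector that has a host copy and a device copy.
struct MirroredVector {
  std::vector<double> host;
  std::vector<double> device;
  Valid valid = Valid::Host;
  int uploads = 0;  // number of real host-to-device copies performed

  explicit MirroredVector(std::size_t n) : host(n, 0.0), device(n, 0.0) {}

  // Copy host -> device only if the host holds the sole current copy.
  void copy_to_device() {
    if (!host.empty() && valid == Valid::Host) {
      device = host;
      ++uploads;
      valid = Valid::Both;
    }
  }

  // Writing on the host makes the device copy stale.
  // Assumes the host copy is current (state Host or Both).
  void set_host(std::size_t i, double x) {
    assert(i < host.size());
    assert(valid != Valid::Device);
    host[i] = x;
    valid = Valid::Host;
  }
};
```

In this sketch, calling copy_to_device() twice in a row performs one copy; a second copy happens only after set_host() has marked the device data as stale.<br>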
<br></div></div>
Unfortunately that is quite hard: OpenCL manages its own memory handles, so 'injecting' memory into an OpenCL kernel that is not allocated by the OpenCL runtime is not recommended, fairly tricky, and still involves some overhead. As I see no reason to run OpenCL on a CPU, I refrained from adding this extra code complexity.<br>
<br>
Overall, I recommend rerunning the benchmark on more powerful discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any performance benefits.<br>
<br>
Hope this sheds some light on things :-)<br>
<br>
Best regards,<br>
Karli<br>
<br>
</blockquote></div><br></div>