<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Oct 12, 2015 at 2:36 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

> On Oct 12, 2015, at 2:29 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

><br>

> On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <<a href="mailto:mc0710@gmail.com">mc0710@gmail.com</a>> wrote:<br>

> Hi Karl,<br>

><br>

> My motivation was to avoid duplicating code for the CPU and the GPU. This is important considering that it takes a long time to test and make sure the code produces the right results.<br>

><br>

> I guess I can add a switch in my code with something like:<br>

><br>

> if (usingCPU) use VecGetArray()<br>

><br>

> else if (usingGPU) use VecViennaCLGetArray()<br>

><br>

> and then wrap the pointers returned by the above functions in OpenCL buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and CL_ALLOC_.. for GPU)<br>

><br>

> Hopefully, this will avoid unnecessary data transfers.<br>

><br>

> I do not understand this comment at all. This looks crazy to me. The whole point of having Vec<br>

> is so that no one ever ever ever ever does anything like this. I saw nothing in the thread that would<br>

> compel you to do this. What are you trying to accomplish with this switch?<br>

<br>

  Matt,<br>

<br>

     The current OpenCL code in PETSc is hardwired for GPU usage. So the correct fix, I believe, is to add to the VecViennaCL wrappers support for either using the GPU or the CPU.<br></blockquote><div><br></div><div>Yes, that is an option.</div><div><br></div><div>I thought the upshot of Karl's mail was that while this is possible, OpenCL CPU performance is woeful and unlikely to improve, and</div><div>a better option is to use the current code with multiple MPI processes and the PETSc type mechanism.</div><div><br></div><div>  Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

   Barry<br>

<br>

><br>

>   Matt<br>

><br>

> Cheers,<br>

> Mani<br>

><br>

> On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <<a href="mailto:rupp@iue.tuwien.ac.at">rupp@iue.tuwien.ac.at</a>> wrote:<br>

> Hi Mani,<br>

><br>

> > Following <a href="http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf</a><br>

> (page 16), I ran KSP ex12 for two cases:<br>

><br>

> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl<br>

><br>

> real    0m0.213s<br>

> user    0m0.206s<br>

> sys     0m0.004s<br>

><br>

> 2) time ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl<br>

> -log_summary > log_summary_with_viennacl<br>

><br>

> real    0m20.296s<br>

> user    0m46.025s<br>

> sys     0m1.435s<br>

><br>

> The runs have been performed on a CPU : AMD A10-5800K, with OpenCL from<br>

> AMD-APP-SDK-v3.0.<br>

><br>

> There are a couple of things to note here:<br>

><br>

> a) The total execution time contains the OpenCL kernel compilation time, which is on the order of one or two seconds. Thus, you need much larger problem sizes to get a good comparison.<br>

><br>

> b) Most of the execution time is spent on VecMDot, which is optimized for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend because one can use just plain C/C++/whatever).<br>

><br>

> c) My experiences with this AMD APU are quite mixed, as I've never found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part. The integrated GPU, however, reached 80% without much effort. This is particularly remarkable as both CPU and GPU share the same DDR3 memory link. Thus, it is highly unlikely that you will ever beat the performance of PETSc's native types.<br>
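
A rough way to obtain a bandwidth figure of the kind mentioned in c) on a given machine is a STREAM-style triad loop. The sketch below is plain C++; it is not the official STREAM benchmark and not PETSc or ViennaCL code. The names TriadResult and measure_triad are made up for illustration, and counting three arrays of traffic per sweep follows the usual STREAM convention.<br>

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Outcome of a triad measurement (hypothetical helper type).
struct TriadResult {
  double seconds;       // wall-clock time of the fastest sweep
  double gbytes_per_s;  // bytes moved in one sweep / seconds / 1e9
};

// Runs a[i] = b[i] + scalar * c[i] over the whole array `reps` times
// and reports the fastest sweep. b and c must be at least as long as a.
TriadResult measure_triad(std::vector<double>& a, const std::vector<double>& b,
                          const std::vector<double>& c, double scalar, int reps) {
  const std::size_t n = a.size();
  assert(b.size() >= n && c.size() >= n);

  double best = -1.0;  // negative means "no sweep timed yet"
  for (int r = 0; r < reps; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];
    const auto t1 = std::chrono::steady_clock::now();
    const double dt = std::chrono::duration<double>(t1 - t0).count();
    if (best < 0.0 || dt < best) best = dt;
  }

  TriadResult result;
  result.seconds = best < 0.0 ? 0.0 : best;
  // Two arrays read and one written per sweep.
  const double bytes = 3.0 * static_cast<double>(n) * sizeof(double);
  // Guard against a zero reading from a coarse clock on tiny arrays.
  result.gbytes_per_s = result.seconds > 0.0 ? bytes / result.seconds / 1e9 : 0.0;
  return result;
}
```

For the number to say anything about the memory link, the arrays should be much larger than the last-level cache, and the rate should be compared against a reference STREAM result measured on the same machine.<br>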

><br>

><br>

><br>

> Attached are:<br>

> 1) configure.log for the petsc build<br>

> 2) log summary without viennacl<br>

> 3) log summary with viennacl<br>

> 4) OpenCL info for the system on which the runs were performed<br>

><br>

> Perhaps the slow performance is caused by superfluous copies,<br>

> which need not occur when running ViennaCL on the CPU.<br>

> Looking at<br>

> <a href="http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx</a>:<br>

><br>

> /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */<br>

> PetscErrorCode VecViennaCLCopyToGPU(Vec v)<br>

> {<br>

>    PetscErrorCode ierr;<br>

><br>

>    PetscFunctionBegin;<br>

>    ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);<br>

>    if (v->map->n > 0) {<br>

>      if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {<br>

>        ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>

>        try {<br>

>          ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;<br>

>          viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());<br>

>          ViennaCLWaitForGPU();<br>

>        } catch(std::exception const & ex) {<br>

>          SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());<br>

>        }<br>

>        ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>

>        v->valid_GPU_array = PETSC_VIENNACL_BOTH;<br>

>      }<br>

>    }<br>

>    PetscFunctionReturn(0);<br>

> }<br>
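
The function above copies only when the CPU side holds the sole valid copy, and then marks both sides as valid. A minimal stand-alone sketch of that validity-flag pattern, assuming ordinary host memory in place of a real GPU buffer, might look as follows. All names here (Valid, MirroredVec, copy_to_device, copy_to_host, host_write_access, device_read_access) are hypothetical and are not part of PETSc or ViennaCL; unlike the listing above, the sketch does not special-case empty vectors.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which copy of the data is current (hypothetical stand-in for the flag above).
enum class Valid { Host, Device, Both };

struct MirroredVec {
  std::vector<double> host;    // stands in for the CPU array
  std::vector<double> device;  // stands in for the GPU buffer (plain memory here)
  Valid valid = Valid::Host;
  std::size_t transfers = 0;   // host-to-device copies actually performed
};

// Copies host -> device only if the host holds the sole valid copy.
void copy_to_device(MirroredVec& v) {
  if (v.valid == Valid::Host) {
    v.device = v.host;
    ++v.transfers;
    v.valid = Valid::Both;
  }
}

// Copies device -> host only if the device holds the sole valid copy.
void copy_to_host(MirroredVec& v) {
  if (v.valid == Valid::Device) {
    v.host = v.device;
    v.valid = Valid::Both;
  }
}

// Host-side write access: refresh the host copy, then invalidate the device copy.
double* host_write_access(MirroredVec& v) {
  copy_to_host(v);
  v.valid = Valid::Host;
  return v.host.data();
}

// Device-side read access: make sure the device copy is current first.
const double* device_read_access(MirroredVec& v) {
  copy_to_device(v);
  assert(v.valid != Valid::Host);  // device copy must be valid at this point
  return v.device.data();
}
```

Repeated device reads with no host write in between trigger a single transfer, so in this pattern copies happen only after the host side has actually been modified.<br>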

><br>

> When running ViennaCL with OpenCL on the CPU, perhaps the above function<br>

> should be modified?<br>

><br>

> Unfortunately that is quite hard: OpenCL manages its own memory handles, so 'injecting' memory into an OpenCL kernel that is not allocated by the OpenCL runtime is not recommended, fairly tricky, and still involves some overhead. As I see no reason to run OpenCL on a CPU, I refrained from adding this extra code complexity.<br>

><br>

> Overall, I recommend rerunning the benchmark on more powerful discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any performance benefits.<br>

><br>

> Hope this sheds some light on things :-)<br>

><br>

> Best regards,<br>

> Karli<br>

><br>

><br>

><br>

><br>

<span class="HOEnZb"><font color="#888888">><br>

> --<br>

> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

> -- Norbert Wiener<br>

<br>

</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>