<div dir="ltr"><div><div><div><div>Here is the source file:<br><a href="https://github.com/AFD-Illinois/grim/blob/opencl/computeresidual.cl">https://github.com/AFD-Illinois/grim/blob/opencl/computeresidual.cl</a><br><br></div>Attached is the assembly code "assembly_code.asm" generated using:<br><br>ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu' -asm=assembly_code<br><br></div>Cheers,<br></div>Mani<br><br></div>Caution: The source code in the opencl branch of <a href="https://github.com/AFD-Illinois/grim">https://github.com/AFD-Illinois/grim</a> is not very clean.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 12, 2015 at 4:28 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Mon, Oct 12, 2015 at 3:25 PM, Mani Chandra <span dir="ltr"><<a href="mailto:mc0710@gmail.com" target="_blank">mc0710@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Here is the code: <a href="http://github.com/afd-illinois/grim" target="_blank">http://github.com/afd-illinois/grim</a>, branch: opencl.<br><div><br>Now using <br><br>Kernel Builder for OpenCL API - compiler command line, version 1.4.0.134<br>Copyright (C) 2014 Intel Corporation. All rights reserved.<br><br><br>manic@bh27:~/grim_opencl/grim> ioc64 -input=computeresidual.cl -bo='-DOPENCL' -device='cpu' <br>No command specified, using 'build' as default<br></div></div></blockquote><div><br></div></span><div>Sorry if I am being obtuse, but I cannot find that source file in the repo above. 
Can you give the direct link to the file?</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Using build options: -DOPENCL<br>Setting target instruction set architecture to: Default (Advanced Vector Extension (AVX))<br>OpenCL Intel CPU device was found!<br>Device name: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz<br>Device version: OpenCL 1.2 (Build 44)<br>Device vendor: Intel(R) Corporation<br>Device profile: FULL_PROFILE<br>Compilation started<br>Compilation done<br>Linking started<br>Linking done<br>Device build started<br>Device build done<br>Kernel <ComputeResidual> was successfully vectorized<br>Done.<br>Build succeeded!<br></div></div></blockquote><div><br></div></span><div>It definitely says it vectorized, but what code did it generate? Can you post the object file, since I do not have the compiler? I have seen that message with really bad code before.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div></div><div>Cheers,<br></div><div>Mani<br></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 12, 2015 at 12:52 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span>On Mon, Oct 12, 2015 at 2:44 PM, Mani Chandra <span dir="ltr"><<a href="mailto:mc0710@gmail.com" target="_blank">mc0710@gmail.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Oct 12, 2015 at 12:36 
PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span><br>
> On Oct 12, 2015, at 2:29 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
><br>
> On Mon, Oct 12, 2015 at 2:13 PM, Mani Chandra <<a href="mailto:mc0710@gmail.com" target="_blank">mc0710@gmail.com</a>> wrote:<br>
> Hi Karl,<br>
><br>
> My motivation was to avoid duplicating code for the CPU and the GPU. This is important considering that it takes a long time to test and make sure the code produces the right results.<br>
><br>
> I guess I can add a switch in my code with something like:<br>
><br>
> if (usingCPU) use VecGetArray()<br>
><br>
> else if (usingGPU) use VecViennaCLGetArray()<br>
><br>
> and then wrap the pointers that the above functions return in OpenCL buffers with the appropriate memory flags (CL_MEM_USE_HOST_PTR for CPU and CL_ALLOC_.. for GPU)<br>
><br>
> Hopefully, this will avoid unnecessary data transfers.<br>
><br>
> I do not understand this comment at all. This looks crazy to me. The whole point of having Vec<br>
> is so that no one ever ever ever ever does anything like this. I saw nothing in the thread that would<br>
> compel you to do this. What are you trying to accomplish with this switch?<br>
<br></span></blockquote><div><br></div><div>I'm trying to assemble the residual needed for SNES using an OpenCL kernel. The kernel operates on OpenCL buffers which can live on either the CPU or the GPU. <br><br></div><div>I think it is useful to use OpenCL on the CPU mainly because of vectorization and vector data types. If I had to write plain C code, I'd have to use all sorts of pragmas in icc to get the code to vectorize, and even then it's pretty hard.<br></div></div></div></div></blockquote><div><br></div></span><div>I would completely agree with you if I thought the compiler actually vectorized that code. I do not think that is the case. Do you have an example where you get vectorized assembly?</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div><div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div>Mani<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>
</span> Matt,<br>
<br>
The current OpenCL code in PETSc is hardwired for GPU usage. So the correct fix, I believe, is to add support to the VecViennaCL wrappers for using either the GPU or the CPU.<br>
<span><font color="#888888"><br>
Barry<br>
</font></span><div><div><br>
><br>
> Matt<br>
><br>
> Cheers,<br>
> Mani<br>
><br>
> On Sun, Oct 11, 2015 at 1:14 PM, Karl Rupp <<a href="mailto:rupp@iue.tuwien.ac.at" target="_blank">rupp@iue.tuwien.ac.at</a>> wrote:<br>
> Hi Mani,<br>
><br>
> > Following <a href="http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-20/conference/Rupp_K.pdf</a><br>
> (page 16), I ran KSP ex12 for two cases:<br>
><br>
> 1) time ./ex12 -m 100 -n 100 -log_summary > log_summary_no_viennacl<br>
><br>
> real 0m0.213s<br>
> user 0m0.206s<br>
> sys 0m0.004s<br>
><br>
> 2) ./ex12 -m 100 -n 100 -vec_type viennacl -mat_type aijviennacl<br>
> -log_summary > log_summary_with_viennacl<br>
><br>
> real 0m20.296s<br>
> user 0m46.025s<br>
> sys 0m1.435s<br>
><br>
> The runs were performed on a CPU: AMD A10-5800K, with OpenCL from<br>
> AMD-APP-SDK-v3.0.<br>
><br>
> There are a couple of things to note here:<br>
><br>
> a) The total execution time contains the OpenCL kernel compilation time, which is on the order of one or two seconds. Thus, you need much larger problem sizes to get a good comparison.<br>
><br>
> b) Most of the execution time is spent on VecMDot, which is optimized for GPUs (CPUs are not an optimization goal in ViennaCL's OpenCL backend because one can use just plain C/C++/whatever).<br>
><br>
> c) My experiences with this AMD APU are quite mixed, as I've never found a way to get more than 45% of STREAM bandwidth with OpenCL on the CPU part. The integrated GPU, however, reached 80% without much effort. This is particularly remarkable as both CPU and GPU share the same DDR3 memory link. Thus, it is more than unlikely that you will ever beat the performance of PETSc's native types.<br>
><br>
><br>
><br>
> Attached are:<br>
> 1) configure.log for the petsc build<br>
> 2) log summary without viennacl<br>
> 3) log summary with viennacl<br>
> 4) OpenCL info for the system on which the runs were performed<br>
><br>
> Perhaps the reason for the slow performance is superfluous copies being<br>
> performed, which need not occur when running ViennaCL on the CPU.<br>
> Looking at<br>
> <a href="http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/petsc-dev/src/vec/vec/impls/seq/seqviennacl/vecviennacl.cxx</a>:<br>
><br>
> /* Copies a vector from the CPU to the GPU unless we already have an up-to-date copy on the GPU */<br>
> PetscErrorCode VecViennaCLCopyToGPU(Vec v)<br>
> {<br>
> PetscErrorCode ierr;<br>
><br>
> PetscFunctionBegin;<br>
> ierr = VecViennaCLAllocateCheck(v);CHKERRQ(ierr);<br>
> if (v->map->n > 0) {<br>
> if (v->valid_GPU_array == PETSC_VIENNACL_CPU) {<br>
> ierr = PetscLogEventBegin(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>
> try {<br>
> ViennaCLVector *vec = ((Vec_ViennaCL*)v->spptr)->GPUarray;<br>
> viennacl::fast_copy(*(PetscScalar**)v->data, *(PetscScalar**)v->data + v->map->n, vec->begin());<br>
> ViennaCLWaitForGPU();<br>
> } catch(std::exception const & ex) {<br>
> SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"ViennaCL error: %s", ex.what());<br>
> }<br>
> ierr = PetscLogEventEnd(VEC_ViennaCLCopyToGPU,v,0,0,0);CHKERRQ(ierr);<br>
> v->valid_GPU_array = PETSC_VIENNACL_BOTH;<br>
> }<br>
> }<br>
> PetscFunctionReturn(0);<br>
> }<br>
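<br>
The function above copies only when the host holds the sole up-to-date data, and then marks both sides as valid. A minimal sketch of that flag logic in plain C follows. It is not PETSc code: ToyVec, ValidFlag, toy_copy_to_device and toy_host_write are hypothetical names made up for illustration, the "device" array is just a second host array standing in for an OpenCL buffer, and the states only loosely mirror PETSC_VIENNACL_CPU and PETSC_VIENNACL_BOTH from the listing.<br>

```c
/* Hypothetical stand-in for the valid_GPU_array flag; names are illustrative only. */
typedef enum { VALID_UNALLOCATED, VALID_DEVICE, VALID_HOST, VALID_BOTH } ValidFlag;

typedef struct {
  double    host[4];    /* host-side storage */
  double    device[4];  /* stands in for the device buffer (assumption) */
  int       n;          /* number of entries in use */
  ValidFlag valid;      /* which side holds up-to-date data */
  int       copies;     /* number of host-to-device transfers performed */
} ToyVec;

/* Copy host -> device only when the host holds the sole up-to-date copy. */
static void toy_copy_to_device(ToyVec *v)
{
  if (v->n > 0 && v->valid == VALID_HOST) {
    for (int i = 0; i < v->n; i++) v->device[i] = v->host[i];
    v->copies++;
    v->valid = VALID_BOTH;  /* both sides now agree */
  }
}

/* Writing on the host makes the device copy stale. */
static void toy_host_write(ToyVec *v, int i, double x)
{
  v->host[i] = x;
  v->valid = VALID_HOST;
}
```

With a ToyVec that starts in VALID_HOST, calling toy_copy_to_device twice in a row transfers once; a later toy_host_write makes the next call transfer again. Under this scheme, repeated transfers would only occur when the host side is modified between device uses.<br>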
><br>
> When running ViennaCL with OpenCL on the CPU, the above function should<br>
> maybe be modified?<br>
><br>
> Unfortunately that is quite hard: OpenCL manages its own memory handles, so 'injecting' memory into an OpenCL kernel that is not allocated by the OpenCL runtime is not recommended, fairly tricky, and still involves some overhead. As I see no reason to run OpenCL on a CPU, I refrained from adding this extra code complexity.<br>
><br>
> Overall, I recommend rerunning the benchmark on more powerful discrete GPUs with GDDR5 (or on-chip memory). Otherwise you won't see any performance benefits.<br>
><br>
> Hope this sheds some light on things :-)<br>
><br>
> Best regards,<br>
> Karli<br>
><br>
><br>
><br>
><br>
><br>
> --<br>
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
> -- Norbert Wiener<br>
<br>
</div></div></blockquote></div><br></div></div>
</blockquote></div></div></div><div><div><br><br clear="all"><div><br></div>
</div></div></div></div>
</blockquote></div><br></div>
</div></div></blockquote></div></div></div><div><div class="h5"><br><br clear="all"><div><br></div>
</div></div></div></div>
</blockquote></div><br></div>