<br>2011/11/29 Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div>On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <span dir="ltr"><<a href="mailto:fredva@ifi.uio.no" target="_blank">fredva@ifi.uio.no</a>></span> wrote:<br></div></div>
<div class="gmail_quote"><div><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2011/11/29 Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span><br><div class="gmail_quote"><div><div></div><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div>On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <span dir="ltr"><<a href="mailto:fredva@ifi.uio.no" target="_blank">fredva@ifi.uio.no</a>></span> wrote:<br></div></div>
<div class="gmail_quote">
<div><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2011/10/28 Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span><br><div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <span dir="ltr"><<a href="mailto:fredva@ifi.uio.no" target="_blank">fredva@ifi.uio.no</a>></span> wrote:<br></div><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<div><br></div><div>I am working on integrating the new GPU-based vectors and matrices into FEniCS. Now, I'm looking at the possibility of getting some speedup during finite element assembly, specifically when inserting the local element matrix into the global element matrix. In that regard, I have a few questions I hope you can help me out with:</div>
<div><br></div><div>- When calling MatSetValues with a MATSEQAIJCUSP matrix as parameter, what exactly is it that happens? As far as I can see, MatSetValues is not implemented for GPU-based matrices, nor is mat->ops->setvalues set to point at any function for this Mat type. </div>
</blockquote><div><br></div></div><div>Yes, MatSetValues always operates on the CPU side. It would not make sense to do individual operations on the GPU.</div><div><br></div><div>I have written batched assembly for element matrices that are all the same size:</div>
<div><br></div><div> <a href="http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html" target="_blank">http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html</a></div>
<div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>- Is it such that matrices are assembled in their entirety on the CPU, and then copied over to the GPU (after calling MatAssemblyBegin)? Or are values copied over to the GPU each time you call MatSetValues?</div>
</blockquote><div><br></div></div><div>That function assembles the matrix on the GPU and then copies to the CPU. The only time you do not want this copy is when</div><div>you are running in serial and never touch the matrix afterwards, so I left it in.</div>
<div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>- Can we expect to see any speedup from using MatSetValuesBatch over MatSetValues, or is the batch version simply a utility function? This question goes for both CPU- and GPU-based matrices.</div>
</blockquote><div><br></div></div><div>CPU: no</div><div><br></div><div>GPU: yes, I see a speedup of about the memory bandwidth ratio</div><div><br></div><div><br></div></div></blockquote></div><div>Hi,</div><div><br></div><div>I have now integrated MatSetValuesBatch into our existing PETSc wrapper layer. I have tested matrix assembly with Poisson's equation on different meshes with elements of varying order. I have timed the single call to MatSetValuesBatch and compared that to the total time consumed by the repeated calls to MatSetValues in the old implementation. I have the following results:</div>
<div><br></div><div>Poisson on 1000x1000 unit square, 1st order Lagrange elements:</div><div><div><div>MatSetValuesBatch: 0.88576 s</div><div>repeated calls to MatSetValues: 0.76654 s</div></div></div><div><br></div><div>
<div>Poisson on 500x500 unit square, 2nd order Lagrange elements:</div>
<div>MatSetValuesBatch: 0.9324 s</div><div>repeated calls to MatSetValues: 0.81644 s</div></div><div><br></div><div>Poisson on 300x300 unit square, 3rd order Lagrange elements:</div><div>MatSetValuesBatch: 0.93988 s</div>
<div>repeated calls to MatSetValues: 1.03884 s</div>
<div><br></div><div>As you can see, the two methods take almost the same amount of time. What behavior and performance should we expect? Is there any way to optimize the performance of batched assembly?</div></div></blockquote>
<div><br></div></div></div><div>Almost certainly it is not dispatching to the CUDA version. The regular version just calls MatSetValues() in a loop. Are you</div><div>using a SEQAIJCUSP matrix?</div></div></blockquote></div>
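The batched input layout and the loop-based CPU fallback described above can be sketched in plain Python. This is only an illustration under stated assumptions, not PETSc's implementation: the function name `set_values_batch_reference` is hypothetical, the global matrix is a plain dict rather than a PETSc Mat, and each element matrix is assumed to be stored row-major with the same indices used for rows and columns.

```python
def set_values_batch_reference(ne, nl, elem_rows, elem_mats):
    """Add ne dense nl-by-nl element matrices into a global matrix.

    elem_rows: flat list of length ne*nl with the global indices of each element.
    elem_mats: flat list of length ne*nl*nl with the element matrices,
               assumed row-major per element (an assumption of this sketch).
    Returns a dict mapping (row, col) to the accumulated value.
    """
    if len(elem_rows) != ne * nl:
        raise ValueError("elem_rows must have ne*nl entries")
    if len(elem_mats) != ne * nl * nl:
        raise ValueError("elem_mats must have ne*nl*nl entries")

    global_matrix = {}
    for e in range(ne):
        # Global indices and dense block belonging to element e.
        rows = elem_rows[e * nl:(e + 1) * nl]
        block = elem_mats[e * nl * nl:(e + 1) * nl * nl]
        # Equivalent of one add-mode insertion of the whole element block.
        for i, r in enumerate(rows):
            for j, c in enumerate(rows):
                key = (r, c)
                global_matrix[key] = global_matrix.get(key, 0.0) + block[i * nl + j]
    return global_matrix


# Two 1D linear elements on nodes (0, 1) and (1, 2), each with
# element stiffness matrix [[1, -1], [-1, 1]].
A = set_values_batch_reference(
    ne=2,
    nl=2,
    elem_rows=[0, 1, 1, 2],
    elem_mats=[1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
)
```

The shared node 1 receives contributions from both elements, so `A[(1, 1)]` accumulates to 2.0. A loop of this shape gains nothing from batching on the CPU, which is consistent with the timings reported above if the GPU path is not being dispatched.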
</div><div>
Yes. The same matrices yield a speedup of 4-6x when solving the system on the GPU. </div></div></blockquote><div><br></div></div></div><div>Please confirm that the correct routine is called by running with -info and sending the output.</div>
<div>
<br></div><div>Please send the output of -log_summary so I can confirm the results.</div><div><br></div><div>You can run KSP ex4 and reproduce my results where I see a 5.5x speedup on the GTX285</div><div><br></div></div>
</blockquote><div>I am not sure what to look for in those outputs. I have uploaded the output of running my assembly program with -info and -log_summary, and the output of running ex4 with -log_summary. See</div><div><br>
</div><div><a href="http://folk.uio.no/fredva/assembly_info.txt">http://folk.uio.no/fredva/assembly_info.txt</a></div><div><a href="http://folk.uio.no/fredva/assembly_log_summary.txt">http://folk.uio.no/fredva/assembly_log_summary.txt</a></div>
<div><a href="http://folk.uio.no/fredva/ex4_log_summary.txt">http://folk.uio.no/fredva/ex4_log_summary.txt</a></div><div><br></div><div>Trying this on a different machine now, I actually see some speedup. 3rd order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec. </div>
<div>I have tried to increase the mesh size to see if the speedup increases, but I hit the bad_alloc error pretty quickly. </div><div><br>For a problem of that size, should I expect even more speedup? Please let me know if you need any more output from test runs on my machine. </div>
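To put the timings from the second machine in terms of speedup, the ratios can be computed directly. This is a small arithmetic sketch; the helper name `speedup` and the dictionary keys are illustrative, and the timings are the ones quoted above.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU assembly is than the CPU assembly."""
    return cpu_seconds / gpu_seconds


# (CPU seconds, GPU seconds) as reported for the second machine.
timings = {
    "3rd order, 300x300": (0.4232, 0.211),
    "1st order, 1000x1000": (0.31, 0.205),
}

ratios = {case: speedup(cpu, gpu) for case, (cpu, gpu) in timings.items()}

for case, ratio in ratios.items():
    print(f"{case}: {ratio:.2f}x")
```

This gives roughly 2.0x for the 3rd order case and roughly 1.5x for the 1st order case, both well below the 5.5x reported for KSP ex4 on the GTX285.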
<div><br></div><div>-- </div><div>Fredrik</div></div>