<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 14 October 2015 at 16:50, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Wed, Oct 14, 2015 at 7:34 AM, Timothée Nicolas <span dir="ltr"><<a href="mailto:timothee.nicolas@gmail.com" target="_blank">timothee.nicolas@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>OK, I see. Does it mean that the coarse grid solver is by default set up with the options -ksp_type preonly -pc_type lu ? What about the multiprocessor case ?<br></div></div></blockquote><div><br></div></span><div>Small scale: We use redundant LU</div><div><br></div><div>Large Scale: We use GAMG</div><div><br></div></div></div></div></blockquote><div><br></div><div>Is your answer what "you" recommend, or what PETSc does by default?<br></div><div><br>Your answer gives the impression that PETSc makes a decision regarding the choice of either redundant/LU or gamg based on something - e.g. the size of the matrix, the number of cores (or some combination of the two).<br></div><div>Is that really what is happening inside PCMG?<br></div><div><br></div><div><br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div> Matt</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div></div><div>Thx<br><br></div><div>Timothee<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-10-14 21:22 GMT+09:00 Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span>On Tue, Oct 13, 2015 at 9:23 PM, Timothée Nicolas <span dir="ltr"><<a href="mailto:timothee.nicolas@gmail.com" target="_blank">timothee.nicolas@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>Dear all,<br><br></div>I have been playing around with multigrid recently, namely with /ksp/ksp/examples/tutorials/ex42.c, with /snes/examples/tutorial/ex5.c and with my own implementation of a laplacian type problem. In all cases, I have noted no improvement whatsoever in the performance, whether in CPU time or KSP iteration, by varying the number of levels of the multigrid solver. As an example, I have attached the log_summary for ex5.c with nlevels = 2 to 7, launched by <br><br>mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 -da_refine 6 -pc_type mg -pc_mg_levels # -snes_monitor -ksp_monitor -log_summary<br><br></div>where -pc_mg_levels is set to a number between 2 and 7.<br><br></div>So there is a noticeable CPU time improvement from 2 levels to 3 levels (30%), and then no improvement whatsoever. 
> Matt
>
>> Thx
>>
>> Timothee
>>
>> 2015-10-14 21:22 GMT+09:00 Matthew Knepley <knepley@gmail.com>:
>>
>>> On Tue, Oct 13, 2015 at 9:23 PM, Timothée Nicolas <timothee.nicolas@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I have been playing around with multigrid recently, namely with /ksp/ksp/examples/tutorials/ex42.c, with /snes/examples/tutorials/ex5.c, and with my own implementation of a Laplacian-type problem. In all cases I have noted no improvement whatsoever in performance, whether in CPU time or in KSP iterations, when varying the number of levels of the multigrid solver. As an example, I have attached the log_summary for ex5.c with nlevels = 2 to 7, launched by
>>>>
>>>> mpiexec -n 1 ./ex5 -da_grid_x 21 -da_grid_y 21 -ksp_rtol 1.0e-9 -da_refine 6 -pc_type mg -pc_mg_levels # -snes_monitor -ksp_monitor -log_summary
>>>>
>>>> where -pc_mg_levels is set to a number between 2 and 7.
>>>>
>>>> There is a noticeable CPU time improvement (30%) from 2 levels to 3 levels, and then no improvement whatsoever. I am surprised, because with 6 levels of refinement of the DMDA the fine grid has more than 1200 points in each direction, so with 3 levels the coarse grid still has more than 300 points per direction, which is still pretty large (I assume the refinement ratio between grids is 2). I am wondering how the coarse solver can efficiently solve the problem on a coarse grid with that many points. Given the principle of multigrid, which is to erase the smooth part of the error with relaxation methods that are usually efficient only at high frequencies, I would expect optimal performance when the coarse grid is basically just a few points in each direction. Does anyone know why the performance saturates at such a low number of levels? What happens internally seems to be quite different from what I would expect...
>>>
>>> A performance model that counts only flops is not sophisticated enough to understand this effect. Unfortunately, nearly all MG books/papers use this model. What we need is a model that incorporates memory bandwidth (for pulling down the values), and maybe also memory latency. For instance, your relaxation pulls down all the values and makes a little progress: it does few flops but lots of memory access. An LU solve does a little memory access and many more flops, but it makes a lot more progress. If memory access is more expensive, then we have a tradeoff, and we can understand using a coarse grid which is not just a few points.
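Interjecting inline to check that I read this correctly: the model you have in mind is something like time ~ bytes_moved/bandwidth + flops/flop_rate per operation, rather than flops alone? To put my own rough numbers on it (these estimates are mine, not something stated above, so correct me if the scaling is off): for a 2D coarse grid with N points per direction, i.e. n = N^2 unknowns, one relaxation sweep streams the matrix and vectors once, so it costs O(n) memory traffic and O(n) flops but only damps part of the error. A banded LU factorization with bandwidth ~N costs O(n N^2) flops, but that is paid once during setup; each subsequent coarse solve is just the two triangular solves, roughly O(n N) flops and O(n N) traffic through the stored factors, and it removes the coarse-grid error completely.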
>>> Thanks,
>>>
>>>     Matt
>>>
>>>> Best
>>>>
>>>> Timothee
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener