<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, May 7, 2015 at 8:26 AM, Justin Chang <span dir="ltr"><<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Matt again thank you very much for all your help, last question(s) though.<br><br></div>I am not quite sure I follow your second comment. Do you mean that this whole analysis should be done on one processor (in sequential mode)? Or do you mean that calculating the TBT for these vec and matmult ops should assume the local size of the vector/matrix? When I said MPI_Allreduce, i meant collecting everything after calculating the local TBT for each processor. Most of these vec ops' calculations are local so it shouldn't be too much trouble, but for things like matmult and vecnorm, would it be quite convoluted because of interprocessor communication?<br></div></div></div></blockquote><div><br></div><div>I meant that I would do the analysis for one process and assume that is generalized smoothly.</div><div><br></div><div>  Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div></div>Thanks,<br></div>Justin<br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 6, 2015 at 2:34 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span>On Wed, May 6, 2015 at 2:28 PM, Justin Chang <span dir="ltr"><<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>></span> wrote:<br></span><span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>So basically I just need these types of operations:<br><br>VecTDot               60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1591<br>VecNorm               31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  1536<br>VecCopy                3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>VecSet                36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0<br>VecAXPY               60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1573<br>VecAYPX               29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   676<br>VecWAXPY               1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   118<br>VecPointwiseMult      31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   380<br>MatMult               30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 42  0  0  0   0 42  0  0  0  1071<br><br></div>Given the matrix size and the number of nonzeros, vector size of my solution, and the number of calls to each of these vec ops, I can estimate the total bytes transferred (loads/stores)? <br></div></div></div></div></div></blockquote><div><br></div></span><div>Yes, exactly.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div><div>For multiple processors, do I calculate the local or global sizes? If I calculate the local sizes, then do I need to do an MPI_Allreduce (with mpi_sum) of the TBT across all processors just like with total flops?<br></div></div></div></div></div></blockquote><div><br></div></span><div>Just do this analysis locally I think. Dealing with reductions adds another level of complexity.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div><div></div>Do these vec/mat ops account for Dirichlet constraints? That is, should the global/local size include those constraints?<br></div></div></div></div></blockquote><div><br></div></span><div>Hopefully the Dirichlet constraints are not significant. Certainly in the limit of large N they drop out. The solver operations</div><div>are in the global space, and the assembly operations are in the local space.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div></div>Also, is there a way to extract the event count for these operations besides dumping -log_summary each time?<br></div></div></div></blockquote><div><br></div></span><div>Yes, using the PetscStageLog and PetscEventLog interfaces. Here is an example from dtfe.c:</div><div><br></div><div><div>  PetscStageLog     stageLog;</div><div>  PetscEventPerfLog eventLog = NULL;</div><div>  PetscInt          stage;</div><div>  PetscErrorCode    ierr;</div><div><br></div><div>  PetscFunctionBegin;</div><div>  ierr = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);</div><div>  ierr = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);</div><div>  ierr = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);</div><div>    /* Log performance info */</div><div>  eventLog->eventInfo[ocl->residualEvent].count++;</div><div>  eventLog->eventInfo[ocl->residualEvent].time  += time;</div><div>  eventLog->eventInfo[ocl->residualEvent].flops += flops;</div></div><div><div><div><br></div><div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div></div>Thanks,<br></div>Justin<br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><span><div>On Wed, May 6, 2015 at 1:28 PM, Justin Chang <span dir="ltr"><<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>></span> wrote:<br></div></span><div class="gmail_extra"><div class="gmail_quote"><span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>Jed,<br><br></div>I am working with anisotropic diffusion and most standard numerical formulations (e.g., FEM, FVM, etc.) are "wrong" because they violate the discrete maximum principle, see Nakshatrala & Valocci (JCP 2009) for more on this. What we have seen people do is simply "ignore" or chop off these values but to us that is a complete and utter abomination. My goal here is to show that our proposed methodologies work by leveraging on the capabilities within PETSc and TAO and to also show how computationally expensive it is compared to solving the same problem using the standard Galerkin method. <br><br></div>Matt,<br><br></div>Okay, so then I guess I still have questions regarding how to obtain the bytes. How exactly would I count all the number of Vecs and their respective sizes, because it seems all the DMPlex related functions create many vectors. Or do I only count the DM created vectors used for my solution vector, residual, lower/upper bound, optimization routines, etc?<br></div></div></div></blockquote><div><br></div></span><div>This is laborious. You would build up from the small stuff. So a Krylov solver have MatMult, for which there is an analysis in the paper with Dinesh/Bill/Barry/David, and</div><div>Vec ops which are easy. This is a lot of counting, especially if you have a TAO solver in there. I would make sure you really care.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div></div>And when you say "forget about small stuff", does that include all the DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and all the jazz that goes on within the FE/discretization routines?<br></div></div></blockquote><div><br></div></span><div>Yep.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>Lastly, for a Matrix, wouldn't I just get the number of bytes from the memory usage section in -log_summary?<br></div></div></blockquote><div><br></div></span><div>That is a good way. You can also ask MatInfo how many nonzeros the matrix has.</div><div><div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div></div><div>Thanks,<br></div><div>Justin<br></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span>On Wed, May 6, 2015 at 11:41 AM, Justin Chang <span dir="ltr"><<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div>I suppose I have two objectives that I think are achievable within PETSc means: <br><br>1)

 How much wall-clock time can be reduced as you increase the number of 

processors. I have strong-scaling and parallel efficiency metrics that 

convey this. <br><br>2) The "optimal" problem size for these two 

methods/solvers are. What I mean by this is, at what point do I achieve 

the maximum FLOPS/s. If starting off with a really small problem then 

this metric should increase with problem size. My hypothesis is that as problem size increases, the ratio of wall-clock time spent in idle (e.g., waiting 

for cache to free up, accessing main memory, etc) to performing work also increases, and the reported FLOPS/s should start decreasing at some point. "Efficiency" in this context simply means the highest possible FLOPS/s.<br><br></div>Does that make sense and/or is "interesting" enough?<br></div></div></div></blockquote><div><br></div></span><div>I think 2) is not really that interesting because</div><div><br></div><div>  a) it is so easily gamed. Just stick in high flop count operations, like DGEMM.</div><div><br></div><div>  b) Time really matters to people who run the code, but flops never do.</div><div><br></div><div>  c) Floating point performance is not your limiting factor for time</div><div><br></div><div>I think it would be much more interesting, and no more work to</div><div><br></div><div>  a) Model the flop/byte \beta ratio simply</div><div><br></div><div>  b) Report how close you get to the max performance given \beta on your machine</div><div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div><div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div></div>Thanks,<br></div>Justin<br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 6, 2015 at 11:28 AM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>Justin Chang <<a href="mailto:jychang48@gmail.com" target="_blank">jychang48@gmail.com</a>> writes:<br>

> I already have speedup/strong scaling results that essentially depict the<br>

> difference between the KSPSolve() and TaoSolve(). However, I have been told<br>

> by someone that strong-scaling isn't enough - that I should somehow include<br>

> something to show the "efficiency" of these two methodologies.<br>

<br>

</span>"Efficiency" is irrelevant if one is wrong.  Can you set up a problem<br>

where both get the right answer and vary a parameter to get to the case<br>

where one fails?  Then you can look at efficiency for a given accuracy<br>

(and you might have to refine the grid differently) as you vary the<br>

parameter.<br>

<br>

It's really hard to demonstrate that an implicit solver is optimal in<br>

terms of mathematical convergence rate.  Improvements there can dwarf<br>

any differences in implementation efficiency.<br>

<div><div><br>

> That is, how much of the wall-clock time reported by these two very<br>

> different solvers is spent doing useful work.<br>

><br>

> Is such an "efficiency" metric necessary to report in addition to<br>

> strong-scaling results? The overall computational framework is the same for<br>

> both problems, the only difference being one uses a linear solver and the<br>

> other uses an optimization solver. My first thought was to use PAPI to<br>

> include hardware counters, but these are notoriously inaccurate. Then I<br>

> thought about simply reporting the manual FLOPS and FLOPS/s via PETSc, but<br>

> these metrics ignore memory bandwidth. And so here I am looking at the idea<br>

> of implementing the Roofline model, but now I am wondering if any of this<br>

> is worth the trouble.<br>

<br>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div></div></div><br><br clear="all"><span><div><br></div>-- <br><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</span></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div></div></div><div><div><br><br clear="all"><div><br></div>-- <br><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div></div></div><div><div><br><br clear="all"><div><br></div>-- <br><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>