<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi Mark,</div><div><br></div><div>Thanks for your email.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 21, 2019 at 6:39 AM Mark Adams via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov">petsc-dev@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I'm probably screwing up some sort of history by jumping into dev, but this is a dev comment ...</div><div><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>(1) -matptap_via hypre: This calls the hypre package to do the PtAP through an all-at-once triple product. In our experience, it is the most memory efficient, but can be slow.</div></div></div></div></div></div></div></blockquote><div><br></div><div>FYI,</div><div><br></div><div>I visited LLNL in about 1997 and told them how I did RAP. Simple 4 nested loops. They were very interested. Clearly they did it this way after I talked to them. This approach came up here a while back (e.g., we should offer this as an option).</div><div><br></div><div>Anecdotally, I don't see a noticeable difference in performance on my 3D elasticity problems between my old code (still used by the bone modeling people) and ex56 ...</div></div></div></blockquote><div><br></div><div>You may not see a difference when the problem is small. What I observed is that the HYPRE PtAP was ten times slower than the PETSc scalable PtAP on a problem of size 3 billion run on 10K processor cores. 
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>My kernel is an unrolled dense matrix triple product. I doubt Hypre did this. It ran at about 2x+ the flop rate of the mat-vec at scale on the SP3 in 2004.</div></div></div></blockquote><div><br></div><div>Could you explain this in more detail, perhaps with a small example?</div><div><br></div><div>I am profiling the current PETSc algorithms on some real simulations. If the PETSc PtAP still takes more memory than desired even with my fix (<a href="https://bitbucket.org/petsc/petsc/pull-requests/1452">https://bitbucket.org/petsc/petsc/pull-requests/1452</a>), I am going to implement the all-at-once triple product, which drops all intermediate data. If you have any documents (other than the code you posted before), they would be a great help.</div><div><br></div><div>Fande,</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>Mark</div><div> </div></div></div>

</blockquote></div></div></div></div>
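<div>For readers following the thread, here is a minimal sketch of what an all-at-once PtAP with four nested loops could look like. It is an illustration only and is not the hypre, PETSc, or Mark's implementation. The dict-of-dicts sparse storage, the function name ptap_all_at_once, and the small test matrices are assumptions made for this example. The point it shows is that each nonzero A[k][l] is scattered directly into C = P^T A P, so neither A*P nor P^T*A is ever stored.</div>

```python
from collections import defaultdict


def ptap_all_at_once(A, P):
    """Compute C = P^T A P without forming A*P or P^T*A.

    A and P are sparse matrices stored as {row: {col: value}}
    (a hypothetical layout chosen only to keep the sketch short).
    """
    C = defaultdict(lambda: defaultdict(float))
    for k, a_row in A.items():                    # loop 1: rows of A
        p_row_k = P.get(k, {})
        for l, a_kl in a_row.items():             # loop 2: nonzeros A[k][l]
            p_row_l = P.get(l, {})
            for i, p_ki in p_row_k.items():       # loop 3: nonzeros P[k][i]
                for j, p_lj in p_row_l.items():   # loop 4: nonzeros P[l][j]
                    # Contribution of A[k][l] to coarse entry C[i][j].
                    C[i][j] += p_ki * a_kl * p_lj
    return {i: dict(row) for i, row in C.items()}


# Fine operator: 1D Laplacian tridiag(-1, 2, -1) on 4 points.
A = {
    0: {0: 2.0, 1: -1.0},
    1: {0: -1.0, 1: 2.0, 2: -1.0},
    2: {1: -1.0, 2: 2.0, 3: -1.0},
    3: {2: -1.0, 3: 2.0},
}
# Piecewise-constant prolongation: fine points {0, 1} map to coarse 0,
# fine points {2, 3} map to coarse 1.
P = {0: {0: 1.0}, 1: {0: 1.0}, 2: {1: 1.0}, 3: {1: 1.0}}

C = ptap_all_at_once(A, P)
```

<div>The only storage beyond A and P is the coarse matrix C itself, which is the memory argument for this approach. The trade-off is that every update is an irregular scatter into C, which is one plausible reason such a kernel can be slower than a two-step product.</div>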