<div dir="ltr">Yes, this looks great. Good to get some data on this. Hard problem.</div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 27, 2017 at 3:31 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

   Hong,<br>

<br>

  Very nice, thanks!<br>

<br>

    Barry<br>

<div><div class="h5"><br>

> On Feb 27, 2017, at 2:06 PM, Hong <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>

><br>

> Mark,<br>

> I implemented scalable MatPtAP and did comparisons of three implementations using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):<br>

> - nonscalable PtAP: use an array of length PN to do dense axpy<br>

> - scalable PtAP:       do sparse axpy without use of PN array<br>

> - hypre PtAP.<br>

><br>

> The results are attached. Summary:<br>

> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP<br>

> - scalable PtAP is 4x faster than hypre PtAP<br>

> - hypre uses less memory (see <a href="http://job.ne399.n63.np1000.sh" rel="noreferrer" target="_blank">job.ne399.n63.np1000.sh</a>)<br>

><br>

> Based on above observation, I set the default PtAP algorithm as 'nonscalable'.<br>

> When PN > local estimated nonzero of C=PtAP, then switch default to 'scalable'.<br>

> User can overwrite default.<br>

><br>

> For the case of np=8000, ne=599 (see <a href="http://job.ne599.n500.np8000.sh" rel="noreferrer" target="_blank">job.ne599.n500.np8000.sh</a>), I get<br>

> MatPtAP                   3.6224e+01 (nonscalable for small mats, scalable for larger ones)<br>

> scalable MatPtAP     4.6129e+01<br>

> hypre                        1.9389e+02<br>

><br>

> I'm merging this work to next, then to master soon.<br>

><br>

> Hong<br>

><br>

> On Wed, Feb 8, 2017 at 7:25 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>

> Thanks Hong, that sounds great.<br>

><br>

> I am weary of silent optimizations like you suggest but 2x is big! and failure is very bad. So I would vote for your suggestion.<br>

><br>

> ex56 is an elasticity problem. It would be nice, now that you have this experimental setup in place, to compare with hypre on a 3D scalar problem. Hypre might not spend much effort optimizing for block matrices. 3x better than hypre seems large to me.  I have to suspect that hypre does not exploit blocked matrices as much as we do.<br>

><br>

> Thanks again,<br>

><br>

><br>

><br>

> On Wed, Feb 8, 2017 at 12:07 PM, Hong <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>

> I conducted tests on MatMatMult() and MatPtAP() using petsc/src/ksp/ksp/examples/<wbr>tutorials/ex56.c (gamg) on a 8-core machine (petsc machine). The output file is attached.<br>

><br>

> Summary:<br>

> 1) non-scalable MatMatMult() for mpiaij format is 2x faster than scalable version. The major difference between the two is dense-axpy vs. sparse-axpy.<br>

><br>

> Currently, we set non-scalable as default, which leads to problem when running large problems.<br>

> How about setting default as<br>

>  - non-scalable for small to medium size matrices<br>

>  - scalable for larger ones, e.g.<br>

><br>

> +    ierr = PetscOptionsEList("-<wbr>matmatmult_via","Algorithmic approach","MatMatMult",<wbr>algTypes,nalg,algTypes[1],&<wbr>alg,&flg);<br>

><br>

> +    if (!flg) { /* set default algorithm based on B->cmap->N */<br>

> +      PetscMPIInt size;<br>

> +      ierr = MPI_Comm_size(comm,&size);<wbr>CHKERRQ(ierr);<br>

> +      if ((PetscReal)(B->cmap->N)/size > 100000.0) alg = 0; /* scalable algorithm */<br>

> +    }<br>

><br>

> i.e., if user does NOT pick an algorithm, when ave cols per process > 100k, use scalable implementation; otherwise, non-scalable version.<br>

><br>

> 2) We do NOT have scalable implementation for MatPtAP() yet.<br>

> We have non-scalable PtAP and interface to Hypre's PtAP. Comparing the two,<br>

> Petsc MatPtAP() is approx 3x faster than Hypre's.<br>

><br>

> I'm writing a scalable MatPtAP() now.<br>

><br>

> Hong<br>

><br>

> On Thu, Feb 2, 2017 at 2:54 PM, Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com">stefano.zampini@gmail.com</a>> wrote:<br>

><br>

><br>

> Il 02 Feb 2017 23:43, "Mark Adams" <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> ha scritto:<br>

><br>

><br>

> On Thu, Feb 2, 2017 at 12:02 PM, Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com">stefano.zampini@gmail.com</a>> wrote:<br>

> Mark,<br>

><br>

> I saw your configuration has hypre. If you could run with master, you may try -matptap_via hypre.<br>

><br>

> This is worth trying. Does this even work with GAMG?<br>

><br>

> Yes, it should work, except that the block sizes, if any, are not propagated to the resulting matrix. I can add it if you need it.<br>

><br>

><br>

><br>

> Treb: try hypre anyway. It has its own RAP code.<br>

><br>

><br>

> With that option, you will use hypre's RAP with MATAIJ<br>

><br>

><br>

> It uses BoomerAMGBuildCoarseOperator directly with the AIJ matrices.<br>

><br>

> Stefano<br>

><br>

>> On Feb 2, 2017, at 7:28 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>

>><br>

>><br>

>><br>

>> On Thu, Feb 2, 2017 at 11:13 AM, Hong <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>

>> Mark:<br>

>> Try '-matmatmult_via scalable' first. If this works, should we set it as default?<br>

>><br>

>> If it is robust I would say yes unless it is noticeably slower (say >20%) small scale problems.<br>

>><br>

><br>

><br>

><br>

><br>

><br>

><br>

</div></div>> <out_ex56_cetus_short><br>

<br>

</blockquote></div><br></div>