[petsc-users] Code speedup after upgrading
Junchao Zhang
junchao.zhang at gmail.com
Sun Mar 28 19:48:20 CDT 2021
Is there an option to turn off MAT_SUBSET_OFF_PROC_ENTRIES for Mohammad to
try?
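
For reference, a minimal sketch of how that option could be toggled
explicitly from application code with MatSetOption (assuming the assembled
system matrix is called A; illustrative only, not Mohammad's actual code):

  #include <petscmat.h>

  /* Explicitly enable or disable the option on the assembled system
     matrix A before the assembly loop; error checking follows the usual
     PETSc convention. */
  PetscErrorCode toggle_subset_off_proc(Mat A, PetscBool flg)
  {
    PetscErrorCode ierr;
    ierr = MatSetOption(A, MAT_SUBSET_OFF_PROC_ENTRIES, flg);CHKERRQ(ierr);
    return 0;
  }

Passing PETSC_FALSE should switch the optimization off for a run;
PETSC_TRUE turns it back on.
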
--Junchao Zhang
On Sun, Mar 28, 2021 at 3:34 PM Jed Brown <jed at jedbrown.org> wrote:
> I take it this was using MAT_SUBSET_OFF_PROC_ENTRIES. I implemented that
> to help performance of PHASTA and other applications that assemble matrices
> that are relatively cheap to solve (so assembly cost is significant
> compared to preconditioner setup and KSPSolve) and I'm glad it helps so
> much here.
>
> I don't have an explanation for why you're observing local vector
> operations like VecScale and VecMAXPY running over twice as fast in the new
> code. These consist of simple code that has not changed, and which are
> normally memory bandwidth limited (though some of your problem sizes might
> fit in cache).
>
> Mohammad Gohardoust <gohardoust at gmail.com> writes:
>
> > Here is the plot of run time in old and new petsc using 1,2,4,8, and 16
> > CPUs (in logarithmic scale):
> >
> > [image: Screenshot from 2021-03-28 10-48-56.png]
> >
> >
> >
> >
> > On Thu, Mar 25, 2021 at 12:51 PM Mohammad Gohardoust <gohardoust at gmail.com>
> > wrote:
> >
> >> That's right, these loops also take roughly half the time. If I am not
> >> mistaken, petsc (MatSetValue) is called after doing some calculations
> >> over each tetrahedral element.
> >> Thanks for your suggestion. I will try that and will post the results.
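> >>
> >> Schematically, the loop looks something like the sketch below (shown
> >> here with MatSetValues for a whole element block; the helper routines
> >> and the 4-node tetrahedron layout are placeholders, not the actual
> >> names in my code):
> >>
> >>   #include <petscmat.h>
> >>
> >>   /* Hypothetical application routines returning the global indices and
> >>      the dense 4x4 element matrix of tetrahedron e. */
> >>   extern void element_indices(PetscInt e, PetscInt idx[4]);
> >>   extern void element_matrix(PetscInt e, PetscScalar Ke[16]);
> >>
> >>   PetscErrorCode assemble_system(Mat A, PetscInt nelem)
> >>   {
> >>     PetscErrorCode ierr;
> >>     for (PetscInt e = 0; e < nelem; e++) {
> >>       PetscInt    idx[4];  /* global row/column indices of the element */
> >>       PetscScalar Ke[16];  /* 4x4 element matrix, row-major */
> >>       element_indices(e, idx);
> >>       element_matrix(e, Ke);  /* the expensive local computation */
> >>       ierr = MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);CHKERRQ(ierr);
> >>     }
> >>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
> >>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
> >>     return 0;
> >>   }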
> >>
> >> Mohammad
> >>
> >> On Wed, Mar 24, 2021 at 3:23 PM Junchao Zhang <junchao.zhang at gmail.com>
> >> wrote:
> >>
> >>>
> >>>
> >>>
> >>> On Wed, Mar 24, 2021 at 2:17 AM Mohammad Gohardoust <gohardoust at gmail.com>
> >>> wrote:
> >>>
> >>>> So the code itself is a finite-element scheme, and in stages 1 and 3
> >>>> there are expensive loops over all mesh elements which consume a lot
> >>>> of time.
> >>>>
> >>> So these expensive loops must also take half the time with the newer
> >>> petsc? And these loops do not call petsc routines?
> >>> I think you can build two PETSc versions with the same configuration
> >>> options, then run your code with one MPI rank to see if there is a
> >>> difference.
> >>> If they give the same performance, then scale to 2, 4, ... ranks and
> >>> see what happens.
> >>>
> >>>
> >>>
> >>>>
> >>>> Mohammad
> >>>>
> >>>> On Tue, Mar 23, 2021 at 6:08 PM Junchao Zhang <junchao.zhang at gmail.com>
> >>>> wrote:
> >>>>
> >>>>> In the new log, I saw
> >>>>>
> >>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
> >>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
> >>>>>  0:      Main Stage: 5.4095e+00   2.3%  4.3700e+03   0.0%  4.764e+05   3.0%  3.135e+02        1.0%  2.244e+04  12.6%
> >>>>>  1: Solute_Assembly: 1.3977e+02  59.4%  7.3353e+09   4.6%  3.263e+06  20.7%  1.278e+03       26.9%  1.059e+04   6.0%
> >>>>>
> >>>>>
> >>>>> But I didn't see any event in this stage with a cost close to 140s.
> >>>>> What happened?
> >>>>>
> >>>>> --- Event Stage 1: Solute_Assembly
> >>>>>
> >>>>> BuildTwoSided      3531 1.0 2.8025e+0026.3 0.00e+00 0.0 3.6e+05 4.0e+00 3.5e+03  1  0  2  0  2   1  0 11  0 33     0
> >>>>> BuildTwoSidedF     3531 1.0 2.8678e+0013.2 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  1  0  5 17  2   1  0 22 62 33     0
> >>>>> VecScatterBegin    7062 1.0 7.1911e-02 1.9 0.00e+00 0.0 7.1e+05 3.5e+02 0.0e+00  0  0  5  2  0   0  0 22  6  0     0
> >>>>> VecScatterEnd      7062 1.0 2.1248e-01 3.0 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    73
> >>>>> SFBcastOpBegin     3531 1.0 2.6516e-02 2.4 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
> >>>>> SFBcastOpEnd       3531 1.0 9.5041e-02 4.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>>>> SFReduceBegin      3531 1.0 3.8955e-02 2.1 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
> >>>>> SFReduceEnd        3531 1.0 1.3791e-01 3.9 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   112
> >>>>> SFPack             7062 1.0 6.5591e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> >>>>> SFUnpack           7062 1.0 7.4186e-03 2.1 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2080
> >>>>> MatAssemblyBegin   3531 1.0 4.7846e+00 1.1 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  2  0  5 17  2   3  0 22 62 33     0
> >>>>> MatAssemblyEnd     3531 1.0 1.5468e+00 2.7 1.68e+07 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  2  0  0  0   104
> >>>>> MatZeroEntries     3531 1.0 3.0998e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
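> >>>>>
> >>>>> If the missing time is spent in application code inside this stage,
> >>>>> one way to make it visible would be to wrap those loops in a
> >>>>> user-defined log event, e.g. this rough sketch (the class/event
> >>>>> names are illustrative):
> >>>>>
> >>>>>   #include <petscsys.h>
> >>>>>
> >>>>>   /* Register a user event so time spent in application code inside
> >>>>>      the Solute_Assembly stage shows up in -log_view alongside the
> >>>>>      PETSc events. */
> >>>>>   static PetscLogEvent USER_ElementLoop;
> >>>>>
> >>>>>   PetscErrorCode register_user_event(void)
> >>>>>   {
> >>>>>     PetscErrorCode ierr;
> >>>>>     PetscClassId   classid;
> >>>>>     ierr = PetscClassIdRegister("UserCode",&classid);CHKERRQ(ierr);
> >>>>>     ierr = PetscLogEventRegister("UserElementLoop",classid,&USER_ElementLoop);CHKERRQ(ierr);
> >>>>>     return 0;
> >>>>>   }
> >>>>>
> >>>>>   /* Then, around the expensive application-side loop:                    */
> >>>>>   /*   ierr = PetscLogEventBegin(USER_ElementLoop,0,0,0,0);CHKERRQ(ierr); */
> >>>>>   /*   ... per-element computations ...                                   */
> >>>>>   /*   ierr = PetscLogEventEnd(USER_ElementLoop,0,0,0,0);CHKERRQ(ierr);   */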
> >>>>>
> >>>>>
> >>>>> --Junchao Zhang
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 23, 2021 at 5:24 PM Mohammad Gohardoust <
> >>>>> gohardoust at gmail.com> wrote:
> >>>>>
> >>>>>> Thanks Dave for your reply.
> >>>>>>
> >>>>>> For sure PETSc is awesome :D
> >>>>>>
> >>>>>> Yes, in both cases petsc was configured with --with-debugging=0,
> >>>>>> and fortunately I do have the old and new -log_view outputs, which
> >>>>>> I attached.
> >>>>>>
> >>>>>> Best,
> >>>>>> Mohammad
> >>>>>>
> >>>>>> On Tue, Mar 23, 2021 at 1:37 AM Dave May <dave.mayhem23 at gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Nice to hear!
> >>>>>>> The answer is simple, PETSc is awesome :)
> >>>>>>>
> >>>>>>> Jokes aside, assuming both petsc builds were configured with
> >>>>>>> --with-debugging=0, I don't think there is a definitive answer to
> >>>>>>> your question with the information you provided.
> >>>>>>>
> >>>>>>> It could be as simple as one specific implementation you use
> >>>>>>> having been improved between petsc releases. I am not an Ubuntu
> >>>>>>> expert, but the change might be associated with using a different
> >>>>>>> compiler and/or a more efficient BLAS implementation (non-threaded
> >>>>>>> vs threaded). However, I doubt this is the origin of your 2x
> >>>>>>> performance increase.
> >>>>>>>
> >>>>>>> If you really want to understand where the performance improvement
> >>>>>>> originated from, you'd need to send to the email list the result of
> >>>>>>> -log_view from both the old and new versions, running the exact
> >>>>>>> same problem.
> >>>>>>>
> >>>>>>> From that info, we can see which implementations in PETSc are being
> >>>>>>> used and where the time reduction is occurring. Knowing that, it
> >>>>>>> should be easier to provide an explanation for it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Dave
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue 23. Mar 2021 at 06:24, Mohammad Gohardoust <
> >>>>>>> gohardoust at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am using a code which is based on petsc (and also parmetis).
> >>>>>>>> Recently I made the following changes and now the code is running
> >>>>>>>> about two times faster than before:
> >>>>>>>>
> >>>>>>>> - Upgraded Ubuntu 18.04 to 20.04
> >>>>>>>> - Upgraded petsc 3.13.4 to 3.14.5
> >>>>>>>>    - This time I installed parmetis and metis directly via petsc
> >>>>>>>>    by the --download-parmetis --download-metis flags instead of
> >>>>>>>>    installing them separately and using --with-parmetis-include=...
> >>>>>>>>    and --with-parmetis-lib=... (the version of the installed
> >>>>>>>>    parmetis was 4.0.3 before)
> >>>>>>>>
> >>>>>>>> I was wondering what could possibly explain this speedup? Does
> >>>>>>>> anyone have any suggestions?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Mohammad
> >>>>>>>>
> >>>>>>>
>