[petsc-users] Code speedup after upgrading
Jed Brown
jed at jedbrown.org
Sun Mar 28 22:11:35 CDT 2021
It's an option that he would have set explicitly via MatSetOption, following Lawrence's suggestion. He can either not call that function or use PETSC_FALSE to unset it.
Junchao Zhang <junchao.zhang at gmail.com> writes:
> Is there an option to turn off MAT_SUBSET_OFF_PROC_ENTRIES for Mohammad to
> try?
>
> --Junchao Zhang
>
>
> On Sun, Mar 28, 2021 at 3:34 PM Jed Brown <jed at jedbrown.org> wrote:
>
>> I take it this was using MAT_SUBSET_OFF_PROC_ENTRIES. I implemented that
>> to help performance of PHASTA and other applications that assemble matrices
>> that are relatively cheap to solve (so assembly cost is significant
>> compared to preconditioner setup and KSPSolve) and I'm glad it helps so
>> much here.
>>
>> I don't have an explanation for why you're observing local vector
>> operations like VecScale and VecMAXPY running over twice as fast in the new
>> code. These consist of simple code that has not changed, and which are
>> normally memory bandwidth limited (though some of your problem sizes might
>> fit in cache).
>>
>> Mohammad Gohardoust <gohardoust at gmail.com> writes:
>>
>> > Here is the plot of run time in old and new petsc using 1,2,4,8, and 16
>> > CPUs (in logarithmic scale):
>> >
>> > [image: Screenshot from 2021-03-28 10-48-56.png]
>> >
>> >
>> >
>> >
>> > On Thu, Mar 25, 2021 at 12:51 PM Mohammad Gohardoust <gohardoust at gmail.com> wrote:
>> >
>> >> That's right, these loops also take roughly half the time. If I am not
>> >> mistaken, petsc (MatSetValue) is called after doing some calculations over
>> >> each tetrahedral element.
>> >> Thanks for your suggestion. I will try that and will post the results.
>> >>
>> >> Mohammad
>> >>
>> >> On Wed, Mar 24, 2021 at 3:23 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>> >>
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Mar 24, 2021 at 2:17 AM Mohammad Gohardoust <gohardoust at gmail.com> wrote:
>> >>>
>> >>>> So the code itself is a finite-element scheme, and in stages 1 and 3 there
>> >>>> are expensive loops over all mesh elements which consume a lot of time.
>> >>>>
>> >>> So these expensive loops must also take half time with newer petsc? And
>> >>> these loops do not call petsc routines?
>> >>> I think you can build two PETSc versions with the same configuration
>> >>> options, then run your code with one MPI rank to see if there is a
>> >>> difference.
>> >>> If they give the same performance, then scale to 2, 4, ... ranks and see
>> >>> what happens.
>> >>>
>> >>>
>> >>>
>> >>>>
>> >>>> Mohammad
>> >>>>
>> >>>> On Tue, Mar 23, 2021 at 6:08 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:
>> >>>>
>> >>>>> In the new log, I saw
>> >>>>>
>> >>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>> >>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>> >>>>>  0:      Main Stage: 5.4095e+00   2.3%  4.3700e+03   0.0%  4.764e+05   3.0%  3.135e+02        1.0%  2.244e+04  12.6%
>> >>>>>  1: Solute_Assembly: 1.3977e+02  59.4%  7.3353e+09   4.6%  3.263e+06  20.7%  1.278e+03       26.9%  1.059e+04   6.0%
>> >>>>>
>> >>>>>
>> >>>>> But I didn't see any event in this stage with a cost close to 140 s. What
>> >>>>> happened?
>> >>>>>
>> >>>>> --- Event Stage 1: Solute_Assembly
>> >>>>>
>> >>>>> BuildTwoSided       3531 1.0 2.8025e+00 26.3 0.00e+00 0.0 3.6e+05 4.0e+00 3.5e+03  1  0  2  0  2   1  0 11  0 33     0
>> >>>>> BuildTwoSidedF      3531 1.0 2.8678e+00 13.2 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  1  0  5 17  2   1  0 22 62 33     0
>> >>>>> VecScatterBegin     7062 1.0 7.1911e-02  1.9 0.00e+00 0.0 7.1e+05 3.5e+02 0.0e+00  0  0  5  2  0   0  0 22  6  0     0
>> >>>>> VecScatterEnd       7062 1.0 2.1248e-01  3.0 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    73
>> >>>>> SFBcastOpBegin      3531 1.0 2.6516e-02  2.4 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>> >>>>> SFBcastOpEnd        3531 1.0 9.5041e-02  4.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> >>>>> SFReduceBegin       3531 1.0 3.8955e-02  2.1 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>> >>>>> SFReduceEnd         3531 1.0 1.3791e-01  3.9 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   112
>> >>>>> SFPack              7062 1.0 6.5591e-03  2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> >>>>> SFUnpack            7062 1.0 7.4186e-03  2.1 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2080
>> >>>>> MatAssemblyBegin    3531 1.0 4.7846e+00  1.1 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  2  0  5 17  2   3  0 22 62 33     0
>> >>>>> MatAssemblyEnd      3531 1.0 1.5468e+00  2.7 1.68e+07 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  2  0  0  0   104
>> >>>>> MatZeroEntries      3531 1.0 3.0998e-02  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> >>>>>
>> >>>>>
>> >>>>> --Junchao Zhang
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Mar 23, 2021 at 5:24 PM Mohammad Gohardoust <gohardoust at gmail.com> wrote:
>> >>>>>
>> >>>>>> Thanks Dave for your reply.
>> >>>>>>
>> >>>>>> For sure PETSc is awesome :D
>> >>>>>>
>> >>>>>> Yes, in both cases petsc was configured with --with-debugging=0 and
>> >>>>>> fortunately I do have the old and new -log_view outputs, which I attached.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Mohammad
>> >>>>>>
>> >>>>>> On Tue, Mar 23, 2021 at 1:37 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>> >>>>>>
>> >>>>>>> Nice to hear!
>> >>>>>>> The answer is simple, PETSc is awesome :)
>> >>>>>>>
>> >>>>>>> Jokes aside, assuming both petsc builds were configured with
>> >>>>>>> --with-debugging=0, I don’t think there is a definitive answer to your
>> >>>>>>> question with the information you provided.
>> >>>>>>>
>> >>>>>>> It could be as simple as one specific implementation you use having been
>> >>>>>>> improved between petsc releases. I am not an Ubuntu expert, but the change
>> >>>>>>> might be associated with using a different compiler and/or a more
>> >>>>>>> efficient BLAS implementation (non-threaded vs threaded). However, I doubt
>> >>>>>>> this is the origin of your 2x performance increase.
>> >>>>>>>
>> >>>>>>> If you really want to understand where the performance improvement
>> >>>>>>> originated from, you’d need to send to the email list the result of
>> >>>>>>> -log_view from both the old and new versions, running the exact same
>> >>>>>>> problem.
>> >>>>>>>
>> >>>>>>> From that info, we can see what implementations in PETSc are being
>> >>>>>>> used and where the time reduction is occurring. Knowing that, it should be
>> >>>>>>> easier to provide an explanation for it.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Dave
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue 23. Mar 2021 at 06:24, Mohammad Gohardoust <gohardoust at gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> I am using a code which is based on petsc (and also parmetis).
>> >>>>>>>> Recently I made the following changes and now the code is running about two
>> >>>>>>>> times faster than before:
>> >>>>>>>>
>> >>>>>>>> - Upgraded Ubuntu 18.04 to 20.04
>> >>>>>>>> - Upgraded petsc 3.13.4 to 3.14.5
>> >>>>>>>> - This time I installed parmetis and metis directly via petsc by
>> >>>>>>>> --download-parmetis --download-metis flags instead of installing them
>> >>>>>>>> separately and using --with-parmetis-include=... and
>> >>>>>>>> --with-parmetis-lib=... (the version of installed parmetis was 4.0.3 before)
>> >>>>>>>>
>> >>>>>>>> I was wondering what can possibly explain this speedup? Does anyone
>> >>>>>>>> have any suggestions?
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Mohammad
>> >>>>>>>>
>> >>>>>>>
>>
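The question raised in the quoted log (no event in Solute_Assembly comes close to the 140 s stage time) can be checked with some quick arithmetic. The sketch below is not part of the original thread: it sums the max time of every event listed for that stage and compares the result with the stage total. The numbers are copied from the quoted -log_view output; the dictionary, constant, and function names are made up for illustration. It assumes the second numeric column after the count ratio is the max time in seconds, and it treats the sum as an upper bound, since some of these events are probably nested inside others and would then be counted twice.

```python
# Max time (seconds) per event in stage 1 (Solute_Assembly), copied from the
# quoted -log_view output. Names below are illustrative, not from PETSc.
EVENT_TIMES = {
    "BuildTwoSided": 2.8025e+00,
    "BuildTwoSidedF": 2.8678e+00,
    "VecScatterBegin": 7.1911e-02,
    "VecScatterEnd": 2.1248e-01,
    "SFBcastOpBegin": 2.6516e-02,
    "SFBcastOpEnd": 9.5041e-02,
    "SFReduceBegin": 3.8955e-02,
    "SFReduceEnd": 1.3791e-01,
    "SFPack": 6.5591e-03,
    "SFUnpack": 7.4186e-03,
    "MatAssemblyBegin": 4.7846e+00,
    "MatAssemblyEnd": 1.5468e+00,
    "MatZeroEntries": 3.0998e-02,
}

# Solute_Assembly time from the "Summary of Stages" table.
STAGE_TIME = 1.3977e+02


def unaccounted_time(stage_time, event_times):
    """Return (logged, remainder): the summed event time and what is left."""
    logged = sum(event_times.values())
    return logged, stage_time - logged


logged, remainder = unaccounted_time(STAGE_TIME, EVENT_TIMES)
share = 100.0 * remainder / STAGE_TIME
print(f"logged PETSc events: {logged:7.2f} s (upper bound)")
print(f"outside any event:   {remainder:7.2f} s ({share:.0f}% of the stage)")
```

Even as an over-count, the logged events add up to under 13 s, so most of the stage is spent outside any PETSc event. That is consistent with the later messages in the thread, where the time is attributed to the application's own loops over mesh elements.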