[petsc-users] Code speedup after upgrading

Jed Brown jed at jedbrown.org
Sun Mar 28 15:33:46 CDT 2021


I take it this was using MAT_SUBSET_OFF_PROC_ENTRIES. I implemented that to help the performance of PHASTA and other applications that assemble matrices that are relatively cheap to solve (so assembly cost is significant compared to preconditioner setup and KSPSolve), and I'm glad it helps so much here.
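For anyone who wants to try it: the option is set per matrix with MatSetOption() before the repeated assemblies. A minimal sketch (not code from this thread; the function and variable names are hypothetical, and it assumes A is an already created and preallocated MPIAIJ matrix whose set of off-process entries is the same in every assembly, which is what the option requires):

    #include <petscmat.h>

    /* Sketch only: reassemble A each time step, reusing the communication
       pattern established during the first assembly. */
    PetscErrorCode AssembleEachStep(Mat A, PetscInt nsteps)
    {
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = MatSetOption(A, MAT_SUBSET_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
      for (PetscInt step = 0; step < nsteps; step++) {
        ierr = MatZeroEntries(A);CHKERRQ(ierr);
        /* ... MatSetValues() calls for this step ... */
        ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
        ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      }
      PetscFunctionReturn(0);
    }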

I don't have an explanation for why you're observing local vector operations like VecScale and VecMAXPY running over twice as fast in the new code. These are simple kernels that have not changed and are normally memory-bandwidth limited (though some of your problem sizes might fit in cache).
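For context, VecMAXPY computes y <- y + sum_i alpha_i x_i: every vector entry is read (and y written) once while only a couple of flops are done per entry, so for vectors that do not fit in cache the run time is set by memory traffic. An illustrative plain-C sketch of that access pattern (not PETSc's actual implementation):

    /* Illustrative VecMAXPY-like kernel: y[j] += sum_i alpha[i] * x[i][j].
       It streams all of y and every x[i] through memory once, so for large n
       the memory bandwidth, not the arithmetic, dominates. */
    static void maxpy_like(int n, int nv, double *y, const double *alpha,
                           const double *const *x)
    {
      for (int j = 0; j < n; j++) {
        double sum = y[j];
        for (int i = 0; i < nv; i++) sum += alpha[i] * x[i][j];
        y[j] = sum;
      }
    }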

Mohammad Gohardoust <gohardoust at gmail.com> writes:

> Here is the plot of run time with the old and new petsc using 1, 2, 4, 8,
> and 16 CPUs (on a logarithmic scale):
>
> [image: Screenshot from 2021-03-28 10-48-56.png]
>
>
>
>
> On Thu, Mar 25, 2021 at 12:51 PM Mohammad Gohardoust <gohardoust at gmail.com>
> wrote:
>
>> That's right, these loops also take roughly half the time. If I am not
>> mistaken, petsc (MatSetValue) is called after doing some calculations over
>> each tetrahedral element.
>> Thanks for your suggestion. I will try that and will post the results.
>>
>> Mohammad
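>>
>> A rough sketch of what such a per-element loop typically looks like
>> (illustrative only; the element routine and names are hypothetical, not
>> from this code), batching each 4x4 element matrix into MatSetValues()
>> with ADD_VALUES and finishing with one assembly (error checking omitted):
>>
>>     for (PetscInt e = 0; e < nelem; e++) {
>>       PetscInt    idx[4];   /* global indices of the element's 4 vertices */
>>       PetscScalar Ke[16];   /* 4x4 element matrix, row-major */
>>
>>       ComputeElementMatrix(e, idx, Ke);  /* hypothetical user routine */
>>       MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);
>>     }
>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);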
>>
>> On Wed, Mar 24, 2021 at 3:23 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Wed, Mar 24, 2021 at 2:17 AM Mohammad Gohardoust <gohardoust at gmail.com>
>>> wrote:
>>>
>>>> So the code itself is a finite-element scheme and in stage 1 and 3 there
>>>> are expensive loops over entire mesh elements which consume a lot of time.
>>>>
>>> So these expensive loops also take about half the time with the newer
>>> petsc?  And these loops do not call petsc routines?
>>> I think you can build two PETSc versions with the same configuration
>>> options, then run your code with one MPI rank to see if there is a
>>> difference.
>>> If they give the same performance, then scale to 2, 4, ... ranks and see
>>> what happens.
>>>
>>>
>>>
>>>>
>>>> Mohammad
>>>>
>>>> On Tue, Mar 23, 2021 at 6:08 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> In the new log, I saw
>>>>>
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>>>  0:      Main Stage: 5.4095e+00   2.3%  4.3700e+03   0.0%  4.764e+05   3.0%  3.135e+02        1.0%  2.244e+04  12.6%
>>>>>  1: Solute_Assembly: 1.3977e+02  59.4%  7.3353e+09   4.6%  3.263e+06  20.7%  1.278e+03       26.9%  1.059e+04   6.0%
>>>>>
>>>>>
>>>>> But I didn't see any event in this stage had a cost close to 140s. What
>>>>> happened?
>>>>>
>>>>>  --- Event Stage 1: Solute_Assembly
>>>>>
>>>>> BuildTwoSided       3531 1.0 2.8025e+00 26.3 0.00e+00 0.0 3.6e+05 4.0e+00 3.5e+03  1  0  2  0  2   1  0 11  0 33     0
>>>>> BuildTwoSidedF      3531 1.0 2.8678e+00 13.2 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  1  0  5 17  2   1  0 22 62 33     0
>>>>> VecScatterBegin     7062 1.0 7.1911e-02 1.9 0.00e+00 0.0 7.1e+05 3.5e+02 0.0e+00  0  0  5  2  0   0  0 22  6  0     0
>>>>> VecScatterEnd       7062 1.0 2.1248e-01 3.0 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    73
>>>>> SFBcastOpBegin      3531 1.0 2.6516e-02 2.4 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>> SFBcastOpEnd        3531 1.0 9.5041e-02 4.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFReduceBegin       3531 1.0 3.8955e-02 2.1 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>> SFReduceEnd         3531 1.0 1.3791e-01 3.9 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   112
>>>>> SFPack              7062 1.0 6.5591e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFUnpack            7062 1.0 7.4186e-03 2.1 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2080
>>>>> MatAssemblyBegin    3531 1.0 4.7846e+00 1.1 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  2  0  5 17  2   3  0 22 62 33     0
>>>>> MatAssemblyEnd      3531 1.0 1.5468e+00 2.7 1.68e+07 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  2  0  0  0   104
>>>>> MatZeroEntries      3531 1.0 3.0998e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 23, 2021 at 5:24 PM Mohammad Gohardoust <
>>>>> gohardoust at gmail.com> wrote:
>>>>>
>>>>>> Thanks Dave for your reply.
>>>>>>
>>>>>> For sure PETSc is awesome :D
>>>>>>
>>>>>> Yes, in both cases petsc was configured with --with-debugging=0, and
>>>>>> fortunately I do have the old and new -log_view outputs, which I attached.
>>>>>>
>>>>>> Best,
>>>>>> Mohammad
>>>>>>
>>>>>> On Tue, Mar 23, 2021 at 1:37 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Nice to hear!
>>>>>>> The answer is simple, PETSc is awesome :)
>>>>>>>
>>>>>>> Jokes aside, assuming both petsc builds were configured with
>>>>>>> --with-debugging=0, I don’t think there is a definitive answer to your
>>>>>>> question with the information you provided.
>>>>>>>
>>>>>>> It could be as simple as a specific implementation you use having been
>>>>>>> improved between petsc releases. I am not an Ubuntu expert, but the change
>>>>>>> might also be associated with using a different compiler and/or a more
>>>>>>> efficient BLAS implementation (non-threaded vs. threaded). However, I doubt
>>>>>>> this is the origin of your 2x performance increase.
>>>>>>>
>>>>>>> If you really want to understand where the performance improvement
>>>>>>> originated, you’d need to send the -log_view output from both the old and
>>>>>>> new versions, running the exact same problem, to the mailing list.
>>>>>>>
>>>>>>> From that info, we can see which implementations in PETSc are being
>>>>>>> used and where the time reduction is occurring. Knowing that, it should be
>>>>>>> easier to provide an explanation for it.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>> On Tue 23. Mar 2021 at 06:24, Mohammad Gohardoust <
>>>>>>> gohardoust at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using a code which is based on petsc (and also parmetis).
>>>>>>>> Recently I made the following changes and now the code is running about two
>>>>>>>> times faster than before:
>>>>>>>>
>>>>>>>>    - Upgraded Ubuntu 18.04 to 20.04
>>>>>>>>    - Upgraded petsc 3.13.4 to 3.14.5
>>>>>>>>    - This time I installed parmetis and metis directly via petsc using the
>>>>>>>>    --download-parmetis and --download-metis flags, instead of installing them
>>>>>>>>    separately and using --with-parmetis-include=... and
>>>>>>>>    --with-parmetis-lib=... (the previously installed parmetis version was 4.0.3)
>>>>>>>>
>>>>>>>> I was wondering what could possibly explain this speedup. Does anyone
>>>>>>>> have any suggestions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mohammad
>>>>>>>>
>>>>>>>

