[petsc-users] Suspect Poor Performance of Petsc

Barry Smith bsmith at mcs.anl.gov
Thu Nov 17 13:12:58 CST 2016


> On Nov 17, 2016, at 12:25 PM, Ivano Barletta <ibarletta at inogs.it> wrote:
> 
> Thank you for your replies
> 
> I've carried out 3 new tests with 4,8,16 cores, adding the 
> code lines suggested by Barry (logs in attachment )
> 
> The lack of scaling still persists, but maybe it is just related to the size
> of the problem.
> 
> As far as load balancing is concerned, there's not very much I can do; I
> believe it depends on the land/sea points (on land the matrix coefficients
> are zero), and it's something I cannot control.

   Sure you can. You need to divide the rows between processes to balance the work per process (not just balance the number of rows per process); to first order you can assume the work is proportional to the number of nonzeros in the matrix on that process. Thus you want to divide up the rows so that each process has the same number of nonzeros (this is not perfect but it is a start).
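
   For example, here is a minimal sketch (the function and array names are only illustrative, not from your code) of computing the local row count so that each process owns roughly the same number of nonzeros; it assumes every process can see, or compute, the global number of nonzeros per row:

  #include <petscsys.h>

  /* Assign row i to the process whose share of the total nonzeros contains the
     cumulative nonzero count before row i; this gives contiguous row ranges
     with roughly equal nonzeros per process. */
  PetscErrorCode SplitRowsByNonzeros(MPI_Comm comm,PetscInt M,const PetscInt nnz[],PetscInt *mlocal)
  {
    PetscErrorCode ierr;
    PetscMPIInt    rank,size;
    PetscInt       i,owner,count = 0,total = 0,cum = 0;

    PetscFunctionBeginUser;
    ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
    ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
    for (i=0; i<M; i++) total += nnz[i];
    for (i=0; i<M; i++) {
      /* double arithmetic avoids integer overflow of cum*size for large problems */
      owner = total ? (PetscInt)(((double)cum/(double)total)*size) : (PetscInt)(((double)i/(double)M)*size);
      if (owner > size-1) owner = size-1;
      if (owner == rank) count++;   /* this rank owns row i */
      cum += nnz[i];
    }
    *mlocal = count;
    PetscFunctionReturn(0);
  }

   You would then pass mlocal as the local size to MatSetSizes() (and use the matching local sizes for the vectors) before preallocating and assembling the matrix.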

   Here are the VecNorm lines from the three new logs:

VecNorm               65 1.0 7.4379e-02 1.0 8.87e+05 1.0 0.0e+00 0.0e+00 6.5e+01 72  9  0  0 31  82  9  0  0 34    47
VecNorm               67 1.0 7.8646e-02 1.0 4.62e+05 1.0 0.0e+00 0.0e+00 6.7e+01 43  9  0  0 31  72  9  0  0 34    46
VecNorm               67 1.0 2.0528e-01 1.2 2.34e+05 1.0 0.0e+00 0.0e+00 6.7e+01 32  9  0  0 31  63  9  0  0 34    18

   The time "spent" in VecNorm goes from 63 to 82 percent of the total solve time. This is due to the load imbalance: the norm requires a global reduction, so the less-loaded processes wait at that synchronization point for the slowest one, and the waiting is charged to VecNorm. You absolutely have to deal properly with the load imbalance or you will never get good parallel speedup (regardless of problem size).

   Note: When you run on one process with -pc_type jacobi the KSPSolve time should be pretty close to the time of your other solver; if it is much larger then we need to figure out what extra work the PETSc solver is doing, because the time for a basic solve using the same preconditioner with two different solvers is dictated by hardware, not software.
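
   For example, something like (the executable name here is just a placeholder for your test program):

  mpiexec -n 1 ./your_test -ksp_type cg -pc_type jacobi -log_view

   and compare the KSPSolve time in the log with the time of your existing PCG solver on one process.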


  Barry


> 
> By the way, I intend to carry out some tests on higher-resolution
> configurations, with a bigger problem size.
> 
> Kind Regards
> Ivano
> 
> 
> 2016-11-17 17:24 GMT+01:00 Barry Smith <bsmith at mcs.anl.gov>:
> 
>   Ivano,
> 
> I have cut and pasted the relevant parts of the logs below and removed a few irrelevant lines to make the analysis simpler.
> 
> There is a lot of bad stuff going on that is hurting performance.
> 
> 1) The percentage of the time in the linear solve is getting lower, going from 79% with 4 processes to 54% with 16 processes. This means the rest of the code is not scaling well; likely that is the generation of the matrix. How are you getting the matrix into the program? If you are reading it as ASCII (somehow in parallel?) you should not do that. You should use MatLoad() to get the matrix in efficiently (see for example src/ksp/ksp/examples/tutorials/ex10.c).
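> 
> For example, a minimal sketch of loading a matrix saved in PETSc binary format ("matrix.dat" is just a placeholder for whatever file name you use):
> 
>   PetscViewer viewer;
>   Mat         A;
>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"matrix.dat",FILE_MODE_READ,&viewer);CHKERRQ(ierr);
>   ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
>   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>   ierr = MatLoad(A,viewer);CHKERRQ(ierr);
>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
> 
> (The matrix can be written once in that format with MatView() and a binary viewer.)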
> 
> To make the analysis better you need to add the following around the KSP solve:
> 
>   PetscLogStage stage;
>   ierr = PetscLogStageRegister("Solve", &stage);CHKERRQ(ierr);
>   ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>   ierr = KSPSolve(ksp,bb,xx);CHKERRQ(ierr);
>   ierr = PetscLogStagePop();CHKERRQ(ierr);
> 
> and rerun the three cases.
> 
> 2) The load balance is bad even for four processes. For example it is 1.3 in MatSolve; it should be really close to 1.0. How are you dividing the matrix up between processes?
> 
> 3) It is spending a HUGE amount of time in VecNorm(), 26% on 4 processes and 42% on 16 processes. This could be partially or completely due to the load imbalance, but there might be other issues as well.
> 
> Run with -ksp_norm_type natural in your new set of runs.
> 
> Also always run with -ksp_type cg; it makes no sense to use gmres or the other KSP methods.
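> 
> For example (the executable name is just a placeholder):
> 
>   mpiexec -n 4 ./your_test -ksp_type cg -ksp_norm_type natural -log_view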
> 
> Eagerly awaiting your response.
> 
> Barry
> 
> 
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> MatMult               75 1.0 4.1466e-03 1.2 3.70e+06 1.6 6.0e+02 4.4e+02 0.0e+00 13 25 97 99  0  13 25 97 99  0  2868
> MatSolve              75 1.0 6.1995e-03 1.3 3.68e+06 1.6 0.0e+00 0.0e+00 0.0e+00 19 24  0  0  0  19 24  0  0  0  1908
> MatLUFactorNum         1 1.0 3.6880e-04 1.4 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   499
> MatILUFactorSym        1 1.0 1.7040e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatAssemblyBegin       1 1.0 2.5113e-04 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  1   1  0  0  0  1     0
> MatAssemblyEnd         1 1.0 1.8365e-03 1.0 0.00e+00 0.0 1.6e+01 1.1e+02 8.0e+00  6  0  3  1  3   6  0  3  1  3     0
> MatGetRowIJ            1 1.0 2.2865e-0517.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 6.1687e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecTDot              150 1.0 3.2991e-03 2.7 2.05e+06 1.0 0.0e+00 0.0e+00 1.5e+02  8 17  0  0 62   8 17  0  0 62  2466
> VecNorm               76 1.0 7.5034e-03 1.0 1.04e+06 1.0 0.0e+00 0.0e+00 7.6e+01 26  9  0  0 31  26  9  0  0 31   549
> VecSet                77 1.0 2.4495e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              150 1.0 7.8158e-04 1.1 2.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3 17  0  0  0   3 17  0  0  0 10409
> VecAYPX               74 1.0 6.8849e-04 1.0 1.01e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  8  0  0  0   2  8  0  0  0  5829
> VecScatterBegin       75 1.0 1.7794e-04 1.2 0.00e+00 0.0 6.0e+02 4.4e+02 0.0e+00  1  0 97 99  0   1  0 97 99  0     0
> VecScatterEnd         75 1.0 2.1674e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> KSPSetUp               2 1.0 1.4922e-04 4.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 2.2833e-02 1.0 1.36e+07 1.3 6.0e+02 4.4e+02 2.3e+02 79100 97 99 93  79100 97 99 93  2116
> PCSetUp                2 1.0 1.0116e-03 1.2 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0   182
> PCSetUpOnBlocks        1 1.0 6.2872e-04 1.2 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0   293
> PCApply               75 1.0 7.2835e-03 1.3 3.68e+06 1.6 0.0e+00 0.0e+00 0.0e+00 22 24  0  0  0  22 24  0  0  0  1624
> 
> MatMult               77 1.0 3.5985e-03 1.2 2.18e+06 2.4 1.5e+03 3.8e+02 0.0e+00  1 25 97 99  0   1 25 97 99  0  3393
> MatSolve              77 1.0 3.8145e-03 1.4 2.16e+06 2.4 0.0e+00 0.0e+00 0.0e+00  1 24  0  0  0   1 24  0  0  0  3163
> MatLUFactorNum         1 1.0 9.3037e-04 1.9 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   196
> MatILUFactorSym        1 1.0 2.1638e-03 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin       1 1.0 1.9466e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  1   1  0  0  0  1     0
> MatAssemblyEnd         1 1.0 2.1234e-02 1.0 0.00e+00 0.0 4.0e+01 9.6e+01 8.0e+00  8  0  3  1  3   8  0  3  1  3     0
> MatGetRowIJ            1 1.0 1.0025e-0312.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.4848e-03 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecTDot              154 1.0 4.1220e-03 1.8 1.06e+06 1.0 0.0e+00 0.0e+00 1.5e+02  1 17  0  0 62   1 17  0  0 62  2026
> VecNorm               78 1.0 1.5534e-01 1.0 5.38e+05 1.0 0.0e+00 0.0e+00 7.8e+01 60  9  0  0 31  60  9  0  0 31    27
> VecSet                79 1.0 1.5549e-03 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              154 1.0 8.0559e-04 1.2 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0 17  0  0  0   0 17  0  0  0 10368
> VecAYPX               76 1.0 5.8600e-04 1.4 5.24e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  7034
> VecScatterBegin       77 1.0 8.4793e-04 3.7 0.00e+00 0.0 1.5e+03 3.8e+02 0.0e+00  0  0 97 99  0   0  0 97 99  0     0
> VecScatterEnd         77 1.0 7.7019e-04 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               2 1.0 1.1451e-03 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 1.8231e-01 1.0 7.49e+06 1.5 1.5e+03 3.8e+02 2.3e+02 71100 97 99 93  71100 97 99 94   272
> PCSetUp                2 1.0 1.0994e-02 1.1 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0    17
> PCSetUpOnBlocks        1 1.0 4.9001e-03 1.2 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0    37
> PCApply               77 1.0 5.2556e-03 1.3 2.16e+06 2.4 0.0e+00 0.0e+00 0.0e+00  2 24  0  0  0   2 24  0  0  0  2296
> 
> MatMult               78 1.0 1.2783e-02 4.8 1.16e+06 3.9 3.5e+03 2.5e+02 0.0e+00  1 25 98 99  0   1 25 98 99  0   968
> MatSolve              78 1.0 1.4015e-0214.0 1.14e+06 3.9 0.0e+00 0.0e+00 0.0e+00  0 24  0  0  0   0 24  0  0  0   867
> MatLUFactorNum         1 1.0 1.0275e-0240.1 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    18
> MatILUFactorSym        1 1.0 2.0541e-0213.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatAssemblyBegin       1 1.0 2.1347e-02 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  1   1  0  0  0  1     0
> MatAssemblyEnd         1 1.0 1.5367e-01 1.1 0.00e+00 0.0 9.0e+01 6.5e+01 8.0e+00 12  0  2  1  3  12  0  2  1  3     0
> MatGetRowIJ            1 1.0 1.2759e-02159.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.8199e-0221.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecTDot              156 1.0 1.3093e-02 6.1 5.45e+05 1.0 0.0e+00 0.0e+00 1.6e+02  1 17  0  0 62   1 17  0  0 62   646
> VecNorm               79 1.0 5.2373e-01 1.0 2.76e+05 1.0 0.0e+00 0.0e+00 7.9e+01 42  9  0  0 31  42  9  0  0 31     8
> VecSet                80 1.0 2.1215e-0229.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              156 1.0 2.5283e-03 1.7 5.45e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0 17  0  0  0   0 17  0  0  0  3346
> VecAYPX               77 1.0 1.5826e-03 2.6 2.69e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  2639
> VecScatterBegin       78 1.0 7.8273e-0326.8 0.00e+00 0.0 3.5e+03 2.5e+02 0.0e+00  0  0 98 99  0   0  0 98 99  0     0
> VecScatterEnd         78 1.0 4.8130e-0344.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               2 1.0 1.9786e-0232.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> KSPSolve               1 1.0 6.7540e-01 1.0 3.87e+06 1.8 3.5e+03 2.5e+02 2.4e+02 54100 98 99 93  54100 98 99 94    74
> PCSetUp                2 1.0 9.6539e-02 1.2 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     2
> PCSetUpOnBlocks        1 1.0 5.1548e-02 1.8 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     4
> PCApply               78 1.0 1.7296e-02 5.3 1.14e+06 3.9 0.0e+00 0.0e+00 0.0e+00  1 24  0  0  0   1 24  0  0  0   702
> ------------------------------------------------------------------------------------------------------------------------
> 
> 
> 
> 
> > On Nov 17, 2016, at 6:28 AM, Ivano Barletta <ibarletta at inogs.it> wrote:
> >
> > Dear Petsc users
> >
> > My aim is to replace the linear solver of an ocean model with PETSc, to see if
> > there is room for performance improvement.
> >
> > The linear system solves an elliptic equation, and the existing solver is a
> > preconditioned conjugate gradient (PCG) with simple diagonal preconditioning.
> > The size of the matrix is roughly 27000.
> >
> > Prior to nesting PETSc into the model, I've built a simple test case where
> > the same system is solved by both methods.
> >
> > I've noticed that, compared to the existing solver (PCG), the PETSc performance
> > results are quite disappointing.
> >
> > PCG does not scale that much, but its solution time remains below
> > 4-5e-2 seconds.
> > The PETSc solution time, instead, increases the more CPUs I use
> > (see the output of -log_view in the attachments).
> >
> > I've only tried to change the KSP solver (gmres, cg, and bcgs, with no
> > improvement), and the preconditioning is the PETSc default. Maybe these
> > options don't suit my problem very well, but I don't think this alone
> > justifies this strange behavior.
> >
> > I've tried to provide d_nnz and o_nnz with the exact number of nonzeros in the
> > preallocation phase, but there was no gain in this case either.
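> >
> > A simplified sketch of what I mean (not the exact code; mlocal, d_nnz and o_nnz
> > are the local row count and the per-row diagonal/off-diagonal nonzero counts I
> > compute beforehand):
> >
> >   ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
> >   ierr = MatSetSizes(A,mlocal,mlocal,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr);
> >   ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
> >   ierr = MatMPIAIJSetPreallocation(A,0,d_nnz,0,o_nnz);CHKERRQ(ierr);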
> >
> > At this point, my question is, what am I doing wrong?
> >
> > Do you think that the problem is too small for PETSc to
> > have any effect?
> >
> > Thanks in advance
> > Ivano
> >
> > <petsc_time_8><petsc_time_4><petsc_time_16>
> 
> 
> <petsc_log_8><petsc_log_4><petsc_log_16>


