[petsc-users] Suspect Poor Performance of Petsc
Barry Smith
bsmith at mcs.anl.gov
Thu Nov 17 13:12:58 CST 2016
> On Nov 17, 2016, at 12:25 PM, Ivano Barletta <ibarletta at inogs.it> wrote:
>
> Thank you for your replies
>
> I've carried out 3 new tests with 4, 8, and 16 cores, adding the
> code lines suggested by Barry (logs in attachment).
>
> The lack of scaling still persists, but maybe it is just related to the
> size of the problem.
>
> As far as balancing is concerned, there's not very much I can do. I
> believe it depends on land/sea points (on land the matrix coefficients
> are zero), which is something I cannot control.
Sure you can. You need to divide the rows between processes to balance the work per process (not just the number of rows per process); to first order you can assume the work is proportional to the number of nonzeros in the matrix on that process. Thus you want to divide up the rows so each process has roughly the same number of nonzeros (this is not perfect but is a start).
VecNorm 65 1.0 7.4379e-02 1.0 8.87e+05 1.0 0.0e+00 0.0e+00 6.5e+01 72 9 0 0 31 82 9 0 0 34 47
VecNorm 67 1.0 7.8646e-02 1.0 4.62e+05 1.0 0.0e+00 0.0e+00 6.7e+01 43 9 0 0 31 72 9 0 0 34 46
VecNorm 67 1.0 2.0528e-01 1.2 2.34e+05 1.0 0.0e+00 0.0e+00 6.7e+01 32 9 0 0 31 63 9 0 0 34 18
The time "spent" in VecNorm goes from 63 to 82 percent of the total solve time. This is due to the load imbalance. You absolutely have to deal properly with the load imbalance or you will never get good parallel speedup (regardless of problem size).
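
As a minimal sketch of what balancing by nonzeros could look like (this is not code from the thread; the function name, N, and the replicated nnz[] array are illustrative assumptions), one could choose contiguous row ranges like this:

#include <petscsys.h>

/* Minimal sketch: given the nonzero count of every global row (replicated on
   all ranks), choose a contiguous block of rows for this rank so each rank
   owns roughly total_nnz/size nonzeros. The result in *nlocal would then be
   passed to MatSetSizes()/VecSetSizes() instead of PETSC_DECIDE. */
PetscErrorCode ComputeBalancedLocalRows(MPI_Comm comm,PetscInt N,const PetscInt nnz[],PetscInt *nlocal)
{
  PetscMPIInt    rank,size;
  PetscInt       i,r = 0,total = 0,target,running = 0,count = 0;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
  for (i=0; i<N; i++) total += nnz[i];
  target  = total/size + 1;         /* nonzeros each rank should roughly own */
  *nlocal = 0;
  for (i=0; i<N; i++) {             /* cut a new range whenever the running  */
    running += nnz[i];              /* nonzero count reaches the target      */
    count++;
    if (running >= target && r < size-1) {
      if (r == rank) *nlocal = count;
      running = 0; count = 0; r++;
    }
  }
  if (r == rank) *nlocal = count;   /* the last range takes the remaining rows */
  PetscFunctionReturn(0);
}

This is only the first-order split by nonzeros described above, not a perfect balance.
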
Note: when you run on one process with -pc_type jacobi, the KSPSolve time should be pretty close to the time of your other solver; if it is much larger, then we need to figure out what extra work the PETSc solver is doing, because the time for a basic solve using the same preconditioner with two different solvers is dictated by hardware, not software.
Barry
>
> By the way, I mean to carry out some tests on higher resolution
> configurations, with a bigger problem size.
>
> Kind Regards
> Ivano
>
>
> 2016-11-17 17:24 GMT+01:00 Barry Smith <bsmith at mcs.anl.gov>:
>
> Ivano,
>
> I have cut and pasted the relevant parts of the logs below and removed a few irrelevant lines to make the analysis simpler.
>
> There is a lot of bad stuff going on that is hurting performance.
>
> 1) The percentage of time spent in the linear solve is getting lower, going from 79% with 4 processes to 54% with 16 processes. This means the rest of the code is not scaling well; likely that is generating the matrix. How are you getting the matrix into the program? If you are reading it as ASCII (somehow in parallel?) you should not do that. You should use MatLoad() to get the matrix in efficiently (see for example src/ksp/ksp/examples/tutorials/ex10.c).
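>
> A rough sketch of the MatLoad() path (assuming the matrix has already been written once in PETSc binary format; the file name "matrix.dat" is just a placeholder):
>
> Mat            A;
> PetscViewer    viewer;
> PetscErrorCode ierr;
>
> /* every process opens the same binary file; PETSc distributes the rows */
> ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"matrix.dat",FILE_MODE_READ,&viewer);CHKERRQ(ierr);
> ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
> ierr = MatSetFromOptions(A);CHKERRQ(ierr);
> ierr = MatLoad(A,viewer);CHKERRQ(ierr);
> ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);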
>
> To make the analysis better you need to add the following around the KSP solve:
>
> ierr = PetscLogStageRegister("Solve", &stage);CHKERRQ(ierr);
> ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
> ierr = KSPSolve(ksp,bb,xx);CHKERRQ(ierr);
> ierr = PetscLogStagePop();CHKERRQ(ierr);
>
> and rerun the three cases.
>
> 2) The load balance is bad even for four processes. For example, it is 1.3 in MatSolve; it should be really close to 1.0. How are you dividing the matrix up between processes?
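>
> A quick way to see the current division (an illustrative sketch; A is assumed to be your assembled matrix) is to print the rows and nonzeros owned by each process:
>
> PetscInt    rstart,rend;
> MatInfo     info;
> PetscMPIInt rank;
>
> ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
> ierr = MatGetOwnershipRange(A,&rstart,&rend);CHKERRQ(ierr);   /* this rank's row range      */
> ierr = MatGetInfo(A,MAT_LOCAL,&info);CHKERRQ(ierr);           /* this rank's nonzero count  */
> ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,"[%d] rows %D nonzeros %g\n",rank,rend-rstart,info.nz_used);CHKERRQ(ierr);
> ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD,PETSC_STDOUT);CHKERRQ(ierr);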
>
> 3) It is spending a HUGE amount of time in VecNorm(): 26% on 4 processes and 42% on 16 processes. This could be partially or completely due to the load imbalance, but there might be other issues as well.
>
> Run with -ksp_norm_type natural in your new set of runs.
>
> Also, always run with -ksp_type cg; it makes no sense to use gmres or the other KSP methods.
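>
> (These two options can also be hard-wired in the code; a minimal sketch, assuming ksp is your KSP object:)
>
> ierr = KSPSetType(ksp,KSPCG);CHKERRQ(ierr);                /* conjugate gradient                  */
> ierr = KSPSetNormType(ksp,KSP_NORM_NATURAL);CHKERRQ(ierr); /* same as -ksp_norm_type natural      */
> ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);               /* command-line options still override */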
>
> Eagerly awaiting your response.
>
> Barry
>
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> MatMult 75 1.0 4.1466e-03 1.2 3.70e+06 1.6 6.0e+02 4.4e+02 0.0e+00 13 25 97 99 0 13 25 97 99 0 2868
> MatSolve 75 1.0 6.1995e-03 1.3 3.68e+06 1.6 0.0e+00 0.0e+00 0.0e+00 19 24 0 0 0 19 24 0 0 0 1908
> MatLUFactorNum 1 1.0 3.6880e-04 1.4 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 499
> MatILUFactorSym 1 1.0 1.7040e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 1 1.0 2.5113e-04 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 1 1 0 0 0 1 0
> MatAssemblyEnd 1 1.0 1.8365e-03 1.0 0.00e+00 0.0 1.6e+01 1.1e+02 8.0e+00 6 0 3 1 3 6 0 3 1 3 0
> MatGetRowIJ 1 1.0 2.2865e-0517.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 6.1687e-05 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecTDot 150 1.0 3.2991e-03 2.7 2.05e+06 1.0 0.0e+00 0.0e+00 1.5e+02 8 17 0 0 62 8 17 0 0 62 2466
> VecNorm 76 1.0 7.5034e-03 1.0 1.04e+06 1.0 0.0e+00 0.0e+00 7.6e+01 26 9 0 0 31 26 9 0 0 31 549
> VecSet 77 1.0 2.4495e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecAXPY 150 1.0 7.8158e-04 1.1 2.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00 3 17 0 0 0 3 17 0 0 0 10409
> VecAYPX 74 1.0 6.8849e-04 1.0 1.01e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 8 0 0 0 2 8 0 0 0 5829
> VecScatterBegin 75 1.0 1.7794e-04 1.2 0.00e+00 0.0 6.0e+02 4.4e+02 0.0e+00 1 0 97 99 0 1 0 97 99 0 0
> VecScatterEnd 75 1.0 2.1674e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> KSPSetUp 2 1.0 1.4922e-04 4.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 2.2833e-02 1.0 1.36e+07 1.3 6.0e+02 4.4e+02 2.3e+02 79100 97 99 93 79100 97 99 93 2116
> PCSetUp 2 1.0 1.0116e-03 1.2 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 182
> PCSetUpOnBlocks 1 1.0 6.2872e-04 1.2 5.81e+04 1.7 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 293
> PCApply 75 1.0 7.2835e-03 1.3 3.68e+06 1.6 0.0e+00 0.0e+00 0.0e+00 22 24 0 0 0 22 24 0 0 0 1624
>
> MatMult 77 1.0 3.5985e-03 1.2 2.18e+06 2.4 1.5e+03 3.8e+02 0.0e+00 1 25 97 99 0 1 25 97 99 0 3393
> MatSolve 77 1.0 3.8145e-03 1.4 2.16e+06 2.4 0.0e+00 0.0e+00 0.0e+00 1 24 0 0 0 1 24 0 0 0 3163
> MatLUFactorNum 1 1.0 9.3037e-04 1.9 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 196
> MatILUFactorSym 1 1.0 2.1638e-03 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 1 1.0 1.9466e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 1 1 0 0 0 1 0
> MatAssemblyEnd 1 1.0 2.1234e-02 1.0 0.00e+00 0.0 4.0e+01 9.6e+01 8.0e+00 8 0 3 1 3 8 0 3 1 3 0
> MatGetRowIJ 1 1.0 1.0025e-0312.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 1.4848e-03 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecTDot 154 1.0 4.1220e-03 1.8 1.06e+06 1.0 0.0e+00 0.0e+00 1.5e+02 1 17 0 0 62 1 17 0 0 62 2026
> VecNorm 78 1.0 1.5534e-01 1.0 5.38e+05 1.0 0.0e+00 0.0e+00 7.8e+01 60 9 0 0 31 60 9 0 0 31 27
> VecSet 79 1.0 1.5549e-03 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 154 1.0 8.0559e-04 1.2 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 17 0 0 0 0 17 0 0 0 10368
> VecAYPX 76 1.0 5.8600e-04 1.4 5.24e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 7034
> VecScatterBegin 77 1.0 8.4793e-04 3.7 0.00e+00 0.0 1.5e+03 3.8e+02 0.0e+00 0 0 97 99 0 0 0 97 99 0 0
> VecScatterEnd 77 1.0 7.7019e-04 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 2 1.0 1.1451e-03 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 1.8231e-01 1.0 7.49e+06 1.5 1.5e+03 3.8e+02 2.3e+02 71100 97 99 93 71100 97 99 94 272
> PCSetUp 2 1.0 1.0994e-02 1.1 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 17
> PCSetUpOnBlocks 1 1.0 4.9001e-03 1.2 3.37e+04 2.6 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 37
> PCApply 77 1.0 5.2556e-03 1.3 2.16e+06 2.4 0.0e+00 0.0e+00 0.0e+00 2 24 0 0 0 2 24 0 0 0 2296
>
> MatMult 78 1.0 1.2783e-02 4.8 1.16e+06 3.9 3.5e+03 2.5e+02 0.0e+00 1 25 98 99 0 1 25 98 99 0 968
> MatSolve 78 1.0 1.4015e-0214.0 1.14e+06 3.9 0.0e+00 0.0e+00 0.0e+00 0 24 0 0 0 0 24 0 0 0 867
> MatLUFactorNum 1 1.0 1.0275e-0240.1 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 18
> MatILUFactorSym 1 1.0 2.0541e-0213.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 1 1.0 2.1347e-02 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 1 1 0 0 0 1 0
> MatAssemblyEnd 1 1.0 1.5367e-01 1.1 0.00e+00 0.0 9.0e+01 6.5e+01 8.0e+00 12 0 2 1 3 12 0 2 1 3 0
> MatGetRowIJ 1 1.0 1.2759e-02159.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 1.8199e-0221.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecTDot 156 1.0 1.3093e-02 6.1 5.45e+05 1.0 0.0e+00 0.0e+00 1.6e+02 1 17 0 0 62 1 17 0 0 62 646
> VecNorm 79 1.0 5.2373e-01 1.0 2.76e+05 1.0 0.0e+00 0.0e+00 7.9e+01 42 9 0 0 31 42 9 0 0 31 8
> VecSet 80 1.0 2.1215e-0229.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecAXPY 156 1.0 2.5283e-03 1.7 5.45e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 17 0 0 0 0 17 0 0 0 3346
> VecAYPX 77 1.0 1.5826e-03 2.6 2.69e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 2639
> VecScatterBegin 78 1.0 7.8273e-0326.8 0.00e+00 0.0 3.5e+03 2.5e+02 0.0e+00 0 0 98 99 0 0 0 98 99 0 0
> VecScatterEnd 78 1.0 4.8130e-0344.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 2 1.0 1.9786e-0232.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> KSPSolve 1 1.0 6.7540e-01 1.0 3.87e+06 1.8 3.5e+03 2.5e+02 2.4e+02 54100 98 99 93 54100 98 99 94 74
> PCSetUp 2 1.0 9.6539e-02 1.2 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 2
> PCSetUpOnBlocks 1 1.0 5.1548e-02 1.8 1.76e+04 4.5 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 4
> PCApply 78 1.0 1.7296e-02 5.3 1.14e+06 3.9 0.0e+00 0.0e+00 0.0e+00 1 24 0 0 0 1 24 0 0 0 702
> ------------------------------------------------------------------------------------------------------------------------
>
>
>
>
> > On Nov 17, 2016, at 6:28 AM, Ivano Barletta <ibarletta at inogs.it> wrote:
> >
> > Dear PETSc users
> >
> > My aim is to replace the linear solver of an ocean model with PETSc, to see if
> > there is room for improvement in performance.
> >
> > The linear system solves an elliptic equation, and the former solver is a
> > Preconditioned Conjugate Gradient with simple diagonal preconditioning.
> > The size of the matrix is roughly 27000.
> >
> > Prior to nesting PETSc into the model, I've built a simple test case where
> > the same system is solved by both methods.
> >
> > I've noticed that, compared to the former solver (PCG), the PETSc performance
> > results are quite disappointing.
> >
> > PCG does not scale that much, but its solution time remains below
> > 4-5e-2 seconds.
> > The PETSc solution time, instead, increases the more CPUs I use
> > (see the output of -log_view in the attachments).
> >
> > I've only tried to change the KSP solver (gmres, cg, and bcgs, with no
> > improvement), and the preconditioning is the PETSc default. Maybe these
> > options don't suit my problem very well, but I don't think this alone
> > justifies this strange behavior.
> >
> > I've tried to provide d_nnz and o_nnz with the exact number of nonzeros in the
> > preallocation phase, but there was no gain in this case either.
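> >
> > (For reference, a minimal sketch of that kind of preallocation, assuming a Mat A, nloc local rows, and the per-row arrays d_nnz[]/o_nnz[] computed beforehand:)
> >
> > ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
> > ierr = MatSetSizes(A,nloc,nloc,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr);
> > ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
> > /* exact per-row counts: diagonal block (d_nnz) and off-diagonal block (o_nnz) */
> > ierr = MatMPIAIJSetPreallocation(A,0,d_nnz,0,o_nnz);CHKERRQ(ierr);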
> >
> > At this point, my question is, what am I doing wrong?
> >
> > Do you think the problem is too small for PETSc to
> > have any effect?
> >
> > Thanks in advance
> > Ivano
> >
> > <petsc_time_8><petsc_time_4><petsc_time_16>
>
>
> <petsc_log_8><petsc_log_4><petsc_log_16>