[petsc-users] GAMG speed
Michele Rosso
mrosso at uci.edu
Wed Aug 14 14:34:27 CDT 2013
Jed,
thank you. I will proceed with the multiphase test case, let you know
how it goes, and, if possible, eventually try the matrix-free approach.
In the meantime I would like to give FMG with high-order prolongation a
try: could you explain how I can do that?
Thank you.
Michele
On 08/14/2013 04:35 AM, Jed Brown wrote:
> Please always use "reply-all" so that your messages go to the list.
> This is standard mailing list etiquette. It is important to preserve
> threading for people who find this discussion later and so that we do
> not waste our time re-answering the same questions that have already
> been answered in private side-conversations. You'll likely get an
> answer faster that way too.
>
> Michele Rosso <mrosso at uci.edu> writes:
>
>> Jed,
>>
>> thank you very much for the detailed analysis.
>> I confirm that
>>
>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>> -mg_levels_ksp_max_it 1
>>
>> results in a faster solve with the 256^3 grid (it finally beats CG + ICC).
>>
>> Yes, I perform different solves: I set up the matrix and the KSP only once at the beginning of the run and then re-use them
>> at each time step. The rhs changes during the simulation, though. For now (single-phase flow) the matrix does not change,
>> but I will soon be dealing with a multiphase flow, and then not even the matrix values will be constant in time (it will be a variable-coefficient Poisson equation).
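>> To be concrete, the structure of my time loop is roughly the following (a simplified sketch rather than my actual code; error checking and the matrix/rhs assembly are omitted, and ComputeRHS is just a placeholder):
>>
>>   #include <petscksp.h>
>>
>>   KSP      ksp;
>>   Mat      A;                      /* assembled Poisson matrix, built once */
>>   Vec      x, b;
>>   PetscInt step, nsteps = 1000;
>>
>>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>>   KSPSetOperators(ksp, A, A);      /* older PETSc versions take an extra MatStructure flag */
>>   KSPSetFromOptions(ksp);          /* picks up -pc_type mg -pc_mg_galerkin ... */
>>
>>   for (step = 0; step < nsteps; step++) {
>>     ComputeRHS(step, b);           /* placeholder: only the rhs changes in time */
>>     KSPSolve(ksp, b, x);           /* setup from the first solve is re-used */
>>   }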
> Okay, the coefficient variation from the multiphase flow can drastically
> change the characteristics of the solve.
>
>> I need to solve up to the discretization error, so maybe FMG is worth a try.
>> The matrix-free approach is appealing given that the Poisson solver is truly mission-critical (it accounts for most of the simulation time).
>> I will use the level-set and ghost-fluid methods to account for the discontinuities at the interface between phases: since the computation of the matrix and rhs
>> values will be influenced by these methods, my only concern is to be sure that matrix-free can be used in these circumstances.
> Matrix-free can be used in principle, but those problems can be several
> orders of magnitude more ill-conditioned, so don't invest any more time
> on it right now. Get the discretization set up using assembled
> matrices, then go through the options we've tried to find an efficient
> solver. The best choice will likely depend on the details of the
> formulation, the types of fluids involved, and the geometric
> configuration of the fluids.
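> For reference, the best solver settings so far from this thread would be run along these lines (executable name and process count below are placeholders):
>
>   mpiexec -n 8 ./your_solver \
>       -pc_type mg -pc_mg_galerkin -pc_mg_levels 5 \
>       -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 1 \
>       -ksp_monitor -ksp_converged_reason -log_summary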
>
>> I do not have any prior experience with matrix-free methods so I will have to rely on your assistance for this.
>> Thank you very much.
>>
>> Michele
>>
>>
>>
>>
>> On 08/13/2013 08:36 PM, Jed Brown wrote:
>>> Michele Rosso <mrosso at uci.edu> writes:
>>>
>>>> Hi Jed,
>>>>
>>>> I attached the output for both the runs you suggested. At the beginning
>>>> of each file I included the options I used.
>>>>
>>>> On a side note, I tried to run with a 256^3 grid (exactly as before)
>>>> but with fewer levels, i.e. 3 instead of 4 or 5.
>>>> My system stops the run because of an out-of-memory condition. It is
>>>> really odd since I have not changed anything except -pc_mg_levels.
>>>> I cannot send you any output since there is none. Do you have any
>>>> guess where the problem comes from?
>>> The selected algorithm does a direct solve on the coarse grid. Each
>>> time you reduce the number of levels, the coarse grid size grows by a
>>> factor of 8. Going from 5 to 3 levels is going from a 16^3 coarse grid
>>> to a 64^3 coarse grid. Applying a direct solver to the latter ends up
>>> using a lot of memory. I think this is not worth bothering with and it
>>> might even be (slightly) faster to use 6 levels. That is not where the
>>> time is being spent.
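>>> (If you want to try that, just bump the level count in the same options, e.g.
>>>
>>>   -pc_type mg -pc_mg_galerkin -pc_mg_levels 6 -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 1
>>>
>>> with everything else unchanged.)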
>>>
>>>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>>>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>>>> -mg_levels_ksp_max_it 1
>>>>
>>>>
>>>>
>>>> 0 KSP Residual norm 3.653965664551e-05
>>>> 1 KSP Residual norm 1.910638846094e-06
>>>> 2 KSP Residual norm 8.690440116045e-08
>>>> 3 KSP Residual norm 3.732213639394e-09
>>>> 4 KSP Residual norm 1.964855338020e-10
>>> This converges well.
>>>
>>>> Max Max/Min Avg Total
>>>> Time (sec): 4.048e+00 1.00012 4.048e+00
>>>> Objects: 2.490e+02 1.00000 2.490e+02
>>>> Flops: 2.663e+08 1.00000 2.663e+08 2.130e+09
>>>> Flops/sec: 6.579e+07 1.00012 6.579e+07 5.263e+08
>>>> MPI Messages: 6.820e+02 1.00000 6.820e+02 5.456e+03
>>>> MPI Message Lengths: 8.245e+06 1.00000 1.209e+04 6.596e+07
>>>> MPI Reductions: 4.580e+02 1.00000
>>>> VecTDot 12 1.0 2.9428e-02 1.2 6.29e+06 1.0 0.0e+00 0.0e+00 1.2e+01 1 2 0 0 3 1 2 0 0 3 1710
>>>> VecNorm 9 1.0 1.0796e-02 1.2 4.72e+06 1.0 0.0e+00 0.0e+00 9.0e+00 0 2 0 0 2 0 2 0 0 2 3497
>>>> VecScale 24 1.0 2.4652e-04 1.1 1.99e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 6442
>>>> VecCopy 3 1.0 5.0740e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecSet 116 1.0 1.4349e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecAXPY 12 1.0 2.8027e-02 1.0 6.29e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1796
>>>> VecAYPX 29 1.0 3.0655e-02 1.4 4.16e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1085
>>>> VecScatterBegin 123 1.0 3.5391e-02 1.1 0.00e+00 0.0 3.5e+03 1.2e+04 0.0e+00 1 0 65 66 0 1 0 65 66 0 0
>>>> VecScatterEnd 123 1.0 2.5395e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatMult 31 1.0 2.3556e-01 1.0 5.62e+07 1.0 1.0e+03 2.3e+04 0.0e+00 6 21 19 36 0 6 21 19 36 0 1908
>>>> MatMultAdd 24 1.0 5.9044e-02 1.0 1.21e+07 1.0 5.8e+02 2.8e+03 0.0e+00 1 5 11 2 0 1 5 11 2 0 1644
>>>> MatMultTranspose 28 1.0 7.4601e-02 1.1 1.42e+07 1.0 6.7e+02 2.8e+03 0.0e+00 2 5 12 3 0 2 5 12 3 0 1518
>>>> MatSolve 6 1.0 3.8311e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 3006
>>>> MatSOR 48 1.0 5.8050e-01 1.0 1.01e+08 1.0 8.6e+02 1.5e+04 4.8e+01 14 38 16 19 10 14 38 16 19 11 1390
>>> Most of the solve time is in MatSOR and MatMult. That's expected since
>>> the subdomains are pretty big.
>>>
>>>> MatLUFactorSym 1 1.0 3.0620e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 1 0 0 0 0 1 0
>>>> MatLUFactorNum 1 1.0 2.4665e-02 1.0 1.95e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 7 0 0 0 1 7 0 0 0 6329
>>>> MatAssemblyBegin 20 1.0 2.4351e-02 6.7 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+01 0 0 0 0 5 0 0 0 0 5 0
>>>> MatAssemblyEnd 20 1.0 1.3176e-01 1.0 0.00e+00 0.0 5.6e+02 2.1e+03 7.2e+01 3 0 10 2 16 3 0 10 2 16 0
>>>> MatGetRowIJ 1 1.0 1.1516e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatGetOrdering 1 1.0 4.1008e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatView 16 1.3 1.0209e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 3 0 0 0 0 3 0
>>>> MatPtAP 4 1.0 6.4001e-01 1.0 4.06e+07 1.0 1.1e+03 1.7e+04 1.0e+02 16 15 21 30 22 16 15 21 30 22 507
>>> MatPtAP dominates the setup time. For profiling, you could register a
>>> stage (PetscLogStageRegister) and time the setup separately from the
>>> solve.
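>>> Something like this (untested sketch):
>>>
>>>   PetscLogStage setup_stage, solve_stage;
>>>   PetscLogStageRegister("MG Setup", &setup_stage);
>>>   PetscLogStageRegister("MG Solve", &solve_stage);
>>>
>>>   PetscLogStagePush(setup_stage);
>>>   KSPSetUp(ksp);                 /* triggers PCSetUp, including the MatPtAP products */
>>>   PetscLogStagePop();
>>>
>>>   PetscLogStagePush(solve_stage);
>>>   KSPSolve(ksp, b, x);
>>>   PetscLogStagePop();
>>>
>>> -log_summary will then report each stage in a separate table.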
>>>
>>>> MatPtAPSymbolic 4 1.0 3.7003e-01 1.0 0.00e+00 0.0 7.2e+02 2.0e+04 6.0e+01 9 0 13 22 13 9 0 13 22 13 0
>>>> MatPtAPNumeric 4 1.0 2.7004e-01 1.0 4.06e+07 1.0 4.2e+02 1.2e+04 4.0e+01 7 15 8 8 9 7 15 8 8 9 1202
>>>> MatGetRedundant 1 1.0 7.9393e-04 1.0 0.00e+00 0.0 1.7e+02 7.1e+03 4.0e+00 0 0 3 2 1 0 0 3 2 1 0
>>>> MatGetLocalMat 4 1.0 3.9521e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 1 0 0 0 2 1 0 0 0 2 0
>>>> MatGetBrAoCol 4 1.0 1.7719e-02 1.0 0.00e+00 0.0 4.3e+02 2.7e+04 8.0e+00 0 0 8 18 2 0 0 8 18 2 0
>>>> MatGetSymTrans 8 1.0 1.3007e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> KSPSetUp 7 1.0 1.3097e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 5 0 0 0 0 5 0
>>>> KSPSolve 2 1.0 1.0450e+00 1.0 2.04e+08 1.0 3.4e+03 1.2e+04 7.5e+01 26 77 62 60 16 26 77 62 60 16 1563
>>>> PCSetUp 1 1.0 8.6248e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 21 23 35 32 69 21 23 35 32 69 576
>>>> PCApply 6 1.0 8.4384e-01 1.0 1.61e+08 1.0 3.2e+03 9.0e+03 4.8e+01 21 60 59 44 10 21 60 59 44 11 1523
>>> Do you know why there are 6 PCApply events? With four iterations of the
>>> Krylov method, there should be only 5 events. Oh, it looks like you do
>>> two solves. Is one of those with a different system?
>>>
>>> Recall that the old KSPSolve time was over 3.35 seconds.
>>>
>>>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>>>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>>>> -mg_levels_ksp_max_it 1 -pc_mg_type full
>>>>
>>>> 0 KSP Residual norm 3.654533581988e-05
>>>> 1 KSP Residual norm 8.730776244351e-07
>>>> 2 KSP Residual norm 3.474626061661e-08
>>>> 3 KSP Residual norm 1.813665557493e-09
>>> This converges slightly faster, but ends up not paying off.
>>>
>>>> Time (sec): 4.261e+00 1.00012 4.261e+00
>>>> Objects: 2.950e+02 1.00000 2.950e+02
>>>> Flops: 3.322e+08 1.00000 3.322e+08 2.658e+09
>>>> Flops/sec: 7.797e+07 1.00012 7.796e+07 6.237e+08
>>>> MPI Messages: 1.442e+03 1.00000 1.442e+03 1.154e+04
>>>> MPI Message Lengths: 1.018e+07 1.00000 7.057e+03 8.141e+07
>>>> MPI Reductions: 5.460e+02 1.00000
>>> More messages, more work, etc., so not better.
>>>
>>>> KSPSolve 2 1.0 1.2287e+00 1.0 2.70e+08 1.0 9.5e+03 5.8e+03 1.6e+02 29 81 82 68 30 29 81 82 68 30 1758
>>>> PCSetUp 1 1.0 8.6414e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 20 19 17 26 58 20 19 17 26 58 575
>>>> PCApply 5 1.0 1.0571e+00 1.0 2.33e+08 1.0 9.3e+03 4.9e+03 1.4e+02 24 70 81 56 26 24 70 81 56 26 1764
>>> It's still entirely possible that you can make Full MG beat V-cycles,
>>> especially if you only need to converge up to discretization error. By
>>> my figures, your good solver takes 12 work units to converge well below
>>> discretization error (after Galerkin setup, but maybe you only need to
>>> do that once?). If you only need to equal truncation error, this can be
>>> brought down to about 5 (probably at best a 2x speedup in parallel).
>>> This would involve a high-order (cubic) FMG prolongation.
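>>> One way to experiment with that (a rough sketch only; BuildCubicInterpolation
>>> is a routine you would have to write yourself, it is not part of PETSc) is to
>>> switch the cycle type to full MG and supply your own interpolation per level:
>>>
>>>   PC       pc;
>>>   Mat      P;
>>>   PetscInt l, nlevels = 5;
>>>
>>>   KSPGetPC(ksp, &pc);
>>>   PCMGSetType(pc, PC_MG_FULL);            /* same as -pc_mg_type full */
>>>   for (l = 1; l < nlevels; l++) {
>>>     BuildCubicInterpolation(l, &P);       /* user routine: cubic prolongation from level l-1 to l */
>>>     PCMGSetInterpolation(pc, l, P);
>>>     MatDestroy(&P);                       /* PCMG keeps its own reference */
>>>   }
>>>
>>> Note that by default the restriction is taken as the transpose of the
>>> interpolation you provide.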
>>>
>>> Alternatively, you can speed up the implementation (and significantly
>>> reduce memory usage) by creating geometric coarse levels and a
>>> matrix-free implementation of MatSOR and MatMult. (The matrices are
>>> great for experimenting, but if this solver is mission critical and
>>> still a bottleneck, the matrix is an inefficient way to represent the
>>> operator since it has very low arithmetic intensity/requires a lot of
>>> memory bandwidth.) I predict you can probably speed up the solve by
>>> perhaps another factor of 2 with a good matrix-free FMG implementation.
>>> Do you want to go down this path?
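>>> (For concreteness, the skeleton would be a MATSHELL whose MatMult and MatSOR
>>> apply the stencil directly. This is only a sketch of the idea: UserMult and
>>> UserSOR are routines you would write, and the sizes come from your DMDA.
>>>
>>>   PetscErrorCode UserMult(Mat A, Vec x, Vec y);   /* applies the 7-point stencil */
>>>   PetscErrorCode UserSOR(Mat A, Vec b, PetscReal omega, MatSORType flag,
>>>                          PetscReal shift, PetscInt its, PetscInt lits, Vec x);
>>>
>>>   Mat      A;
>>>   void    *ctx;                  /* coefficients, grid info, ... */
>>>   PetscInt nlocal, N;            /* local and global sizes from the DMDA */
>>>
>>>   MatCreateShell(PETSC_COMM_WORLD, nlocal, nlocal, N, N, ctx, &A);
>>>   MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMult);
>>>   MatShellSetOperation(A, MATOP_SOR,  (void (*)(void))UserSOR);
>>>   KSPSetOperators(ksp, A, A);
>>>
>>> With a shell matrix you can no longer use -pc_mg_galerkin; the coarse
>>> operators would come from the geometric hierarchy instead.)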