[petsc-users] GAMG speed
Michele Rosso
mrosso at uci.edu
Wed Aug 14 14:34:27 CDT 2013
Jed,
thank you. I will proceed with the multiphase test case, let you know
how it goes, and, if possible, eventually try the matrix-free approach.
In the meantime I would like to give FMG with high-order prolongation a
try: could you explain how I can do that?
Thank you.
Michele
On 08/14/2013 04:35 AM, Jed Brown wrote:
> Please always use "reply-all" so that your messages go to the list.
> This is standard mailing list etiquette. It is important to preserve
> threading for people who find this discussion later and so that we do
> not waste our time re-answering the same questions that have already
> been answered in private side-conversations. You'll likely get an
> answer faster that way too.
>
> Michele Rosso <mrosso at uci.edu> writes:
>
>> Jed,
>>
>> thank you very much for the detailed analysis.
>> I confirm that
>>
>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>> -mg_levels_ksp_max_it 1
>>
>> results in a faster solve with the 256^3 grid (it finally beats CG + ICC).
>>
>> Yes, I perform different solves: I set up the matrix and the KSP only once at the beginning of the run and then re-use them
>> at each time step. The rhs changes during the simulation, though. For now (single-phase flow) the matrix does not change,
>> but I will soon be dealing with a multiphase flow, and then not even the matrix values will be constant in time (it will be a variable-coefficient Poisson equation).
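>> To be concrete, the structure of my time loop is roughly the following (a simplified sketch rather than my actual code; error checking and the matrix/rhs assembly are omitted, and ComputeRHS is just a placeholder):
>>
>>   #include <petscksp.h>
>>
>>   KSP      ksp;
>>   Mat      A;                      /* assembled Poisson matrix, built once */
>>   Vec      x, b;
>>   PetscInt step, nsteps = 1000;
>>
>>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>>   KSPSetOperators(ksp, A, A);      /* older PETSc versions take an extra MatStructure flag */
>>   KSPSetFromOptions(ksp);          /* picks up -pc_type mg -pc_mg_galerkin ... */
>>
>>   for (step = 0; step < nsteps; step++) {
>>     ComputeRHS(step, b);           /* placeholder: only the rhs changes in time */
>>     KSPSolve(ksp, b, x);           /* setup from the first solve is re-used */
>>   }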
> Okay, the coefficient variation from the multiphase flow can drastically
> change the characteristics of the solve.
>
>> I need to solve up to the discretization error, so maybe FMG is worth a try.
>> The matrix-free approach is appealing given that the Poisson solver is truly mission-critical (it accounts for most of the simulation time).
>> I will use the level-set and ghost-fluid methods to account for the discontinuities at the interface between phases: since the computation of the matrix and rhs
>> values will be influenced by these methods, my only concern is to be sure that matrix-free can be used in these circumstances.
> Matrix-free can be used in principle, but those problems can be several
> orders of magnitude more ill-conditioned, so don't invest any more time
> on it right now. Get the discretization set up using assembled
> matrices, then go through the options we've tried to find an efficient
> solver. The best choice will likely depend on the details of the
> formulation, the types of fluids involved, and the geometric
> configuration of the fluids.
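> For reference, the best solver settings so far from this thread would be run along these lines (executable name and process count below are placeholders):
>
>   mpiexec -n 8 ./your_solver \
>       -pc_type mg -pc_mg_galerkin -pc_mg_levels 5 \
>       -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 1 \
>       -ksp_monitor -ksp_converged_reason -log_summary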
>
>> I do not have any prior experience with matrix-free methods so I will have to rely on your assistance for this.
>> Thank you very much.
>>
>> Michele
>>
>>
>>
>>
>> On 08/13/2013 08:36 PM, Jed Brown wrote:
>>> Michele Rosso <mrosso at uci.edu> writes:
>>>
>>>> Hi Jed,
>>>>
>>>> I attached the output for both the runs you suggested. At the beginning
>>>> of each file I included the options I used.
>>>>
>>>> On a side note, I tried to run with a 256^3 grid (exactly as before)
>>>> but with fewer levels, i.e. 3 instead of 4 or 5.
>>>> My system stops the run because of an out-of-memory condition. It is
>>>> really odd since I have not changed anything except -pc_mg_levels.
>>>> I cannot send you any output since there is none. Do you have any
>>>> guess where the problem comes from?
>>> The selected algorithm does a direct solve on the coarse grid. Each
>>> time you reduce the number of levels, the coarse grid size grows by a
>>> factor of 8. Going from 5 to 3 levels is going from a 16^3 coarse grid
>>> to a 64^3 coarse grid. Applying a direct solver to the latter ends up
>>> using a lot of memory. I think this is not worth bothering with and it
>>> might even be (slightly) faster to use 6 levels. That is not where the
>>> time is being spent.
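>>> (If you want to try that, just bump the level count in the same options, e.g.
>>>
>>>   -pc_type mg -pc_mg_galerkin -pc_mg_levels 6 -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 1
>>>
>>> with everything else unchanged.)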
>>>
>>>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>>>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>>>> -mg_levels_ksp_max_it 1
>>>>
>>>>
>>>>
>>>> 0 KSP Residual norm 3.653965664551e-05
>>>> 1 KSP Residual norm 1.910638846094e-06
>>>> 2 KSP Residual norm 8.690440116045e-08
>>>> 3 KSP Residual norm 3.732213639394e-09
>>>> 4 KSP Residual norm 1.964855338020e-10
>>> This converges well.
>>>
>>>> Max Max/Min Avg Total
>>>> Time (sec): 4.048e+00 1.00012 4.048e+00
>>>> Objects: 2.490e+02 1.00000 2.490e+02
>>>> Flops: 2.663e+08 1.00000 2.663e+08 2.130e+09
>>>> Flops/sec: 6.579e+07 1.00012 6.579e+07 5.263e+08
>>>> MPI Messages: 6.820e+02 1.00000 6.820e+02 5.456e+03
>>>> MPI Message Lengths: 8.245e+06 1.00000 1.209e+04 6.596e+07
>>>> MPI Reductions: 4.580e+02 1.00000
>>>> VecTDot 12 1.0 2.9428e-02 1.2 6.29e+06 1.0 0.0e+00 0.0e+00 1.2e+01 1 2 0 0 3 1 2 0 0 3 1710
>>>> VecNorm 9 1.0 1.0796e-02 1.2 4.72e+06 1.0 0.0e+00 0.0e+00 9.0e+00 0 2 0 0 2 0 2 0 0 2 3497
>>>> VecScale 24 1.0 2.4652e-04 1.1 1.99e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 6442
>>>> VecCopy 3 1.0 5.0740e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecSet 116 1.0 1.4349e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecAXPY 12 1.0 2.8027e-02 1.0 6.29e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1796
>>>> VecAYPX 29 1.0 3.0655e-02 1.4 4.16e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 1085
>>>> VecScatterBegin 123 1.0 3.5391e-02 1.1 0.00e+00 0.0 3.5e+03 1.2e+04 0.0e+00 1 0 65 66 0 1 0 65 66 0 0
>>>> VecScatterEnd 123 1.0 2.5395e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatMult 31 1.0 2.3556e-01 1.0 5.62e+07 1.0 1.0e+03 2.3e+04 0.0e+00 6 21 19 36 0 6 21 19 36 0 1908
>>>> MatMultAdd 24 1.0 5.9044e-02 1.0 1.21e+07 1.0 5.8e+02 2.8e+03 0.0e+00 1 5 11 2 0 1 5 11 2 0 1644
>>>> MatMultTranspose 28 1.0 7.4601e-02 1.1 1.42e+07 1.0 6.7e+02 2.8e+03 0.0e+00 2 5 12 3 0 2 5 12 3 0 1518
>>>> MatSolve 6 1.0 3.8311e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 3006
>>>> MatSOR 48 1.0 5.8050e-01 1.0 1.01e+08 1.0 8.6e+02 1.5e+04 4.8e+01 14 38 16 19 10 14 38 16 19 11 1390
>>> Most of the solve time is in MatSOR and MatMult. That's expected since
>>> the subdomains are pretty big.
>>>
>>>> MatLUFactorSym 1 1.0 3.0620e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 1 0 0 0 0 1 0
>>>> MatLUFactorNum 1 1.0 2.4665e-02 1.0 1.95e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 7 0 0 0 1 7 0 0 0 6329
>>>> MatAssemblyBegin 20 1.0 2.4351e-02 6.7 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+01 0 0 0 0 5 0 0 0 0 5 0
>>>> MatAssemblyEnd 20 1.0 1.3176e-01 1.0 0.00e+00 0.0 5.6e+02 2.1e+03 7.2e+01 3 0 10 2 16 3 0 10 2 16 0
>>>> MatGetRowIJ 1 1.0 1.1516e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatGetOrdering 1 1.0 4.1008e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatView 16 1.3 1.0209e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 3 0 0 0 0 3 0
>>>> MatPtAP 4 1.0 6.4001e-01 1.0 4.06e+07 1.0 1.1e+03 1.7e+04 1.0e+02 16 15 21 30 22 16 15 21 30 22 507
>>> MatPtAP dominates the setup time. For profiling, you could register a
>>> stage (PetscLogStageRegister) and time the setup separately from the
>>> solve.
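>>> Something like this (untested sketch):
>>>
>>>   PetscLogStage setup_stage, solve_stage;
>>>   PetscLogStageRegister("MG Setup", &setup_stage);
>>>   PetscLogStageRegister("MG Solve", &solve_stage);
>>>
>>>   PetscLogStagePush(setup_stage);
>>>   KSPSetUp(ksp);                 /* triggers PCSetUp, including the MatPtAP products */
>>>   PetscLogStagePop();
>>>
>>>   PetscLogStagePush(solve_stage);
>>>   KSPSolve(ksp, b, x);
>>>   PetscLogStagePop();
>>>
>>> -log_summary will then report each stage in a separate table.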
>>>
>>>> MatPtAPSymbolic 4 1.0 3.7003e-01 1.0 0.00e+00 0.0 7.2e+02 2.0e+04 6.0e+01 9 0 13 22 13 9 0 13 22 13 0
>>>> MatPtAPNumeric 4 1.0 2.7004e-01 1.0 4.06e+07 1.0 4.2e+02 1.2e+04 4.0e+01 7 15 8 8 9 7 15 8 8 9 1202
>>>> MatGetRedundant 1 1.0 7.9393e-04 1.0 0.00e+00 0.0 1.7e+02 7.1e+03 4.0e+00 0 0 3 2 1 0 0 3 2 1 0
>>>> MatGetLocalMat 4 1.0 3.9521e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 1 0 0 0 2 1 0 0 0 2 0
>>>> MatGetBrAoCol 4 1.0 1.7719e-02 1.0 0.00e+00 0.0 4.3e+02 2.7e+04 8.0e+00 0 0 8 18 2 0 0 8 18 2 0
>>>> MatGetSymTrans 8 1.0 1.3007e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> KSPSetUp 7 1.0 1.3097e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01 0 0 0 0 5 0 0 0 0 5 0
>>>> KSPSolve 2 1.0 1.0450e+00 1.0 2.04e+08 1.0 3.4e+03 1.2e+04 7.5e+01 26 77 62 60 16 26 77 62 60 16 1563
>>>> PCSetUp 1 1.0 8.6248e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 21 23 35 32 69 21 23 35 32 69 576
>>>> PCApply 6 1.0 8.4384e-01 1.0 1.61e+08 1.0 3.2e+03 9.0e+03 4.8e+01 21 60 59 44 10 21 60 59 44 11 1523
>>> Do you know why there are 6 PCApply events? With four iterations of the
>>> Krylov method, there should be only 5 events. Oh, it looks like you do
>>> two solves. Is one of those with a different system?
>>>
>>> Recall that the old KSPSolve time was over 3.35 seconds.
>>>
>>>> -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg
>>>> -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson
>>>> -mg_levels_ksp_max_it 1 -pc_mg_type full
>>>>
>>>> 0 KSP Residual norm 3.654533581988e-05
>>>> 1 KSP Residual norm 8.730776244351e-07
>>>> 2 KSP Residual norm 3.474626061661e-08
>>>> 3 KSP Residual norm 1.813665557493e-09
>>> This converges slightly faster, but ends up not paying off.
>>>
>>>> Time (sec): 4.261e+00 1.00012 4.261e+00
>>>> Objects: 2.950e+02 1.00000 2.950e+02
>>>> Flops: 3.322e+08 1.00000 3.322e+08 2.658e+09
>>>> Flops/sec: 7.797e+07 1.00012 7.796e+07 6.237e+08
>>>> MPI Messages: 1.442e+03 1.00000 1.442e+03 1.154e+04
>>>> MPI Message Lengths: 1.018e+07 1.00000 7.057e+03 8.141e+07
>>>> MPI Reductions: 5.460e+02 1.00000
>>> More messages, more work, etc., so not better.
>>>
>>>> KSPSolve 2 1.0 1.2287e+00 1.0 2.70e+08 1.0 9.5e+03 5.8e+03 1.6e+02 29 81 82 68 30 29 81 82 68 30 1758
>>>> PCSetUp 1 1.0 8.6414e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 20 19 17 26 58 20 19 17 26 58 575
>>>> PCApply 5 1.0 1.0571e+00 1.0 2.33e+08 1.0 9.3e+03 4.9e+03 1.4e+02 24 70 81 56 26 24 70 81 56 26 1764
>>> It's still entirely possible that you can make Full MG beat V-cycles,
>>> especially if you only need to converge up to discretization error. By
>>> my figures, your good solver takes 12 work units to converge well below
>>> discretization error (after Galerkin setup, but maybe you only need to
>>> do that once?). If you only need to equal truncation error, this can be
>>> brought down to about 5 (probably at best a 2x speedup in parallel).
>>> This would involve a high-order (cubic) FMG prolongation.
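>>> One way to experiment with that (a rough sketch only; BuildCubicInterpolation
>>> is a routine you would have to write yourself, it is not part of PETSc) is to
>>> switch the cycle type to full MG and supply your own interpolation per level:
>>>
>>>   PC       pc;
>>>   Mat      P;
>>>   PetscInt l, nlevels = 5;
>>>
>>>   KSPGetPC(ksp, &pc);
>>>   PCMGSetType(pc, PC_MG_FULL);            /* same as -pc_mg_type full */
>>>   for (l = 1; l < nlevels; l++) {
>>>     BuildCubicInterpolation(l, &P);       /* user routine: cubic prolongation from level l-1 to l */
>>>     PCMGSetInterpolation(pc, l, P);
>>>     MatDestroy(&P);                       /* PCMG keeps its own reference */
>>>   }
>>>
>>> Note that by default the restriction is taken as the transpose of the
>>> interpolation you provide.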
>>>
>>> Alternatively, you can speed up the implementation (and significantly
>>> reduce memory usage) by creating geometric coarse levels and a
>>> matrix-free implementation of MatSOR and MatMult. (The matrices are
>>> great for experimenting, but if this solver is mission critical and
>>> still a bottleneck, the matrix is an inefficient way to represent the
>>> operator since it has very low arithmetic intensity/requires a lot of
>>> memory bandwidth.) I predict you can probably speed up the solve by
>>> perhaps another factor of 2 with a good matrix-free FMG implementation.
>>> Do you want to go down this path?
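>>> (For concreteness, the skeleton would be a MATSHELL whose MatMult and MatSOR
>>> apply the stencil directly. This is only a sketch of the idea: UserMult and
>>> UserSOR are routines you would write, and the sizes come from your DMDA.
>>>
>>>   PetscErrorCode UserMult(Mat A, Vec x, Vec y);   /* applies the 7-point stencil */
>>>   PetscErrorCode UserSOR(Mat A, Vec b, PetscReal omega, MatSORType flag,
>>>                          PetscReal shift, PetscInt its, PetscInt lits, Vec x);
>>>
>>>   Mat      A;
>>>   void    *ctx;                  /* coefficients, grid info, ... */
>>>   PetscInt nlocal, N;            /* local and global sizes from the DMDA */
>>>
>>>   MatCreateShell(PETSC_COMM_WORLD, nlocal, nlocal, N, N, ctx, &A);
>>>   MatShellSetOperation(A, MATOP_MULT, (void (*)(void))UserMult);
>>>   MatShellSetOperation(A, MATOP_SOR,  (void (*)(void))UserSOR);
>>>   KSPSetOperators(ksp, A, A);
>>>
>>> With a shell matrix you can no longer use -pc_mg_galerkin; the coarse
>>> operators would come from the geometric hierarchy instead.)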