<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <font face="Ubuntu">Jed,<br>

      <br>

      thank you. I will proceed with the multiphase test case and  I

      will let you know how it goes and eventually, if possible, try the

      matrix-free approach.<br>

      Nevertheless I would like to give a try  FMG with high-order

      prolongation: could you explain me how I can do that?<br>

      Thank you.<br>

      <br>

      Michele<br>

      <br>

    </font>

    <div class="moz-cite-prefix">On 08/14/2013 04:35 AM, Jed Brown

      wrote:<br>

    </div>

    <blockquote cite="mid:874nasld2x.fsf@mcs.anl.gov" type="cite">

      <pre wrap="">Please always use "reply-all" so that your messages go to the list.

This is standard mailing list etiquette.  It is important to preserve

threading for people who find this discussion later and so that we do

not waste our time re-answering the same questions that have already

been answered in private side-conversations.  You'll likely get an

answer faster that way too.

Michele Rosso <a class="moz-txt-link-rfc2396E" href="mailto:mrosso@uci.edu"><mrosso@uci.edu></a> writes:

</pre>

      <blockquote type="cite">

        <pre wrap="">Jed,

thank you very much for the detailed analysis.

I confirm that

  -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg

   -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson

   -mg_levels_ksp_max_it 1

results in a faster solve with the 256^3 grid (it finally beats CG + ICC).

Yes, I perform different solves: I setup matrix and KSP only once at the beginning of the run and then I re-use them

at each time step. The rhs term changes during the simulation though. For now (single phase flow) the matrix does not change,

but I will be dealing soon with a multiphase flow and thus not even the matrix values will be constant in time (it will be a variable coefficients Poisson Equation).

</pre>

      </blockquote>

      <pre wrap="">

Okay, the coefficient variation from the multiphase flow can drastically

change the characteristics of the solve.

</pre>

      <blockquote type="cite">

        <pre wrap="">I need to solve up to the discretization error, so maybe FMG is worth a try.

The matrix-free approach is appealing given that the Poisson solver is really mission critical (basically it accounts for most of the simulation time).

I will use the level set method and ghost fluid method in order to account for the discontinuities at the interface between phases: the computation of the matrix and rhs

values will be influenced by such methods so my only concern is to be sure matrix-free can be used in these circumstances.

</pre>

      </blockquote>

      <pre wrap="">

Matrix-free can be used in principle, but those problems can be several

orders of magnitude more ill-conditioned, so don't invest any more time

on it right now.  Get the discretization set up using assembled

matrices, then go through the options we've tried to find an efficient

solver.  The best choice will likely depend on the details of the

formulation, the types of fluids involved, and the geometric

configuration of the fluids.

</pre>

      <blockquote type="cite">

        <pre wrap="">I do not have any prior experience with matrix-free methods so I will have to rely on your assistance for this.

Thank you very much.

Michele

On 08/13/2013 08:36 PM, Jed Brown wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">Michele Rosso <a class="moz-txt-link-rfc2396E" href="mailto:mrosso@uci.edu"><mrosso@uci.edu></a> writes:

</pre>

          <blockquote type="cite">

            <pre wrap="">Hi Jed,

I attached the output for both the runs you suggested. At the beginning

of each file I included the options I used.

On a side note, I tried to run with a grid of 256^3 (exactly as before)

but with less levels, i.e. 3 instead of 4 or 5.

My system stops the run because of an Out Of Memory condition. It is

really odd since I have not changed anything except

- pc_mg_levels.  I cannot send you any output since there is none. Do

you have any guess where the problem comes from?

</pre>

          </blockquote>

          <pre wrap="">The selected algorithm does a direct solve on the coarse grid.  Each

time you reduce the number of levels, the coarse grid size grows by a

factor of 8.  Going from 5 to 3 levels is going from a 16^3 coarse grid

to a 64^3 coarse grid.  Applying a direct solver to the latter ends up

using a lot of memory.  I think this is not worth bothering with and it

might even be (slightly) faster to use 6 levels.  That is not where the

time is being spent.

</pre>

          <blockquote type="cite">

            <pre wrap="">   -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg

   -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson

   -mg_levels_ksp_max_it 1

   0 KSP Residual norm 3.653965664551e-05

   1 KSP Residual norm 1.910638846094e-06

   2 KSP Residual norm 8.690440116045e-08

   3 KSP Residual norm 3.732213639394e-09

   4 KSP Residual norm 1.964855338020e-10

</pre>

          </blockquote>

          <pre wrap="">This converges well.

</pre>

          <blockquote type="cite">

            <pre wrap="">                          Max       Max/Min        Avg      Total

Time (sec):           4.048e+00      1.00012   4.048e+00

Objects:              2.490e+02      1.00000   2.490e+02

Flops:                2.663e+08      1.00000   2.663e+08  2.130e+09

Flops/sec:            6.579e+07      1.00012   6.579e+07  5.263e+08

MPI Messages:         6.820e+02      1.00000   6.820e+02  5.456e+03

MPI Message Lengths:  8.245e+06      1.00000   1.209e+04  6.596e+07

MPI Reductions:       4.580e+02      1.00000

VecTDot               12 1.0 2.9428e-02 1.2 6.29e+06 1.0 0.0e+00 0.0e+00 1.2e+01  1  2  0  0  3   1  2  0  0  3  1710

VecNorm                9 1.0 1.0796e-02 1.2 4.72e+06 1.0 0.0e+00 0.0e+00 9.0e+00  0  2  0  0  2   0  2  0  0  2  3497

VecScale              24 1.0 2.4652e-04 1.1 1.99e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  6442

VecCopy                3 1.0 5.0740e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

VecSet               116 1.0 1.4349e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

VecAXPY               12 1.0 2.8027e-02 1.0 6.29e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  1796

VecAYPX               29 1.0 3.0655e-02 1.4 4.16e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  1085

VecScatterBegin      123 1.0 3.5391e-02 1.1 0.00e+00 0.0 3.5e+03 1.2e+04 0.0e+00  1  0 65 66  0   1  0 65 66  0     0

VecScatterEnd        123 1.0 2.5395e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

MatMult               31 1.0 2.3556e-01 1.0 5.62e+07 1.0 1.0e+03 2.3e+04 0.0e+00  6 21 19 36  0   6 21 19 36  0  1908

MatMultAdd            24 1.0 5.9044e-02 1.0 1.21e+07 1.0 5.8e+02 2.8e+03 0.0e+00  1  5 11  2  0   1  5 11  2  0  1644

MatMultTranspose      28 1.0 7.4601e-02 1.1 1.42e+07 1.0 6.7e+02 2.8e+03 0.0e+00  2  5 12  3  0   2  5 12  3  0  1518

MatSolve               6 1.0 3.8311e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  3006

MatSOR                48 1.0 5.8050e-01 1.0 1.01e+08 1.0 8.6e+02 1.5e+04 4.8e+01 14 38 16 19 10  14 38 16 19 11  1390

</pre>

          </blockquote>

          <pre wrap="">Most of the solve time is in MatSOR and MatMult.  That's expected since

the subdomains are pretty big.

</pre>

          <blockquote type="cite">

            <pre wrap="">MatLUFactorSym         1 1.0 3.0620e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  1   0  0  0  0  1     0

MatLUFactorNum         1 1.0 2.4665e-02 1.0 1.95e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  7  0  0  0   1  7  0  0  0  6329

MatAssemblyBegin      20 1.0 2.4351e-02 6.7 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+01  0  0  0  0  5   0  0  0  0  5     0

MatAssemblyEnd        20 1.0 1.3176e-01 1.0 0.00e+00 0.0 5.6e+02 2.1e+03 7.2e+01  3  0 10  2 16   3  0 10  2 16     0

MatGetRowIJ            1 1.0 1.1516e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

MatGetOrdering         1 1.0 4.1008e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0

MatView               16 1.3 1.0209e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  3   0  0  0  0  3     0

MatPtAP                4 1.0 6.4001e-01 1.0 4.06e+07 1.0 1.1e+03 1.7e+04 1.0e+02 16 15 21 30 22  16 15 21 30 22   507

</pre>

          </blockquote>

          <pre wrap="">MatPtAP dominates the setup time.  For profiling, you could register a

stage (PetscLogStageRegister) and time the setup separately from the

solve.

</pre>

          <blockquote type="cite">

            <pre wrap="">MatPtAPSymbolic        4 1.0 3.7003e-01 1.0 0.00e+00 0.0 7.2e+02 2.0e+04 6.0e+01  9  0 13 22 13   9  0 13 22 13     0

MatPtAPNumeric         4 1.0 2.7004e-01 1.0 4.06e+07 1.0 4.2e+02 1.2e+04 4.0e+01  7 15  8  8  9   7 15  8  8  9  1202

MatGetRedundant        1 1.0 7.9393e-04 1.0 0.00e+00 0.0 1.7e+02 7.1e+03 4.0e+00  0  0  3  2  1   0  0  3  2  1     0

MatGetLocalMat         4 1.0 3.9521e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  1  0  0  0  2   1  0  0  0  2     0

MatGetBrAoCol          4 1.0 1.7719e-02 1.0 0.00e+00 0.0 4.3e+02 2.7e+04 8.0e+00  0  0  8 18  2   0  0  8 18  2     0

MatGetSymTrans         8 1.0 1.3007e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

KSPSetUp               7 1.0 1.3097e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01  0  0  0  0  5   0  0  0  0  5     0

KSPSolve               2 1.0 1.0450e+00 1.0 2.04e+08 1.0 3.4e+03 1.2e+04 7.5e+01 26 77 62 60 16  26 77 62 60 16  1563

PCSetUp                1 1.0 8.6248e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 21 23 35 32 69  21 23 35 32 69   576

PCApply                6 1.0 8.4384e-01 1.0 1.61e+08 1.0 3.2e+03 9.0e+03 4.8e+01 21 60 59 44 10  21 60 59 44 11  1523

</pre>

          </blockquote>

          <pre wrap="">Do you know why there are 6 PCApply events?  With four iterations of the

Krylov method, there should be only 5 events.  Oh, it looks like you do

two solves.  Is one of those with a different system?

Recall that the old KSPSolve time was over 3.35 seconds.

</pre>

          <blockquote type="cite">

            <pre wrap="">-log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg

-pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson

-mg_levels_ksp_max_it 1 -pc_mg_type full

   0 KSP Residual norm 3.654533581988e-05

   1 KSP Residual norm 8.730776244351e-07

   2 KSP Residual norm 3.474626061661e-08

   3 KSP Residual norm 1.813665557493e-09

</pre>

          </blockquote>

          <pre wrap="">This converges slightly faster, but ends up not paying off.

</pre>

          <blockquote type="cite">

            <pre wrap="">Time (sec):           4.261e+00      1.00012   4.261e+00

Objects:              2.950e+02      1.00000   2.950e+02

Flops:                3.322e+08      1.00000   3.322e+08  2.658e+09

Flops/sec:            7.797e+07      1.00012   7.796e+07  6.237e+08

MPI Messages:         1.442e+03      1.00000   1.442e+03  1.154e+04

MPI Message Lengths:  1.018e+07      1.00000   7.057e+03  8.141e+07

MPI Reductions:       5.460e+02      1.00000

</pre>

          </blockquote>

          <pre wrap="">More messages, more work, etc., so not better.

</pre>

          <blockquote type="cite">

            <pre wrap="">KSPSolve               2 1.0 1.2287e+00 1.0 2.70e+08 1.0 9.5e+03 5.8e+03 1.6e+02 29 81 82 68 30  29 81 82 68 30  1758

PCSetUp                1 1.0 8.6414e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 3.2e+02 20 19 17 26 58  20 19 17 26 58   575

PCApply                5 1.0 1.0571e+00 1.0 2.33e+08 1.0 9.3e+03 4.9e+03 1.4e+02 24 70 81 56 26  24 70 81 56 26  1764

</pre>

          </blockquote>

          <pre wrap="">It's still entirely possible that you can make Full MG beat V-cycles,

especially if you only need to converge up to discretization error.  By

my figures, your good solver takes 12 work units to converge well below

discretization error (after Galerkin setup, but maybe you only need to

do that once?).  If you only need to equal truncation error, this can be

brought down to about 5 (probably at best a 2x speedup in parallel).

This would involve a high-order (cubic) FMG prolongation.

Alternatively, you can speed up the implementation (and significantly

reduce memory usage) by creating geometric coarse levels and a

matrix-free implementation of MatSOR and MatMult.  (The matrices are

great for experimenting, but if this solver is mission critical and

still a bottleneck, the matrix is an inefficient way to represent the

operator since it has very low arithmetic intensity/requires a lot of

memory bandwidth.)  I predict you can probably speed up the solve by

perhaps another factor of 2 with a good matrix-free FMG implementation.

Do you want to go down this path?

</pre>

        </blockquote>

      </blockquote>

    </blockquote>

    <br>

  </body>

</html>