[petsc-dev] New implementation of PtAP based on all-at-once algorithm

Fande Kong fdkong.jd at gmail.com
Mon Apr 15 14:07:49 CDT 2019


On Mon, Apr 15, 2019 at 8:41 AM Mark Adams <mfadams at lbl.gov> wrote:

>
>> I guess you are interested in the performance of the new algorithms on
>>  small problems. I will try to test a petsc example such as
>> mat/examples/tests/ex96.c.
>>
>
> It's not a big deal. And the fact that they are similar on one node tells
> us the kernels are similar.
>
>
>>
>>
>>>
>>> And are you sure the numerics are the same with and without hypre? Hypre
>>> is 15x slower. Any ideas what is going on?
>>>
>>
>> Hypre performs pretty well when the number of processor cores is small (a
>> couple of hundred).  I guess the issue is related to how they handle the
>> communication.
>>
>>
>>>
>>> It might be interesting to scale this test down to a node to see if this
>>> is from communication.
>>>
>>
> I wonder if their symbolic setup is getting called every time. You do
> 50 solves, it looks like, and that should be enough to amortize a one-time
> setup cost.
>

Hypre does not have a concept of a symbolic phase. It does everything from
scratch and does not reuse any data.
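
To make the symbolic/numeric distinction concrete, here is a minimal sketch of
the PETSc side (illustrative code only, not the exact calls our application
makes; the function name and the firstTime flag are made up). The first
MatPtAP call with MAT_INITIAL_MATRIX performs both the symbolic and the
numeric phases; later calls with MAT_REUSE_MATRIX redo only the numeric
phase, which is what amortizes the one-time setup cost over repeated solves:

#include <petscmat.h>

/* Sketch: compute/update the Galerkin product C = P^T * A * P.
   A and P are assumed to be assembled MPIAIJ matrices. */
PetscErrorCode GalerkinPtAP(Mat A, Mat P, Mat *C, PetscBool firstTime)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (firstTime) {
    /* symbolic + numeric: allocates C and fills in its values */
    ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  } else {
    /* reuse the existing sparsity pattern of C; recompute values only */
    ierr = MatPtAP(A, P, MAT_REUSE_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

This is also consistent with the numbers I posted earlier: for hypre
essentially all of the MatPtAP time shows up in MatPtAPNumeric, since there is
no symbolic structure to reuse.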

I have only 5 solves, and each solve uses a ten-level method.  The solver
details are shown below (it is not GAMG); it is a customized and tuned
solver.

SNES Object: 10000 MPI processes
  type: newtonls
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-06, absolute=1e-06, solution=1e-50
  total number of linear solver iterations=42
  total number of function evaluations=8
  norm schedule ALWAYS
  Preconditioned is rebuilt every 10 new Jacobians
  SNESLineSearch Object: 10000 MPI processes
    type: basic
    maxstep=1.000000e+08, minlambda=1.000000e-12
    tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08
    maximum iterations=40
  KSP Object: 10000 MPI processes
    type: fgmres
      restart=200, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
      happy breakdown tolerance 1e-30
    maximum iterations=400, initial guess is zero
    tolerances:  relative=0.5, absolute=1e-50, divergence=1e+100
    right preconditioning
    using UNPRECONDITIONED norm type for convergence test
  PC Object: 10000 MPI processes
    type: mg
     Reuse interpolation 1
      type is MULTIPLICATIVE, levels=11 cycles=v
        Cycles per PCApply=1
        Using Galerkin computed coarse grid matrices for pmat
    Coarse grid solver -- level -------------------------------
      KSP Object: (mg_coarse_) 10000 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_coarse_) 10000 MPI processes
        type: redundant
          First (color=0) of 250 PCs follows
          KSP Object: (mg_coarse_redundant_) 40 MPI processes
            type: preonly
            maximum iterations=10000, initial guess is zero
            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
            left preconditioning
            using NONE norm type for convergence test
          PC Object: (mg_coarse_redundant_) 40 MPI processes
            type: lu
              out-of-place factorization
              tolerance for zero pivot 2.22045e-14
              using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
              matrix ordering: natural
              factor fill ratio given 0., needed 0.
                Factored matrix follows:
                  Mat Object: 40 MPI processes
                    type: superlu_dist
                    rows=14976, cols=14976
                    package used to perform factorization: superlu_dist
                    total: nonzeros=0, allocated nonzeros=0
                    total number of mallocs used during MatSetValues calls =0
                      SuperLU_DIST run parameters:
                        Process grid nprow 5 x npcol 8
                        Equilibrate matrix TRUE
                        Matrix input mode 1
                        Replace tiny pivots FALSE
                        Use iterative refinement FALSE
                        Processors in row 5 col partition 8
                        Row permutation LargeDiag_MC64
                        Column permutation METIS_AT_PLUS_A
                        Parallel symbolic factorization FALSE
                        Repeated factorization SamePattern
            linear system matrix = precond matrix:
            Mat Object: 40 MPI processes
              type: mpiaij
              rows=14976, cols=14976
              total: nonzeros=105984, allocated nonzeros=105984
              total number of mallocs used during MatSetValues calls =0
                not using I-node (on process 0) routines
        linear system matrix = precond matrix:
        Mat Object: 10000 MPI processes
          type: mpiaij
          rows=14976, cols=14976
          total: nonzeros=105984, allocated nonzeros=105984
          total number of mallocs used during MatSetValues calls =0
            not using I-node (on process 0) routines
    Down solver (pre-smoother) on level 1 -------------------------------
      KSP Object: (mg_levels_1_) 10000 MPI processes
        type: chebyshev
          eigenvalue estimates used:  min = 0.16808, max = 1.84888
          eigenvalues estimate via gmres min 0.340446, max 1.6808
          eigenvalues estimated using gmres with translations  [0. 0.1; 0. 1.1]
          KSP Object: (mg_levels_1_esteig_) 10000 MPI processes
            type: gmres
              restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
              happy breakdown tolerance 1e-30
            maximum iterations=10, initial guess is zero
            tolerances:  relative=1e-12, absolute=1e-50, divergence=10000.
            left preconditioning
            using PRECONDITIONED norm type for convergence test
          estimating eigenvalues using noisy right hand side
        maximum iterations=2, nonzero initial guess
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using


>
> Does PETSc do any clever scalability tricks? You just pack and send point
> to point messages, I would think, but maybe Hypre is doing something bad. I
> have seen Hypre scale out to large machines, but on synthetic problems.
>

I honestly do not know. I am not an expert on hypre.


>
> So this is a realistic problem. Can you run with -info and grep on GAMG
> and send me the output (~20 lines). You will be able to see info about
> each level, like number of equations and average nnz/row.
>

It is not GAMG. It is a hybrid of hypre and PETSc: we coarsen only one
variable (instead of 100 or 1000 variables), and then the interpolation is
expanded to cover all the variables. This makes the AMG setup fast, and it
also lets us take advantage of PETSc preconditioners such as bjacobi or ASM.
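
For what it is worth, below is a rough sketch of what that interpolation
expansion can look like on the PETSc side (illustrative only; Pscalar, ndof,
and the function name are made up, and our actual code differs). Wrapping the
single-variable interpolation in a MAIJ matrix applies the same interpolation
weights to every variable:

#include <petscmat.h>

/* Sketch: expand a single-variable interpolation to all variables by
   wrapping it in a MAIJ matrix, which applies the same interpolation
   weights to each of the ndof components per node. */
PetscErrorCode ExpandInterpolation(Mat Pscalar, PetscInt ndof, Mat *Pfull)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreateMAIJ(Pscalar, ndof, Pfull);CHKERRQ(ierr);
  /* *Pfull can then be handed to PCMG as a level interpolation, e.g. via
     PCMGSetInterpolation(pc, level, *Pfull), while the level smoothers stay
     ordinary PETSc preconditioners such as bjacobi or ASM. */
  PetscFunctionReturn(0);
}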

Fande,


>
>
>>
>> Hypre performs similarly to PETSc on a single compute node.
>>
>>
>> Fande,
>>
>>
>>>
>>> Again, nice work,
>>> Mark
>>>
>>>
>>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>>
>>>> Hi Developers,
>>>>
>>>> I just want to share some good news.  It is known that the PETSc
>>>> "ptap-scalable" algorithm takes too much memory for some applications
>>>> because it needs to build intermediate data structures.  Following Mark's
>>>> suggestions, I implemented the all-at-once algorithm, which does not cache
>>>> any intermediate data.
>>>>
>>>> I did some comparisons; the new implementation is actually scalable in
>>>> terms of memory usage and compute time, even though it is still slower
>>>> than "ptap-scalable".  Some memory profiling results are attached. The new
>>>> all-at-once implementation uses a similar amount of memory as hypre, but
>>>> it is way faster than hypre.
>>>>
>>>> For example, for a problem with 14,893,346,880 unknowns using 10,000
>>>> processor cores, here are the timing results:
>>>>
>>>> Hypre algorithm:
>>>>
>>>> MatPtAP               50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>> MatPtAPSymbolic       50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatPtAPNumeric        50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>>
>>>> PETSc scalable PtAP:
>>>>
>>>> MatPtAP               50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>>> MatPtAPSymbolic       50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9     0
>>>> MatPtAPNumeric        50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>>
>>>> New implementation of the all-at-once algorithm:
>>>>
>>>> MatPtAP               50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17     0
>>>> MatPtAPSymbolic       50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6     0
>>>> MatPtAPNumeric        50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11     0
>>>>
>>>>
>>>> You can see here that the all-at-once algorithm is a bit slower than
>>>> ptap-scalable, but it uses much less memory.
>>>>
>>>>
>>>> Fande
>>>>
>>>>
>>>