[petsc-dev] New implementation of PtAP based on all-at-once algorithm
Fande Kong
fdkong.jd at gmail.com
Mon Apr 15 14:07:49 CDT 2019
On Mon, Apr 15, 2019 at 8:41 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> I guess you are interested in the performance of the new algorithms on
>> small problems. I will try to test a petsc example such as
>> mat/examples/tests/ex96.c.
>>
>
> It's not a big deal. And the fact that they are similar on one node tells
> us the kernels are similar.
>
>
>>
>>
>>>
>>> And are you sure the numerics are the same with and without hypre? Hypre
>>> is 15x slower. Any ideas what is going on?
>>>
>>
>> Hypre performs pretty well when the number of processor cores is small (a
>> couple of hundred). I suspect the issue is related to how they handle the
>> communication.
>>
>>
>>>
>>> It might be interesting to scale this test down to a node to see if this
>>> is from communication.
>>>
>>
> I wonder if their symbolic setup is getting called every time. You do
> 50 solves, it looks like, and that should be enough to amortize a one-time
> setup cost.
>
Hypre does not have a concept of a symbolic phase; it does everything from
scratch and does not reuse any data.
I have only 5 solves, and each solve uses a ten-level method. The solver
details are shown below (it is not GAMG); it is a customized and tuned
solver.
SNES Object: 10000 MPI processes
type: newtonls
maximum iterations=50, maximum function evaluations=10000
tolerances: relative=1e-06, absolute=1e-06, solution=1e-50
total number of linear solver iterations=42
total number of function evaluations=8
norm schedule ALWAYS
Preconditioner is rebuilt every 10 new Jacobians
SNESLineSearch Object: 10000 MPI processes
type: basic
maxstep=1.000000e+08, minlambda=1.000000e-12
tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08
maximum iterations=40
KSP Object: 10000 MPI processes
type: fgmres
restart=200, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=400, initial guess is zero
tolerances: relative=0.5, absolute=1e-50, divergence=1e+100
right preconditioning
using UNPRECONDITIONED norm type for convergence test
PC Object: 10000 MPI processes
type: mg
Reuse interpolation 1
type is MULTIPLICATIVE, levels=11 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices for pmat
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 10000 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 10000 MPI processes
type: redundant
First (color=0) of 250 PCs follows
KSP Object: (mg_coarse_redundant_) 40 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_redundant_) 40 MPI processes
type: lu
out-of-place factorization
tolerance for zero pivot 2.22045e-14
using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
matrix ordering: natural
factor fill ratio given 0., needed 0.
Factored matrix follows:
Mat Object: 40 MPI processes
type: superlu_dist
rows=14976, cols=14976
package used to perform factorization: superlu_dist
total: nonzeros=0, allocated nonzeros=0
total number of mallocs used during MatSetValues calls =0
SuperLU_DIST run parameters:
Process grid nprow 5 x npcol 8
Equilibrate matrix TRUE
Matrix input mode 1
Replace tiny pivots FALSE
Use iterative refinement FALSE
Processors in row 5 col partition 8
Row permutation LargeDiag_MC64
Column permutation METIS_AT_PLUS_A
Parallel symbolic factorization FALSE
Repeated factorization SamePattern
linear system matrix = precond matrix:
Mat Object: 40 MPI processes
type: mpiaij
rows=14976, cols=14976
total: nonzeros=105984, allocated nonzeros=105984
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
linear system matrix = precond matrix:
Mat Object: 10000 MPI processes
type: mpiaij
rows=14976, cols=14976
total: nonzeros=105984, allocated nonzeros=105984
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 10000 MPI processes
type: chebyshev
eigenvalue estimates used: min = 0.16808, max = 1.84888
eigenvalues estimate via gmres min 0.340446, max 1.6808
eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
KSP Object: (mg_levels_1_esteig_) 10000 MPI processes
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10, initial guess is zero
tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
estimating eigenvalues using noisy right hand side
maximum iterations=2, nonzero initial guess
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using
>
> Does PETSc do any clever scalability tricks? You just pack and send
> point-to-point messages, I would think, but maybe Hypre is doing something
> bad. I have seen Hypre scale out to large machines, but on synthetic
> problems.
>
I honestly do not know. I am not an expert on hypre.
>
> So this is a realistic problem. Can you run with -info, grep on GAMG,
> and send me the (~20 lines of) output? You will be able to see info about
> each level, like the number of equations and the average nnz/row.
>
It is not GAMG. It is a hybrid of hypre and PETSc. We simply coarsen one
variable (instead of 100 or 1000 variables), and then the interpolation is
expanded to cover all the variables. This approach makes the AMG setup fast,
and also takes advantage of PETSc preconditioners such as bjacobi or ASM.
Fande,
>
>
>>
>> Hypre performs similarly to PETSc on a single compute node.
>>
>>
>> Fande,
>>
>>
>>>
>>> Again, nice work,
>>> Mark
>>>
>>>
>>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>>
>>>> Hi Developers,
>>>>
>>>> I just want to share some good news. It is known that PETSc's
>>>> ptap-scalable algorithm takes too much memory for some applications
>>>> because it needs to build intermediate data structures. Following Mark's
>>>> suggestions, I implemented an all-at-once algorithm that does not cache
>>>> any intermediate data.
>>>>
>>>> I did some comparisons; the new implementation is actually scalable in
>>>> terms of memory usage and compute time, even though it is still slower
>>>> than "ptap-scalable". Some memory profiling results are in the
>>>> attachments. The new all-at-once implementation uses a similar amount of
>>>> memory as hypre, but it is way faster than hypre.
>>>>
>>>> For example, for a problem with 14,893,346,880 unknowns on 10,000
>>>> processor cores, here are the timing results:
>>>>
>>>> Hypre algorithm:
>>>>
>>>> MatPtAP         50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33 0 1 0 17 33 0 1 0 17      0
>>>> MatPtAPSymbolic 50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0 0 0 0  0  0 0 0 0  0      0
>>>> MatPtAPNumeric  50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33 0 1 0 17 33 0 1 0 17      0
>>>>
>>>> PETSc scalable PtAP:
>>>>
>>>> MatPtAP         50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2 1 4 6 20  2 1 4 6 20 129418
>>>> MatPtAPSymbolic 50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1 0 3 3  9  1 0 3 3  9      0
>>>> MatPtAPNumeric  50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1 1 2 4 11  1 1 2 4 11 235011
>>>>
>>>> New implementation of the all-at-once algorithm:
>>>>
>>>> MatPtAP         50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4 0 7 7 17  4 0 7 7 17      0
>>>> MatPtAPSymbolic 50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2 0 5 4  6  2 0 5 4  6      0
>>>> MatPtAPNumeric  50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2 0 2 3 11  2 0 2 3 11      0
>>>>
>>>>
>>>> You can see here that the all-at-once algorithm is a bit slower than
>>>> ptap-scalable, but it uses much less memory.
>>>>
>>>>
>>>> Fande
>>>>
>>>>
>>>