[petsc-dev] EXTERNAL: Re: Do you have any suggestion in order to reduce my memory footprint?

Barry Smith bsmith at mcs.anl.gov
Wed Oct 22 18:32:45 CDT 2014


> On Oct 22, 2014, at 2:58 PM, Alletto, John M <john.m.alletto at lmco.com> wrote:
> 
> Hi Barry,
> 
> When I get my code back up and running and execute the -ksp_view how will I recognize it?

   You’ll know it when you see it.
> 
> I am new to PETSc, somehow my code broke, working on it.
> I am building off of example   ./ex45 
> I do use a stw = 2 and call  DMDACreate3d( ....dof,stw, 0, 0, 0, &da

  Box or star stencil? The box requires a HUGE amount more memory.

  So a stencil width of 2 means a higher-order finite difference scheme? That will give the matrices many more nonzeros per row than a stencil width of one and thus use a lot more memory. Also, for the best multigrid convergence you may need more than the default piecewise linear interpolation between levels. Or you might consider using the 4th order scheme on the finest mesh and 2nd order operators on all the coarser levels.
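
  For reference, here is a stripped-down sketch (not your code; the grid sizes are made up, and the call follows the standard DMDACreate3d() signature) showing where the stencil type and width go. With a width of 2 the DMDA-preallocated matrix gets roughly 6*2+1 = 13 nonzeros per row for the star stencil versus (2*2+1)^3 = 125 for the box, which is where the big memory difference comes from.

#include <petscdmda.h>

int main(int argc,char **argv)
{
  DM             da;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
  /* Star stencil, width 2: ghost points only along the coordinate directions,
     and DMCreateMatrix() preallocates about 13 nonzeros per row.
     DMDA_STENCIL_BOX with the same width preallocates 125 per row. */
  ierr = DMDACreate3d(PETSC_COMM_WORLD,
                      DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR,
                      65,65,65,                               /* global grid size (illustrative) */
                      PETSC_DECIDE,PETSC_DECIDE,PETSC_DECIDE, /* processor layout                */
                      1,                                      /* dof                             */
                      2,                                      /* stencil width (your stw)        */
                      NULL,NULL,NULL,&da);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}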

> 
> I set the options as follows
> -da_refine 3
> -pc_type mg
> -pc_mg_type full
> KSPSOLVER is set to KSPLMGRES

  Do you mean KSPLGMRES? You could skip the Krylov solver completely and use -ksp_type richardson to save the storage of the GMRES vectors (this may slow the convergence a little bit). 

  By default the coarse grid is solved with a direct solver, so the smaller the coarse grid the less “extra” memory it uses.
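
  If that coarse-grid LU factorization ever becomes a memory problem you can change the coarse solve from the options database; for example (untested, purely to show the option names; an iterative coarse solve may slow the multigrid convergence, so check with -ksp_monitor):

$ ./ex45 -da_refine 5 -pc_type mg -pc_mg_type full -ksp_type richardson -mg_coarse_ksp_type richardson -mg_coarse_pc_type sor -mg_coarse_ksp_max_it 20 -ksp_monitor -memory_info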


> 
>> 
>> I have a 64 bit linux machine with 256 GByte of memory and 30 processors.

  Depending on the machine's details you will likely get the best performance using about half the cores. In the PETSc root directory run

    make streams NPMAX=32

and send us the results.


>> I am solving Laplace equations and the current solver I am using is taking every bit of memory I have available.
>> 
>> Do you have any suggestions (such as changing precision or changing solvers or something else) in order to reduce my memory footprint?

  What else is in the simulation? Surely you are not __just__ solving a Laplace equation. Is the rest of the simulation done explicitly, so there are no matrices for it? Note that in that case the errors from the operator splitting may be so large as to remove any benefit of using a 4th order discretization for the Laplacian.

  I made some runs to indicate how large a problem I can do on my laptop with 4 gigabytes of memory (note that in 3D each refinement of the mesh requires roughly 8 times more memory):

~/Src/petsc/src/ksp/ksp/examples/tutorials (maint=)
$ ./ex45 -da_refine 3 -pc_type mg -pc_mg_type full -ksp_monitor -ksp_type richardson -memory_info
  0 KSP Residual norm 3.264230720534e+02 
  1 KSP Residual norm 2.731900801719e+01 
  2 KSP Residual norm 3.524392472835e-01 
  3 KSP Residual norm 5.847297338171e-03 
  4 KSP Residual norm 1.932667627422e-04 
Residual norm 1.18508e-05
Summary of Memory Usage in PETSc
[0]Current space PetscMalloc()ed 48240, max space PetscMalloced() 5.20534e+07
[0]Current process memory 6.18824e+07 max process memory 6.18824e+07
~/Src/petsc/src/ksp/ksp/examples/tutorials (maint=)
$ ./ex45 -da_refine 3 -pc_type mg -pc_mg_type full -ksp_monitor -ksp_type gmres -memory_info
  0 KSP Residual norm 3.264230720534e+02 
  1 KSP Residual norm 1.667893748550e+01 
  2 KSP Residual norm 3.851411754724e-02 
  3 KSP Residual norm 7.837312289688e-04 
Residual norm 4.7335e-05
Summary of Memory Usage in PETSc
[0]Current space PetscMalloc()ed 48240, max space PetscMalloced() 6.43328e+07
[0]Current process memory 7.44161e+07 max process memory 7.44161e+07
~/Src/petsc/src/ksp/ksp/examples/tutorials (maint=)
$ ./ex45 -da_refine 5 -pc_type mg -pc_mg_type full -ksp_monitor -ksp_type gmres -memory_info
  0 KSP Residual norm 2.461787435384e+03 
  1 KSP Residual norm 9.812293053603e+01 
  2 KSP Residual norm 2.213037728680e-01 
  3 KSP Residual norm 4.320723869852e-03 
Residual norm 5.13901e-05
Summary of Memory Usage in PETSc
[0]Current space PetscMalloc()ed 49840, max space PetscMalloced() 3.89334e+09
[0]Current process memory 3.91453e+09 max process memory 3.91453e+09
~/Src/petsc/src/ksp/ksp/examples/tutorials (maint=)
$ ./ex45 -da_refine 5 -pc_type mg -pc_mg_type full -ksp_monitor -ksp_type richardson -memory_info
  0 KSP Residual norm 2.461787435384e+03 
  1 KSP Residual norm 2.819553940156e+02 
  2 KSP Residual norm 3.664748917714e+00 
  3 KSP Residual norm 5.791285265094e-02 
  4 KSP Residual norm 1.333000775504e-03 
Residual norm 7.3202e-06
Summary of Memory Usage in PETSc
[0]Current space PetscMalloc()ed 49840, max space PetscMalloced() 3.14564e+09
[0]Current process memory 3.16655e+09 max process memory 3.16655e+09
~/Src/petsc/src/ksp/ksp/examples/tutorials (maint=)
$ ./ex45 -da_refine 5 -pc_type mg -pc_mg_type full -ksp_monitor -ksp_type richardson -memory_info -ksp_view
  0 KSP Residual norm 2.461787435384e+03 
  1 KSP Residual norm 2.819553940156e+02 
  2 KSP Residual norm 3.664748917714e+00 
  3 KSP Residual norm 5.791285265094e-02 
  4 KSP Residual norm 1.333000775504e-03 
KSP Object: 1 MPI processes
  type: richardson
    Richardson: damping factor=1
  maximum iterations=10000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: mg
    MG: type is FULL, levels=6 cycles=v
      Not using Galerkin computed coarse grid matrices
  Coarse grid solver -- level -------------------------------
    KSP Object:    (mg_coarse_)     1 MPI processes
      type: preonly
      maximum iterations=1
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_coarse_)     1 MPI processes
      type: lu
        LU: out-of-place factorization
        tolerance for zero pivot 2.22045e-14
        using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
        matrix ordering: nd
        factor fill ratio given 5, needed 6.29283
          Factored matrix follows:
            Mat Object:             1 MPI processes
              type: seqaij
              rows=343, cols=343
              package used to perform factorization: petsc
              total: nonzeros=13259, allocated nonzeros=13259
              total number of mallocs used during MatSetValues calls =0
                not using I-node routines
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=343, cols=343
        total: nonzeros=2107, allocated nonzeros=2107
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Down solver (pre-smoother) on level 1 -------------------------------
    KSP Object:    (mg_levels_1_)     1 MPI processes
      type: chebyshev
        Chebyshev: eigenvalue estimates:  min = 0.0999994, max = 1.09999
        Chebyshev: estimated using:  [0 0.1; 0 1.1]
        KSP Object:        (mg_levels_1_est_)         1 MPI processes
          type: gmres
            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            GMRES: happy breakdown tolerance 1e-30
          maximum iterations=10
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
          left preconditioning
          using nonzero initial guess
          using NONE norm type for convergence test
        PC Object:        (mg_levels_1_)         1 MPI processes
          type: sor
            SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
          linear system matrix = precond matrix:
          Mat Object:           1 MPI processes
            type: seqaij
            rows=2197, cols=2197
            total: nonzeros=14365, allocated nonzeros=14365
            total number of mallocs used during MatSetValues calls =0
              not using I-node routines
      maximum iterations=2
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_levels_1_)     1 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=2197, cols=2197
        total: nonzeros=14365, allocated nonzeros=14365
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 2 -------------------------------
    KSP Object:    (mg_levels_2_)     1 MPI processes
      type: chebyshev
        Chebyshev: eigenvalue estimates:  min = 0.0999989, max = 1.09999
        Chebyshev: estimated using:  [0 0.1; 0 1.1]
        KSP Object:        (mg_levels_2_est_)         1 MPI processes
          type: gmres
            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            GMRES: happy breakdown tolerance 1e-30
          maximum iterations=10
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
          left preconditioning
          using nonzero initial guess
          using NONE norm type for convergence test
        PC Object:        (mg_levels_2_)         1 MPI processes
          type: sor
            SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
          linear system matrix = precond matrix:
          Mat Object:           1 MPI processes
            type: seqaij
            rows=15625, cols=15625
            total: nonzeros=105625, allocated nonzeros=105625
            total number of mallocs used during MatSetValues calls =0
              not using I-node routines
      maximum iterations=2
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_levels_2_)     1 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=15625, cols=15625
        total: nonzeros=105625, allocated nonzeros=105625
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 3 -------------------------------
    KSP Object:    (mg_levels_3_)     1 MPI processes
      type: chebyshev
        Chebyshev: eigenvalue estimates:  min = 0.0999994, max = 1.09999
        Chebyshev: estimated using:  [0 0.1; 0 1.1]
        KSP Object:        (mg_levels_3_est_)         1 MPI processes
          type: gmres
            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            GMRES: happy breakdown tolerance 1e-30
          maximum iterations=10
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
          left preconditioning
          using nonzero initial guess
          using NONE norm type for convergence test
        PC Object:        (mg_levels_3_)         1 MPI processes
          type: sor
            SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
          linear system matrix = precond matrix:
          Mat Object:           1 MPI processes
            type: seqaij
            rows=117649, cols=117649
            total: nonzeros=809137, allocated nonzeros=809137
            total number of mallocs used during MatSetValues calls =0
              not using I-node routines
      maximum iterations=2
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_levels_3_)     1 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=117649, cols=117649
        total: nonzeros=809137, allocated nonzeros=809137
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 4 -------------------------------
    KSP Object:    (mg_levels_4_)     1 MPI processes
      type: chebyshev
        Chebyshev: eigenvalue estimates:  min = 0.0999997, max = 1.1
        Chebyshev: estimated using:  [0 0.1; 0 1.1]
        KSP Object:        (mg_levels_4_est_)         1 MPI processes
          type: gmres
            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            GMRES: happy breakdown tolerance 1e-30
          maximum iterations=10
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
          left preconditioning
          using nonzero initial guess
          using NONE norm type for convergence test
        PC Object:        (mg_levels_4_)         1 MPI processes
          type: sor
            SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
          linear system matrix = precond matrix:
          Mat Object:           1 MPI processes
            type: seqaij
            rows=912673, cols=912673
            total: nonzeros=6.33226e+06, allocated nonzeros=6.33226e+06
            total number of mallocs used during MatSetValues calls =0
              not using I-node routines
      maximum iterations=2
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_levels_4_)     1 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=912673, cols=912673
        total: nonzeros=6.33226e+06, allocated nonzeros=6.33226e+06
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 5 -------------------------------
    KSP Object:    (mg_levels_5_)     1 MPI processes
      type: chebyshev
        Chebyshev: eigenvalue estimates:  min = 0.0999998, max = 1.1
        Chebyshev: estimated using:  [0 0.1; 0 1.1]
        KSP Object:        (mg_levels_5_est_)         1 MPI processes
          type: gmres
            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
            GMRES: happy breakdown tolerance 1e-30
          maximum iterations=10
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
          left preconditioning
          using nonzero initial guess
          using NONE norm type for convergence test
        PC Object:        (mg_levels_5_)         1 MPI processes
          type: sor
            SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
          linear system matrix = precond matrix:
          Mat Object:           1 MPI processes
            type: seqaij
            rows=7189057, cols=7189057
            total: nonzeros=5.00999e+07, allocated nonzeros=5.00999e+07
            total number of mallocs used during MatSetValues calls =0
              not using I-node routines
      maximum iterations=2
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object:    (mg_levels_5_)     1 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1
      linear system matrix = precond matrix:
      Mat Object:       1 MPI processes
        type: seqaij
        rows=7189057, cols=7189057
        total: nonzeros=5.00999e+07, allocated nonzeros=5.00999e+07
        total number of mallocs used during MatSetValues calls =0
          not using I-node routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqaij
    rows=7189057, cols=7189057
    total: nonzeros=5.00999e+07, allocated nonzeros=5.00999e+07
    total number of mallocs used during MatSetValues calls =0
      not using I-node routines
Residual norm 7.3202e-06

  Note that running this way explicitly stores the sparse matrix on each level plus the interpolation matrix between levels.

  With the 7-point stencil the matrix has 7 nonzeros per row. The compressed sparse row storage format needs an integer column index for each nonzero, so storing the matrix takes roughly the memory of 10.5 vectors, while the interpolation matrix likely requires maybe the space of 3 more vectors. Each level needs a matrix plus an interpolation (except the coarsest level, which needs no interpolation), so let's say 14 vectors' worth per level; if the multigrid algorithm itself requires 3 vectors per level (for the right-hand side, solution, and residual computation), then the matrices take, say, 3 times or more the memory of the vectors.
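
  For concreteness, here is that per-level arithmetic written out (the byte counts are assumptions -- 8-byte scalars, 4-byte column indices -- not measurements):

#include <stdio.h>

int main(void)
{
  double scalar   = 8, colindex = 4;             /* bytes per scalar and per index              */
  double matrix   = 7*(scalar + colindex);       /* 7-point stencil: 84 bytes per row           */
  double vector   = scalar;                      /* one vector entry: 8 bytes per row           */
  double interp   = 3.0;                         /* interpolation, a rough guess, in "vectors"  */

  printf("level matrix           ~ %.1f vectors\n", matrix/vector);   /* 10.5                   */
  printf("matrix + interpolation ~ %.1f vectors per level\n",
         matrix/vector + interp);                /* ~13.5, i.e. the "14" above                  */
  return 0;
}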

 If it is a simple constant-coefficient operator then one can, in theory, not store the matrices at all, seemingly freeing up a lot of memory (but at the cost of writing or borrowing a custom code). Sounds great, but now let's ask how many more levels of refinement you could afford if you did not store the matrices. If you refine one more time, the new fine-grid vector will be roughly 8 times the size of the previous one. This swamps the memory previously used for the matrices, so my conclusion is that going matrix-free allows at most one more level of refinement. So ask yourself: is it worth spending a ton of time writing a matrix-free multigrid solver that only gets you one more level of refinement? Note that my estimates are for the 7-point stencil; if you are using a double-wide stencil and 4th-order differencing in your matrix, the matrices will take a lot more memory, so maybe you can get a couple more levels of refinement.

  Barry


>> 
>> Many Thanks
>> John
> 



