[petsc-users] Question about memory usage in Multigrid preconditioner
frank
hengjiew at uci.edu
Fri Sep 9 10:56:35 CDT 2016
Hi,
I want to continue digging into the memory problem here.
I did find a workaround in the past, which is to use fewer cores per
node so that each core has 8 GB of memory. However, this is inefficient and
expensive. I hope to locate the place that uses the most memory.
Here is a brief summary of the tests I did in the past:
> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
  Maximum (over computational time) process memory: total 7.0727e+08
  Current process memory: total 7.0727e+08
  Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
  Current space PetscMalloc()ed: total 1.8275e+09
> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
  Maximum (over computational time) process memory: total 5.9431e+09
  Current process memory: total 5.9431e+09
  Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
  Current space PetscMalloc()ed: total 5.4844e+09
> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
  The OOM (Out Of Memory) killer of the supercomputer terminated the job
  during "KSPSolve".
I attached the output of ksp_view (the third test's output is from
ksp_view_pre), memory_view and also the petsc options.
In all the tests, each core can access about 2 GB of memory. In test3, there
are 4223139840 non-zeros in the matrix. Using double precision, that is only
about 1.74 MB per process. Even allowing some extra memory for the integer
column indices, 2 GB per core should still be far more than enough.
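To spell out the arithmetic behind that estimate (my own numbers, based on
the counts above):

  4223139840 nonzeros * 8 bytes = ~33.8 GB of matrix values in total
  33.8 GB / 18432 processes     = ~1.8 MB per process (1.74 if dividing by 1024^2)
  4223139840 nonzeros * 4 bytes of column indices = ~16.9 GB in total,
                                  i.e. roughly another 0.9 MB per process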
Is there a way to find out which part of KSPSolve uses the most memory?
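For instance, would it be reasonable to bracket the setup and the solve with
PETSc's memory query routines, along the lines of the minimal sketch below?
I am assuming PetscMemoryGetCurrentUsage() and PetscMallocGetCurrentUsage()
(both declared in petscsys.h) are the right calls; the helper name LogMemory
and its output format are just my own.

  #include <petscksp.h>

  /* Report resident memory and PetscMalloc()ed space (total/max/min over comm). */
  PetscErrorCode LogMemory(MPI_Comm comm, const char *stage)
  {
    PetscLogDouble rss, mal;                     /* PetscLogDouble is a double */
    double         loc[2], tot[2], mx[2], mn[2];
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr); /* OS resident set size, bytes     */
    ierr = PetscMallocGetCurrentUsage(&mal);CHKERRQ(ierr); /* bytes currently PetscMalloc()ed */
    loc[0] = rss; loc[1] = mal;
    ierr = MPI_Allreduce(loc, tot, 2, MPI_DOUBLE, MPI_SUM, comm);CHKERRQ(ierr);
    ierr = MPI_Allreduce(loc, mx,  2, MPI_DOUBLE, MPI_MAX, comm);CHKERRQ(ierr);
    ierr = MPI_Allreduce(loc, mn,  2, MPI_DOUBLE, MPI_MIN, comm);CHKERRQ(ierr);
    ierr = PetscPrintf(comm, "[%s] process memory: total %g max %g min %g | "
                       "PetscMalloc()ed: total %g max %g min %g (bytes)\n",
                       stage, tot[0], mx[0], mn[0], tot[1], mx[1], mn[1]);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

  /* Intended usage, with ksp, b, x already created and configured:            */
  /*   LogMemory(PETSC_COMM_WORLD, "before KSPSetUp");                         */
  /*   KSPSetUp(ksp);  -- Galerkin PtAP / PCTelescope setup should happen here */
  /*   LogMemory(PETSC_COMM_WORLD, "after KSPSetUp");                          */
  /*   KSPSolve(ksp, b, x);                                                    */
  /*   LogMemory(PETSC_COMM_WORLD, "after KSPSolve");                          */

If the jump shows up already at "after KSPSetUp", that would point at the
coarse grid construction rather than the Krylov work vectors.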
Thank you so much.
BTW, there are 4 options that remain unused and I don't understand why they
are not used:
-mg_coarse_telescope_mg_coarse_ksp_type value: preonly
-mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
-mg_coarse_telescope_mg_levels_ksp_max_it value: 1
-mg_coarse_telescope_mg_levels_ksp_type value: richardson
Regards,
Frank
On 07/13/2016 05:47 PM, Dave May wrote:
>
>
> On 14 July 2016 at 01:07, frank <hengjiew at uci.edu
> <mailto:hengjiew at uci.edu>> wrote:
>
> Hi Dave,
>
> Sorry for the late reply.
> Thank you so much for your detailed reply.
>
> I have a question about the estimation of the memory usage. There
> are 4223139840 allocated non-zeros and 18432 MPI processes. Double
> precision is used. So the memory per process is:
> 4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ?
> Did I do sth wrong here? Because this seems too small.
>
>
> No - I totally f***ed it up. You are correct. That'll teach me for
> fumbling around with my iphone calculator and not using my brain.
> (Note that to convert to MB just divide by 1e6, not 1024^2 - although
> I apparently cannot convert between units correctly....)
>
> From the PETSc objects associated with the solver, it looks like it
> _should_ run with 2 GB per MPI rank. Sorry for my mistake.
> Possibilities are: somewhere in your usage of PETSc you've introduced
> a memory leak; PETSc is doing a huge over-allocation (e.g. as per our
> discussion of MatPtAP); or in your application code there are other
> objects you have forgotten to log the memory for.
>
>
>
> I am running this job on Blue Waters
> <https://bluewaters.ncsa.illinois.edu/user-guide>
>
> I am using a 7-point FD stencil in 3D.
>
>
> I thought so on both counts.
>
>
> I apologize that I made a stupid mistake in computing the memory
> per core. My settings mean each core can access only 2 GB of memory
> on average, instead of the 8 GB I mentioned in the previous email. I
> re-ran the job with 8 GB of memory per core on average and there was
> no "Out Of Memory" error. I will do more tests to see if there is
> still some memory issue.
>
>
> Ok. I'd still like to know where the memory was being used since my
> estimates were off.
>
>
> Thanks,
> Dave
>
>
> Regards,
> Frank
>
>
>
> On 07/11/2016 01:18 PM, Dave May wrote:
>> Hi Frank,
>>
>>
>> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu
>> <mailto:hengjiew at uci.edu>> wrote:
>>
>> Hi Dave,
>>
>> I re-ran the test using bjacobi as the preconditioner on the
>> coarse mesh of telescope. The grid is 3072*256*768 and the
>> process mesh is 96*8*24. The petsc option file is attached.
>> I still got the "Out Of Memory" error. The error occurred
>> before the linear solver finished one step. So I don't have
>> the full info from ksp_view. The info from ksp_view_pre is
>> attached.
>>
>>
>> Okay - that is essentially useless (sorry)
>>
>>
>> It seems to me that the error occurred when the decomposition
>> was about to be changed.
>>
>>
>> Based on what information?
>> Running with -info would give us more clues, but will create a
>> ton of output.
>> Please try running the case which failed with -info
>>
>> I ran another test with a grid of 1536*128*384 and the same
>> process mesh as above. There was no error. The ksp_view info
>> is attached for comparison.
>> Thank you.
>>
>>
>>
>> [3] Here is my crude estimate of your memory usage.
>> I'll target the biggest memory hogs only to get an order of
>> magnitude estimate
>>
>> * The Fine grid operator contains 4223139840 non-zeros --> 1.8 GB
>> per MPI rank assuming double precision.
>> The indices for the AIJ could amount to another 0.3 GB (assuming
>> 32 bit integers)
>>
>> * You use 5 levels of coarsening, so the other operators should
>> represent (collectively)
>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on
>> the communicator with 18432 ranks.
>> The coarse grid should consume ~ 0.5 MB per MPI rank on the
>> communicator with 18432 ranks.
>>
>> * You use a reduction factor of 64, making the new communicator
>> with 288 MPI ranks.
>> PCTelescope will first gather a temporary matrix associated with
>> your coarse level operator assuming a comm size of 288 living on
>> the comm with size 18432.
>> This matrix will require approximately 0.5 * 64 = 32 MB per core
>> on the 288 ranks.
>> This matrix is then used to form a new MPIAIJ matrix on the
>> subcomm, thus requiring another 32 MB per rank.
>> The temporary matrix is now destroyed.
>>
>> * Because a DMDA is detected, a permutation matrix is assembled.
>> This requires 2 doubles per point in the DMDA.
>> Your coarse DMDA contains 92 x 16 x 48 points.
>> Thus the permutation matrix will require < 1 MB per MPI rank on
>> the sub-comm.
>>
>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the
>> resulting operator will have the same memory footprint as the
>> unpermuted matrix (32 MB). At any stage in PCTelescope, only 2
>> operators of size 32 MB are held in memory when the DMDA is provided.
>>
>> From my rough estimates, the worst-case memory footprint for any
>> given core, given your options, is approximately
>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>> This is way below 8 GB.
>>
>> Note this estimate completely ignores:
>> (1) the memory required for the restriction operator,
>> (2) the potential growth in the number of non-zeros per row due
>> to Galerkin coarsening (I wish -ksp_view_pre reported the
>> output from MatView so we could see the number of non-zeros
>> required by the coarse level operators)
>> (3) all temporary vectors required by the CG solver, and those
>> required by the smoothers.
>> (4) internal memory allocated by MatPtAP
>> (5) memory associated with IS's used within PCTelescope
>>
>> So either I am completely off in my estimates, or you have not
>> carefully estimated the memory usage of your application code.
>> Hopefully others might examine/correct my rough estimates.
>>
>> Since I don't have your code I cannot assess the latter.
>> Since I don't have access to the same machine you are running on,
>> I think we need to take a step back.
>>
>> [1] What machine are you running on? Send me a URL if it's available
>>
>> [2] What discretization are you using? (I am guessing a scalar 7
>> point FD stencil)
>> If it's a 7 point FD stencil, we should be able to examine the
>> memory usage of your solver configuration using a standard,
>> lightweight existing PETSc example, run on your machine at the same
>> scale.
>> This would hopefully enable us to correctly evaluate the actual
>> memory usage required by the solver configuration you are using.
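>>
>> For instance (assuming it really is a scalar 7-point stencil), I would
>> expect a run of KSP tutorial ex45 (a 3D Laplacian/Poisson problem on a
>> DMDA) at the same scale with essentially your options, something like
>>
>>   mpiexec -n 18432 ./ex45 -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 \
>>     -ksp_type cg -pc_type mg -pc_mg_galerkin -pc_mg_levels 4 \
>>     -mg_coarse_pc_type telescope -mg_coarse_pc_telescope_reduction_factor 64 \
>>     -memory_view -log_view
>>
>> (substitute your machine's job launcher for mpiexec), to build essentially
>> the same operator hierarchy. Treat this as a sketch only - I haven't
>> checked the exact example or option names against your PETSc version.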
>>
>> Thanks,
>> Dave
>>
>>
>>
>> Frank
>>
>>
>>
>>
>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>
>>>
>>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu
>>> <mailto:hengjiew at uci.edu>> wrote:
>>>
>>> Hi Barry and Dave,
>>>
>>> Thank both of you for the advice.
>>>
>>> @Barry
>>> I made a mistake in the file names in the last email. I
>>> attached the correct files this time.
>>> For all the three tests, 'Telescope' is used as the
>>> coarse preconditioner.
>>>
>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>> Part of the memory usage: Vector 125 124 3971904 0.
>>>                           Matrix 101 101 9462372 0.
>>>
>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>> Part of the memory usage: Vector 125 124  681672 0.
>>>                           Matrix 101 101 1462180 0.
>>>
>>> In theory, the memory usage in Test1 should be 8 times
>>> that of Test2. In my case, it is about 6 times.
>>>
>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24.
>>> Sub-domain per process: 32*32*32
>>> Here I get the out of memory error.
>>>
>>> I tried to use -mg_coarse jacobi. In this way, I don't
>>> need to set -mg_coarse_ksp_type and -mg_coarse_pc_type
>>> explicitly, right?
>>> The linear solver didn't work in this case. PETSc output
>>> some errors.
>>>
>>> @Dave
>>> In test3, I use only one instance of 'Telescope'. On the
>>> coarse mesh of 'Telescope', I used LU as the
>>> preconditioner instead of SVD.
>>> If I set the levels correctly, then on the last coarse
>>> mesh of MG, where it calls 'Telescope', the sub-domain
>>> per process is 2*2*2.
>>> On the last coarse mesh of 'Telescope', there is only
>>> one grid point per process.
>>> I still got the OOM error. The detailed petsc option
>>> file is attached.
>>>
>>>
>>> Do you understand the expected memory usage for the
>>> particular parallel LU implementation you are using? I don't
>>> (seriously). Replace LU with bjacobi and re-run this
>>> test. My point about solver debugging is still valid.
>>>
>>> And please send the result of KSPView so we can see what is
>>> actually used in the computations
>>>
>>> Thanks
>>> Dave
>>>
>>>
>>>
>>> Thank you so much.
>>>
>>> Frank
>>>
>>>
>>>
>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>
>>> On Jul 6, 2016, at 4:19 PM, frank
>>> <hengjiew at uci.edu <mailto:hengjiew at uci.edu>> wrote:
>>>
>>> Hi Barry,
>>>
>>> Thank you for your advice.
>>> I tried three tests. In the 1st test, the grid is
>>> 3072*256*768 and the process mesh is 96*8*24.
>>> The linear solver is 'cg', the preconditioner is
>>> 'mg' and 'telescope' is used as the
>>> preconditioner at the coarse mesh.
>>> The system gives me the "Out of Memory" error
>>> before the linear system is completely solved.
>>> The info from '-ksp_view_pre' is attached. It
>>> seems to me that the error occurs when it
>>> reaches the coarse mesh.
>>>
>>> The 2nd test uses a grid of 1536*128*384 and
>>> process mesh is 96*8*24. The 3rd test uses the
>>> same grid but a different process mesh 48*4*12.
>>>
>>> Are you sure this is right? The total matrix and
>>> vector memory usage goes from the 2nd test
>>> Vector 384 383  8,193,712 0.
>>> Matrix 103 103 11,508,688 0.
>>> to the 3rd test
>>> Vector 384 383  1,590,520 0.
>>> Matrix 103 103  3,508,664 0.
>>> That is, the memory usage got smaller, but if you have
>>> only 1/8th the processes and the same grid it should
>>> have gotten about 8 times bigger. Did you maybe cut
>>> the grid by a factor of 8 also? If so, that still
>>> doesn't explain it, because the memory usage changed
>>> by a factor of 5 something for the vectors and 3
>>> something for the matrices.
>>>
>>>
>>> The linear solver and petsc options in the 2nd and
>>> 3rd tests are the same as in the 1st test. The linear
>>> solver works fine in both tests.
>>> I attached the memory usage of the 2nd and 3rd
>>> tests. The memory info is from the option
>>> '-log_summary'. I tried to use '-memory_info' as
>>> you suggested, but in my case petsc treated it
>>> as an unused option. It output nothing about the
>>> memory. Do I need to add something to my code so I can
>>> use '-memory_info'?
>>>
>>> Sorry, my mistake, the option is -memory_view
>>>
>>> Can you run the one case with -memory_view and
>>> -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't
>>> iterate forever) to see how much memory is used
>>> without the telescope? Also run case 2 the same way.
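>>>
>>> (To be concrete, that would mean adding something like
>>>
>>> -memory_view
>>> -ksp_max_it 1
>>> -mg_coarse_ksp_type preonly
>>> -mg_coarse_pc_type jacobi
>>>
>>> to the options file for that run, in place of the telescope coarse
>>> solve. "-mg_coarse jacobi" above is shorthand, so adjust the option
>>> names as needed.)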
>>>
>>> Barry
>>>
>>>
>>>
>>> In both tests the memory usage is not large.
>>>
>>> It seems to me that it might be the 'telescope'
>>> preconditioner that allocated a lot of memory
>>> and caused the error in the 1st test.
>>> Is there a way to show how much memory it
>>> allocated?
>>>
>>> Frank
>>>
>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>
>>> Frank,
>>>
>>> You can run with -ksp_view_pre to have
>>> it "view" the KSP before the solve so
>>> hopefully it gets that far.
>>>
>>> Please run the problem that does fit
>>> with -memory_info; when the problem completes
>>> it will show the "high water mark" for PETSc
>>> allocated memory and total memory used. We
>>> first want to look at these numbers to see
>>> if it is using more memory than you expect.
>>> You could also run with say half the grid
>>> spacing to see how the memory usage scaled
>>> with the increase in grid points. Make the
>>> runs also with -log_view and send all the
>>> output from these options.
>>>
>>> Barry
>>>
>>> On Jul 5, 2016, at 5:23 PM, frank
>>> <hengjiew at uci.edu
>>> <mailto:hengjiew at uci.edu>> wrote:
>>>
>>> Hi,
>>>
>>> I am using the CG ksp solver and
>>> Multigrid preconditioner to solve a
>>> linear system in parallel.
>>> I chose to use the 'Telescope' as the
>>> preconditioner on the coarse mesh for
>>> its good performance.
>>> The petsc options file is attached.
>>>
>>> The domain is a 3D box.
>>> It works well when the grid is
>>> 1536*128*384 and the process mesh is
>>> 96*8*24. When I double the size of the grid
>>> and keep the same process mesh and petsc
>>> options, I get an "out of memory" error
>>> from the super-cluster I am using.
>>> Each process has access to at least 8 GB of
>>> memory, which should be more than enough
>>> for my application. I am sure that all
>>> the other parts of my code (except the
>>> linear solver) do not use much memory.
>>> So I suspect there is something wrong
>>> with the linear solver.
>>> The error occurs before the linear
>>> system is completely solved, so I don't
>>> have the info from ksp_view. I am not
>>> able to reproduce the error with a
>>> smaller problem either.
>>> In addition, I tried to use block
>>> jacobi as the preconditioner with the
>>> same grid and same decomposition. The
>>> linear solver runs extremely slowly but
>>> there is no memory error.
>>>
>>> How can I diagnose what exactly causes
>>> the error?
>>> Thank you so much.
>>>
>>> Frank
>>> <petsc_options.txt>
>>>
>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>
>>>
>>
>>
>
>
-------------- next part --------------
Linear solve converged due to CONVERGED_ATOL iterations 0
KSP Object: 2304 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 2304 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=4 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 2304 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 2304 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 2304 , subcomm_size = 36
Telescope: subcomm type: interlaced
Telescope: DMDA detected
DMDA Object: (mg_coarse_telescope_repart_) 36 MPI processes
M 192 N 16 P 48 m 12 n 1 p 3 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 36 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (mg_coarse_telescope_) 36 MPI processes
type: mg
PC has not been set up so information may be incomplete
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 36 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 36 MPI processes
type: redundant
PC has not been set up so information may be incomplete
Redundant preconditioner: Not yet setup
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 36 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 36 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 36 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 36 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 36 MPI processes
type: mpiaij
rows=147456, cols=147456
total: nonzeros=1013760, allocated nonzeros=1013760
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
linear system matrix = precond matrix:
Mat Object: 2304 MPI processes
type: mpiaij
rows=147456, cols=147456
total: nonzeros=1013760, allocated nonzeros=1013760
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 2304 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 2304 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 2304 MPI processes
type: mpiaij
rows=1179648, cols=1179648
total: nonzeros=8183808, allocated nonzeros=8183808
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 2304 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 2304 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 2304 MPI processes
type: mpiaij
rows=9437184, cols=9437184
total: nonzeros=65765376, allocated nonzeros=65765376
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 2304 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 2304 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 2304 MPI processes
type: mpiaij
rows=75497472, cols=75497472
total: nonzeros=527302656, allocated nonzeros=527302656
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 2304 MPI processes
type: mpiaij
rows=75497472, cols=75497472
total: nonzeros=527302656, allocated nonzeros=527302656
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
KSP Object: 18432 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 18432 MPI processes
type: mg
PC has not been set up so information may be incomplete
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 18432 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (mg_coarse_) 18432 MPI processes
type: redundant
PC has not been set up so information may be incomplete
Redundant preconditioner: Not yet setup
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 18432 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 18432 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 18432 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 18432 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=75497472, cols=75497472
total: nonzeros=527302656, allocated nonzeros=527302656
total number of mallocs used during MatSetValues calls =0
has attached null space
Linear solve converged due to CONVERGED_RTOL iterations 131
KSP Object: 18432 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 18432 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 18432 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 18432 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 18432 , subcomm_size = 288
Telescope: subcomm type: interlaced
Telescope: DMDA detected
DMDA Object: (mg_coarse_telescope_repart_) 288 MPI processes
M 384 N 32 P 96 m 24 n 2 p 6 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 288 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 288 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 288 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 288 MPI processes
type: bjacobi
block Jacobi: number of blocks = 288
Local solve is same for all blocks, in the following KSP and PC objects:
linear system matrix = precond matrix:
Mat Object: 288 MPI processes
type: mpiaij
rows=18432, cols=18432
total: nonzeros=124416, allocated nonzeros=124416
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 288 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 288 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 288 MPI processes
type: mpiaij
rows=147456, cols=147456
total: nonzeros=1013760, allocated nonzeros=1013760
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 288 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 288 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 288 MPI processes
type: mpiaij
rows=1179648, cols=1179648
total: nonzeros=8183808, allocated nonzeros=8183808
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 288 MPI processes
type: mpiaij
rows=1179648, cols=1179648
total: nonzeros=8183808, allocated nonzeros=8183808
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_sub_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_sub_) 1 MPI processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=64, cols=64
package used to perform factorization: petsc
total: nonzeros=352, allocated nonzeros=352
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=64, cols=64
total: nonzeros=352, allocated nonzeros=352
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=1179648, cols=1179648
total: nonzeros=8183808, allocated nonzeros=8183808
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 18432 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 18432 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=9437184, cols=9437184
total: nonzeros=65765376, allocated nonzeros=65765376
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 18432 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 18432 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=75497472, cols=75497472
total: nonzeros=527302656, allocated nonzeros=527302656
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=75497472, cols=75497472
total: nonzeros=527302656, allocated nonzeros=527302656
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
KSP Object: 18432 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 18432 MPI processes
type: mg
PC has not been set up so information may be incomplete
MG: type is MULTIPLICATIVE, levels=4 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 18432 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (mg_coarse_) 18432 MPI processes
type: redundant
PC has not been set up so information may be incomplete
Redundant preconditioner: Not yet setup
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 18432 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 18432 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 18432 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 18432 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 18432 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0., max = 0.
maximum iterations=2, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 18432 MPI processes
type: sor
PC has not been set up so information may be incomplete
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 18432 MPI processes
type: mpiaij
rows=603979776, cols=603979776
total: nonzeros=4223139840, allocated nonzeros=4223139840
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
Summary of Memory Usage in PETSc
Maximum (over computational time) process memory: total 7.0727e+08 max 3.0916e+05 min 3.0554e+05
Current process memory: total 7.0727e+08 max 3.0916e+05 min 3.0554e+05
Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11 max 2.7844e+08 min 2.7726e+08
Current space PetscMalloc()ed: total 1.8275e+09 max 8.1280e+05 min 7.7357e+05
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Viewer 6 5 4160 0.
Vector 69 68 3491192 0.
Vector Scatter 16 12 46144 0.
Matrix 59 59 7940068 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 5 1 5104 0.
Star Forest Bipartite Graph 10 2 1696 0.
Discrete System 5 1 876 0.
Index Set 33 33 255408 0.
IS L to G Mapping 5 1 21360 0.
Krylov Solver 9 9 11112 0.
DMKSP interface 3 0 0 0.
Preconditioner 9 9 8880 0.
-------------- next part --------------
Summary of Memory Usage in PETSc
Maximum (over computational time) process memory: total 5.9431e+09 max 3.2733e+05 min 3.1110e+05
Current process memory: total 5.9431e+09 max 3.2733e+05 min 3.1110e+05
Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12 max 2.8973e+08 min 2.8858e+08
Current space PetscMalloc()ed: total 5.4844e+09 max 3.0005e+05 min 2.9005e+05
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Viewer 9 8 6656 0.
Vector 480 479 6915440 0.
Vector Scatter 19 16 55040 0.
Matrix 73 73 2588332 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 6 3 15312 0.
Star Forest Bipartite Graph 12 6 5088 0.
Discrete System 6 3 2628 0.
Index Set 42 42 98496 0.
IS L to G Mapping 6 3 28224 0.
Krylov Solver 9 9 11032 0.
DMKSP interface 4 2 1296 0.
Preconditioner 9 9 8992 0.
-------------- next part --------------
-ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_lag_norm
-ksp_rtol 1e-7
-ksp_initial_guess_nonzero yes
-ksp_converged_reason
-ppe_max_iter 50
-pc_type mg
-pc_mg_galerkin
-pc_mg_levels 4
-mg_levels_ksp_type richardson
-mg_levels_ksp_max_it 1
-mg_coarse_ksp_type preonly
-mg_coarse_pc_type telescope
-mg_coarse_pc_telescope_reduction_factor 64
-log_view
-memory_view
-matptap_scalable
-ksp_view
-options_left 1
# Setting dmdarepart on subcomm
-mg_coarse_telescope_repart_da_processors_x 12
-mg_coarse_telescope_repart_da_processors_y 1
-mg_coarse_telescope_repart_da_processors_z 3
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_pc_type mg
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type bjacobi
-------------- next part --------------
-ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_lag_norm
-ksp_rtol 1e-7
-ksp_initial_guess_nonzero yes
-ksp_converged_reason
-ppe_max_iter 50
-pc_type mg
-pc_mg_galerkin
-pc_mg_levels 3
-mg_levels_ksp_type richardson
-mg_levels_ksp_max_it 1
-mg_coarse_ksp_type preonly
-mg_coarse_pc_type telescope
-mg_coarse_pc_telescope_reduction_factor 64
-log_view
-memory_view
-matptap_scalable
-ksp_view
-options_left 1
# Setting dmdarepart on subcomm
-mg_coarse_telescope_repart_da_processors_x 24
-mg_coarse_telescope_repart_da_processors_y 2
-mg_coarse_telescope_repart_da_processors_z 6
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_pc_type mg
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type bjacobi
-------------- next part --------------
-ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_lag_norm
-ksp_rtol 1e-7
-ksp_initial_guess_nonzero yes
-ksp_converged_reason
-ppe_max_iter 50
-pc_type mg
-pc_mg_galerkin
-pc_mg_levels 4
-mg_levels_ksp_type richardson
-mg_levels_ksp_max_it 1
-mg_coarse_ksp_type preonly
-mg_coarse_pc_type telescope
-mg_coarse_pc_telescope_reduction_factor 64
-log_view
-memory_view
-matptap_scalable
-ksp_view
-ksp_view_pre
-options_left 1
# Setting dmdarepart on subcomm
-mg_coarse_telescope_repart_da_processors_x 24
-mg_coarse_telescope_repart_da_processors_y 2
-mg_coarse_telescope_repart_da_processors_z 6
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_pc_type mg
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type bjacobi