[petsc-users] Question about memory usage in Multigrid preconditioner
Hengjie Wang
hengjiew at uci.edu
Fri Sep 16 12:53:26 CDT 2016
Hi Dave,
I added both options and tested them by solving the Poisson equation on a
1024^3 grid with 32^3 cores. This test used to give the OOM error; now it
runs well.
I have attached the ksp_view and log_view output in case you want to take a look.
I also tested my original code with those PETSc options by simulating
decaying turbulence on a 1024^3 grid. It works as well. I am going to test
the code at a larger scale; if any problem shows up then, I will let you
know.
This really helps a lot. Thank you so much.
Regards,
Frank
On 9/15/2016 3:35 AM, Dave May wrote:
> Hi all,
>
> The only unexpected memory usage I can see is associated with the
> call to MatPtAP().
> Here is something you can try immediately.
> Run your code with the additional options
> -matrap 0 -matptap_scalable
>
> I didn't realize this before, but the default behaviour of MatPtAP in
> parallel is actually to explicitly form the transpose of P (i.e.
> assemble R = P^T) and then compute R.A.P.
> You don't want to do this. The option -matrap 0 resolves this issue.
>
> The implementation of P^T.A.P has two variants.
> The scalable implementation (with respect to memory usage) is selected
> via the second option -matptap_scalable.
>
> Try it out - I see a significant memory reduction using these options
> for particular mesh sizes / partitions.
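>
> For reference, here is a minimal, self-contained sketch (PETSc C API,
> v3.7-era signatures) of the triple product in question; it is roughly
> what PCMG does internally when Galerkin coarse operators are requested.
> The 17^3 grid size is an arbitrary small example:
>
>   #include <petscdmda.h>
>
>   int main(int argc, char **argv)
>   {
>     DM             daf, dac;
>     Mat            A, P, Ac;
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>     ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
>                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 17, 17, 17,
>                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
>                         1, 1, NULL, NULL, NULL, &daf);CHKERRQ(ierr);
>     ierr = DMCoarsen(daf, MPI_COMM_NULL, &dac);CHKERRQ(ierr);
>     ierr = DMCreateInterpolation(dac, daf, &P, NULL);CHKERRQ(ierr);
>     /* fine-level operator; the 7-point stencil values are omitted here */
>     ierr = DMCreateMatrix(daf, &A);CHKERRQ(ierr);
>     /* the Galerkin product Ac = P^T A P; with -matrap 0 the transpose
>        R = P^T is never assembled explicitly, and -matptap_scalable
>        selects the memory-scalable variant of this product */
>     ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, &Ac);CHKERRQ(ierr);
>     ierr = MatDestroy(&Ac);CHKERRQ(ierr);
>     ierr = MatDestroy(&P);CHKERRQ(ierr);
>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>     ierr = DMDestroy(&dac);CHKERRQ(ierr);
>     ierr = DMDestroy(&daf);CHKERRQ(ierr);
>     ierr = PetscFinalize();
>     return 0;
>   }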
>
> I've attached a cleaned-up version of the code you sent me.
> There were a number of memory leaks and other issues.
> The main points being (see the sketch below):
> * You should call DMDAVecRestoreArrayF90() before VecAssembly{Begin,End}()
> * You should call PetscFinalize(), otherwise the option -log_summary
> (-log_view) will not display anything once the program has completed.
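>
> For concreteness, a minimal sketch of that call sequence, written with
> the C names (a Fortran code would use the ...F90 variants); the 64^3
> grid and the variable names are placeholders:
>
>   #include <petscdmda.h>
>
>   int main(int argc, char **argv)
>   {
>     DM             da;
>     Vec            x;
>     PetscScalar ***arr;
>     PetscInt       i, j, k, xs, ys, zs, xm, ym, zm;
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>     ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
>                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 64, 64, 64,
>                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
>                         1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
>     ierr = DMCreateGlobalVector(da, &x);CHKERRQ(ierr);
>     ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);CHKERRQ(ierr);
>     ierr = DMDAVecGetArray(da, x, &arr);CHKERRQ(ierr);
>     for (k = zs; k < zs + zm; k++)
>       for (j = ys; j < ys + ym; j++)
>         for (i = xs; i < xs + xm; i++) arr[k][j][i] = 1.0;
>     /* restore the array BEFORE any assembly or other use of the vector */
>     ierr = DMDAVecRestoreArray(da, x, &arr);CHKERRQ(ierr);
>     ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
>     ierr = VecAssemblyEnd(x);CHKERRQ(ierr);
>     /* ... set up and run the solve ... */
>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>     ierr = DMDestroy(&da);CHKERRQ(ierr);
>     /* without PetscFinalize(), -log_summary/-log_view print nothing */
>     ierr = PetscFinalize();
>     return 0;
>   }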
>
>
> Thanks,
> Dave
>
>
> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew at uci.edu
> <mailto:hengjiew at uci.edu>> wrote:
>
> Hi Dave,
>
> Sorry, I should have put more comments in to explain the code.
> The number of processes in each dimension is the same: Px = Py = Pz = P.
> So is the domain size.
> So if you want to run the code on a 512^3 grid with
> 16^3 cores, you need to set "-N 512 -P 16" on the command line.
> I added more comments and also fixed an error in the attached code.
> (The error only affects the accuracy of the solution, not the memory
> usage.)
>
> Thank you.
> Frank
>
>
> On 9/14/2016 9:05 PM, Dave May wrote:
>>
>>
>> On Thursday, 15 September 2016, Dave May <dave.mayhem23 at gmail.com
>> <mailto:dave.mayhem23 at gmail.com>> wrote:
>>
>>
>>
>> On Thursday, 15 September 2016, frank <hengjiew at uci.edu> wrote:
>>
>> Hi,
>>
>> I wrote a simple code to reproduce the error. I hope
>> this can help to diagnose the problem.
>> The code just solves a 3D Poisson equation.
>>
>>
>> Why is the stencil width a runtime parameter? And why is the
>> default value 2? For the 7-point FD Laplacian, you only need
>> a stencil width of 1.
>>
>> Was this choice made to mimic something in the
>> real application code?
>>
>>
>> Please ignore - I misunderstood your usage of the param set by -P
>>
>>
>> I ran the code on a 1024^3 mesh. The process partition is
>> 32 * 32 * 32. That is when I reproduce the OOM error.
>> Each core has about 2 GB of memory.
>> I also ran the code on a 512^3 mesh with 16 * 16 * 16
>> processes, and the KSP solver works fine.
>> I attached the code, the ksp_view_pre output, and my PETSc
>> options file.
>>
>> Thank you.
>> Frank
>>
>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>> Hi Barry,
>>>
>>> I checked. On the supercomputer, I had the option
>>> "-ksp_view_pre" but it is not in the file I sent you. I am
>>> sorry for the confusion.
>>>
>>> Regards,
>>> Frank
>>>
>>> On Friday, September 9, 2016, Barry Smith
>>> <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>> > On Sep 9, 2016, at 3:11 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >
>>> > Hi Barry,
>>> >
>>> > I think the first KSP view output is from
>>> -ksp_view_pre. Before I submitted the test, I was
>>> not sure whether there would be an OOM error or not, so
>>> I added both -ksp_view_pre and -ksp_view.
>>>
>>> But the options file you sent specifically does
>>> NOT list the -ksp_view_pre so how could it be from that?
>>>
>>> Sorry to be pedantic but I've spent too much time
>>> in the past trying to debug from incorrect
>>> information and want to make sure that the
>>> information I have is correct before thinking.
>>> Please recheck exactly what happened. Rerun with the
>>> exact input file you emailed if that is needed.
>>>
>>> Barry
>>>
>>> >
>>> > Frank
>>> >
>>> >
>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>> >> Why does ksp_view2.txt have two KSP views in it
>>> while ksp_view1.txt has only one KSPView in it? Did
>>> you run two different solves in the second case but not
>>> in the first?
>>> >>
>>> >> Barry
>>> >>
>>> >>
>>> >>
>>> >>> On Sep 9, 2016, at 10:56 AM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I want to continue digging into the memory
>>> problem here.
>>> >>> I did find a workaround in the past, which is
>>> to use fewer cores per node so that each core has 8 GB of
>>> memory. However, this is inefficient and expensive. I
>>> hope to locate the place that uses the most memory.
>>> >>>
>>> >>> Here is a brief summary of the tests I did in the past:
>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>> >>> Maximum (over computational time) process memory: total 7.0727e+08
>>> >>> Current process memory: total 7.0727e+08
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>> >>> Current space PetscMalloc()ed: total 1.8275e+09
>>> >>>
>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>> >>> Maximum (over computational time) process memory: total 5.9431e+09
>>> >>> Current process memory: total 5.9431e+09
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>> >>> Current space PetscMalloc()ed: total 5.4844e+09
>>> >>>
>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>> >>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>> >>>
>>> >>> I attached the output of ksp_view (the third
>>> test's output is from ksp_view_pre), memory_view,
>>> and also the PETSc options.
>>> >>>
>>> >>> In all the tests, each core can access about 2 GB of
>>> memory. In test3, there are 4223139840 non-zeros in
>>> the matrix. This will consume about 1.74 MB per process,
>>> using double precision. Considering some extra memory used
>>> to store the integer indices, 2 GB of memory should still be
>>> more than enough.
>>> >>>
>>> >>> Is there a way to find out which part of
>>> KSPSolve uses the most memory?
>>> >>> Thank you so much.
>>> >>>
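>>> >>> One way to get at this programmatically is a sketch along these
>>> >>> lines, using PETSc's memory-query routines; bracketing individual
>>> >>> phases (PCSetUp, each KSPSolve) with the Get calls narrows down
>>> >>> which part allocates the most:
>>> >>>
>>> >>>   #include <petscsys.h>
>>> >>>
>>> >>>   int main(int argc, char **argv)
>>> >>>   {
>>> >>>     PetscLogDouble rss, mal;
>>> >>>     PetscErrorCode ierr;
>>> >>>
>>> >>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>>> >>>     /* must be called early so the maxima are tracked */
>>> >>>     ierr = PetscMemorySetGetMaximumUsage();CHKERRQ(ierr);
>>> >>>     /* ... the phase of interest (e.g. KSPSolve) runs here ... */
>>> >>>     ierr = PetscMemoryGetMaximumUsage(&rss);CHKERRQ(ierr); /* peak resident size */
>>> >>>     ierr = PetscMallocGetMaximumUsage(&mal);CHKERRQ(ierr); /* peak PetscMalloc()ed */
>>> >>>     ierr = PetscPrintf(PETSC_COMM_WORLD,
>>> >>>                        "max process memory %g bytes, max PetscMalloc %g bytes\n",
>>> >>>                        rss, mal);CHKERRQ(ierr);
>>> >>>     ierr = PetscFinalize();
>>> >>>     return 0;
>>> >>>   }
>>> >>>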
>>> >>> BTW, there are 4 options that remain unused and I
>>> don't understand why they are ignored:
>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value:
>>> preonly
>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value:
>>> bjacobi
>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value:
>>> richardson
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>> Frank
>>> >>>
>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>> >>>>
>>> >>>> On 14 July 2016 at 01:07, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>> Hi Dave,
>>> >>>>
>>> >>>> Sorry for the late reply.
>>> >>>> Thank you so much for your detailed reply.
>>> >>>>
>>> >>>> I have a question about the estimation of the
>>> memory usage. There are 4223139840 allocated
>>> non-zeros and 18432 MPI processes. Double precision
>>> is used. So the memory per process is:
>>> >>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
>>> >>>> Did I do something wrong here? Because this seems too
>>> small.
>>> >>>>
>>> >>>> No - I totally f***ed it up. You are correct.
>>> That'll teach me for fumbling around with my iphone
>>> calculator and not using my brain. (Note that to
>>> convert to MB just divide by 1e6, not 1024^2 -
>>> although I apparently cannot convert between units
>>> correctly....)
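>>> >>>>
>>> >>>> Writing that conversion out in both unit conventions:
>>> >>>> 4223139840 * 8 / 18432 = 1832960 bytes per rank,
>>> >>>> i.e. 1832960 / 1e6 ~ 1.83 MB, or 1832960 / 1024^2 ~ 1.75 MiB.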
>>> >>>>
>>> >>>> From the PETSc objects associated with the
>>> solver, it looks like it _should_ run with 2 GB per
>>> MPI rank. Sorry for my mistake. Possibilities are:
>>> somewhere in your usage of PETSc you've introduced a
>>> memory leak; PETSc is doing a huge over allocation
>>> (e.g. as per our discussion of MatPtAP); or in your
>>> application code there are other objects you have
>>> forgotten to log the memory for.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> I am running this job on Blue Waters.
>>> >>>> I am using the 7-point FD stencil in 3D.
>>> >>>>
>>> >>>> I thought so on both counts.
>>> >>>>
>>> >>>> I apologize that I made a stupid mistake in
>>> computing the memory per core. My settings meant
>>> each core could access only 2 GB of memory on average,
>>> instead of the 8 GB I mentioned in a previous email. I
>>> re-ran the job with 8 GB of memory per core on average
>>> and there is no "Out Of Memory" error. I will do
>>> more tests to see if there is still some memory issue.
>>> >>>>
>>> >>>> Ok. I'd still like to know where the memory was
>>> being used since my estimates were off.
>>> >>>>
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Dave
>>> >>>>
>>> >>>> Regards,
>>> >>>> Frank
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>> >>>>> Hi Frank,
>>> >>>>>
>>> >>>>>
>>> >>>>> On 11 July 2016 at 19:14, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>> Hi Dave,
>>> >>>>>
>>> >>>>> I re-ran the test using bjacobi as the
>>> preconditioner on the coarse mesh of telescope. The
>>> grid is 3072*256*768 and the process mesh is 96*8*24.
>>> The PETSc options file is attached.
>>> >>>>> I still got the "Out Of Memory" error. The
>>> error occurred before the linear solver finished one
>>> step, so I don't have the full info from ksp_view.
>>> The info from ksp_view_pre is attached.
>>> >>>>>
>>> >>>>> Okay - that is essentially useless (sorry)
>>> >>>>>
>>> >>>>> It seems to me that the error occurred when
>>> the decomposition was going to be changed.
>>> >>>>>
>>> >>>>> Based on what information?
>>> >>>>> Running with -info would give us more clues,
>>> but will create a ton of output.
>>> >>>>> Please try running the case that failed with
>>> -info.
>>> >>>>> I ran another test with a grid of
>>> 1536*128*384 and the same process mesh as above.
>>> There was no error. The ksp_view info is attached
>>> for comparison.
>>> >>>>> Thank you.
>>> >>>>>
>>> >>>>>
>>> >>>>> [3] Here is my crude estimate of your memory
>>> usage.
>>> >>>>> I'll target the biggest memory hogs only to
>>> get an order of magnitude estimate
>>> >>>>>
>>> >>>>> * The Fine grid operator contains 4223139840
>>> non-zeros --> 1.8 GB per MPI rank assuming double
>>> precision.
>>> >>>>> The indices for the AIJ could amount to
>>> another 0.3 GB (assuming 32 bit integers)
>>> >>>>>
>>> >>>>> * You use 5 levels of coarsening, so the other
>>> operators should represent (collectively)
>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300
>>> MB per MPI rank on the communicator with 18432 ranks.
>>> >>>>> The coarse grid should consume ~ 0.5 MB per
>>> MPI rank on the communicator with 18432 ranks.
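>>> >>>>> Summed explicitly (each coarsening level shrinks the grid
>>> >>>>> by 2^3 = 8), that geometric series is
>>> >>>>> 2.1 GB * (1/8 + 1/64 + 1/512 + 1/4096) ~ 2.1 GB * 0.143 ~ 0.30 GB,
>>> >>>>> consistent with the ~300 MB figure.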
>>> >>>>>
>>> >>>>> * You use a reduction factor of 64, making the
>>> new communicator with 288 MPI ranks.
>>> >>>>> PCTelescope will first gather a temporary
>>> matrix associated with your coarse level operator
>>> assuming a comm size of 288 living on the comm with
>>> size 18432.
>>> >>>>> This matrix will require approximately 0.5 *
>>> 64 = 32 MB per core on the 288 ranks.
>>> >>>>> This matrix is then used to form a new MPIAIJ
>>> matrix on the subcomm, thus requiring another 32 MB
>>> per rank.
>>> >>>>> The temporary matrix is now destroyed.
>>> >>>>>
>>> >>>>> * Because a DMDA is detected, a permutation
>>> matrix is assembled.
>>> >>>>> This requires 2 doubles per point in the DMDA.
>>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>> >>>>> Thus the permutation matrix will require < 1
>>> MB per MPI rank on the sub-comm.
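>>> >>>>> Checking that count: 92 x 16 x 48 = 70656 points, and
>>> >>>>> 70656 * 2 * 8 bytes ~ 1.1 MB in total across the sub-comm,
>>> >>>>> so well under 1 MB per rank.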
>>> >>>>>
>>> >>>>> * Lastly, the matrix is permuted. This uses
>>> MatPtAP(), but the resulting operator will have the
>>> same memory footprint as the unpermuted matrix (32
>>> MB). At any stage in PCTelescope, only 2 operators
>>> of size 32 MB are held in memory when the DMDA is
>>> provided.
>>> >>>>>
>>> >>>>> From my rough estimates, the worst-case memory
>>> footprint for any given core, given your options, is
>>> approximately
>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>> >>>>> This is way below 8 GB.
>>> >>>>>
>>> >>>>> Note this estimate completely ignores:
>>> >>>>> (1) the memory required for the restriction
>>> operator,
>>> >>>>> (2) the potential growth in the number of
>>> non-zeros per row due to Galerkin coarsening (I
>>> wish -ksp_view_pre reported the output from
>>> MatView so we could see the number of non-zeros
>>> required by the coarse-level operators)
>>> >>>>> (3) all temporary vectors required by the CG
>>> solver, and those required by the smoothers.
>>> >>>>> (4) internal memory allocated by MatPtAP
>>> >>>>> (5) memory associated with IS's used within
>>> PCTelescope
>>> >>>>>
>>> >>>>> So either I am completely off in my estimates,
>>> or you have not carefully estimated the memory usage
>>> of your application code. Hopefully others might
>>> examine/correct my rough estimates
>>> >>>>>
>>> >>>>> Since I don't have your code I cannot assess
>>> the latter.
>>> >>>>> Since I don't have access to the same machine
>>> you are running on, I think we need to take a step back.
>>> >>>>>
>>> >>>>> [1] What machine are you running on? Send me a
>>> URL if it's available.
>>> >>>>>
>>> >>>>> [2] What discretization are you using? (I am
>>> guessing a scalar 7-point FD stencil.)
>>> >>>>> If it's a 7-point FD stencil, we should be
>>> able to examine the memory usage of your solver
>>> configuration using a standard, lightweight,
>>> existing PETSc example run on your machine at the
>>> same scale.
>>> >>>>> This would hopefully enable us to correctly
>>> evaluate the actual memory usage required by the
>>> solver configuration you are using.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Dave
>>> >>>>>
>>> >>>>>
>>> >>>>> Frank
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>> >>>>>>
>>> >>>>>> On Saturday, 9 July 2016, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>> Hi Barry and Dave,
>>> >>>>>>
>>> >>>>>> Thank both of you for the advice.
>>> >>>>>>
>>> >>>>>> @Barry
>>> >>>>>> I made a mistake in the file names in the last
>>> email. I attached the correct files this time.
>>> >>>>>> For all three tests, 'Telescope' is used
>>> as the coarse preconditioner.
>>> >>>>>>
>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>> >>>>>> Part of the memory usage: Vector 125 124 3971904 0.
>>> >>>>>>                           Matrix 101 101 9462372 0.
>>> >>>>>>
>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>> >>>>>> Part of the memory usage: Vector 125 124 681672 0.
>>> >>>>>>                           Matrix 101 101 1462180 0.
>>> >>>>>>
>>> >>>>>> In theory, the per-process memory usage in Test1 should
>>> be 8 times that of Test2. In my case, it is about 6 times.
>>> >>>>>>
>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh:
>>> 96*8*24. Sub-domain per process: 32*32*32
>>> >>>>>> Here I get the out-of-memory error.
>>> >>>>>>
>>> >>>>>> I tried to use -mg_coarse jacobi. In this
>>> way, I don't need to set -mg_coarse_ksp_type and
>>> -mg_coarse_pc_type explicitly, right?
>>> >>>>>> The linear solver didn't work in this case;
>>> PETSc output some errors.
>>> >>>>>>
>>> >>>>>> @Dave
>>> >>>>>> In test3, I use only one instance of
>>> 'Telescope'. On the coarse mesh of 'Telescope', I
>>> used LU as the preconditioner instead of SVD.
>>> >>>>>> If I set the levels correctly, then on the
>>> last coarse mesh of MG, where it calls 'Telescope',
>>> the sub-domain per process is 2*2*2.
>>> >>>>>> On the last coarse mesh of 'Telescope', there
>>> is only one grid point per process.
>>> >>>>>> I still got the OOM error. The detailed PETSc
>>> options file is attached.
>>> >>>>>>
>>> >>>>>> Do you understand the expected memory usage
>>> for the particular parallel LU implementation you
>>> are using? I don't (seriously). Replace LU with
>>> bjacobi and re-run this test. My point about solver
>>> debugging is still valid.
>>> >>>>>>
>>> >>>>>> And please send the result of KSPView so we
>>> can see what is actually used in the computations
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Dave
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi Barry,
>>> >>>>>>
>>> >>>>>> Thank you for your advice.
>>> >>>>>> I tried three tests. In the 1st test, the grid
>>> is 3072*256*768 and the process mesh is 96*8*24.
>>> >>>>>> The linear solver is 'cg', the preconditioner
>>> is 'mg', and 'telescope' is used as the
>>> preconditioner on the coarse mesh.
>>> >>>>>> The system gives me the "Out of Memory" error
>>> before the linear system is completely solved.
>>> >>>>>> The info from '-ksp_view_pre' is attached. It
>>> seems to me that the error occurs when it reaches
>>> the coarse mesh.
>>> >>>>>>
>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and
>>> a process mesh of 96*8*24. The 3rd test uses the same
>>> grid but a different process mesh, 48*4*12.
>>> >>>>>> Are you sure this is right? The total
>>> matrix and vector memory usage goes from the 2nd test
>>> >>>>>> Vector 384 383 8,193,712 0.
>>> >>>>>> Matrix 103 103 11,508,688 0.
>>> >>>>>> to the 3rd test
>>> >>>>>> Vector 384 383 1,590,520 0.
>>> >>>>>> Matrix 103 103 3,508,664 0.
>>> >>>>>> That is, the memory usage got smaller, but if
>>> you have only 1/8th the processes and the same grid
>>> it should have gotten about 8 times bigger. Did you
>>> maybe cut the grid by a factor of 8 also? If so, that
>>> still doesn't explain it, because the memory usage
>>> changed by a factor of 5-something for the vectors
>>> and 3-something for the matrices.
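>>> >>>>>> (Explicitly: 8,193,712 / 1,590,520 ~ 5.2 for the
>>> vectors and 11,508,688 / 3,508,664 ~ 3.3 for the matrices.)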
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> The linear solver and PETSc options in the 2nd
>>> and 3rd tests are the same as in the 1st test. The linear
>>> solver works fine in both tests.
>>> >>>>>> I attached the memory usage of the 2nd and
>>> 3rd tests. The memory info is from the option
>>> '-log_summary'. I tried to use '-memory_info' as you
>>> suggested, but in my case PETSc treated it as an
>>> unused option. It output nothing about the memory.
>>> Do I need to add something to my code so I can use
>>> '-memory_info'?
>>> >>>>>> Sorry, my mistake, the option is -memory_view.
>>> >>>>>>
>>> >>>>>> Can you run the one case with -memory_view
>>> and -mg_coarse jacobi -ksp_max_it 1 (just so it
>>> doesn't iterate forever) to see how much memory is
>>> used without the telescope? Also run case 2 the same
>>> way.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> In both tests the memory usage is not large.
>>> >>>>>>
>>> >>>>>> It seems to me that it might be the
>>> 'telescope' preconditioner that allocated a lot of
>>> memory and caused the error in the 1st test.
>>> >>>>>> Is there a way to show how much memory it
>>> allocated?
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>> >>>>>> Frank,
>>> >>>>>>
>>> >>>>>> You can run with -ksp_view_pre to have
>>> it "view" the KSP before the solve so hopefully it
>>> gets that far.
>>> >>>>>>
>>> >>>>>> Please run the problem that does fit
>>> with -memory_info; when the problem completes it will
>>> show the "high water mark" for PETSc-allocated
>>> memory and total memory used. We first want to look
>>> at these numbers to see if it is using more memory
>>> than you expect. You could also run with, say, half
>>> the grid spacing to see how the memory usage scales
>>> with the increase in grid points. Make the runs also
>>> with -log_view and send all the output from these
>>> options.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> I am using the CG KSP solver and a multigrid
>>> preconditioner to solve a linear system in parallel.
>>> >>>>>> I chose 'Telescope' as the
>>> preconditioner on the coarse mesh for its good
>>> performance.
>>> >>>>>> The PETSc options file is attached.
>>> >>>>>>
>>> >>>>>> The domain is a 3D box.
>>> >>>>>> It works well when the grid is 1536*128*384
>>> and the process mesh is 96*8*24. When I double the
>>> size of the grid and keep
>>> the same process mesh and PETSc options, I get an
>>> "out of memory" error from the supercomputer I am using.
>>> >>>>>> Each process has access to at least 8 GB of
>>> memory, which should be more than enough for my
>>> application. I am sure that all the other parts of
>>> my code (except the linear solver) do not use much
>>> memory, so I suspect there is something wrong with
>>> the linear solver.
>>> >>>>>> The error occurs before the linear system is
>>> completely solved, so I don't have the info from
>>> ksp_view. I am not able to reproduce the error with a
>>> smaller problem either.
>>> >>>>>> In addition, I tried to use block Jacobi
>>> as the preconditioner with the same grid and the same
>>> decomposition. The linear solver runs extremely slowly,
>>> but there is no memory error.
>>> >>>>>>
>>> >>>>>> How can I diagnose what exactly causes the error?
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>> <petsc_options.txt>
>>> >>>>>>
>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>> >
>>>
>>
>
>
-------------- next part --------------
KSP Object: 32768 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 32768 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=5 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 32768 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 32768 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 32768 , subcomm_size = 512
Telescope: subcomm type: interlaced
Telescope: DMDA detected
DMDA Object: (mg_coarse_telescope_repart_) 512 MPI processes
M 64 N 64 P 64 m 8 n 8 p 8 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 512 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 512 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
type: redundant
Redundant preconditioner: First (color=0) of 512 PCs follows
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following KSP and PC objects:
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
package used to perform factorization: petsc
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=2097152, cols=2097152
total: nonzeros=56623104, allocated nonzeros=56623104
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=16777216, cols=16777216
total: nonzeros=452984832, allocated nonzeros=452984832
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=3623878656, allocated nonzeros=3623878656
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4 -------------------------------
KSP Object: (mg_levels_4_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_4_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
32768 processors, by hengjie Fri Sep 16 04:29:10 2016
Using Petsc Development GIT revision: v3.7.3-1056-geeb1ceb GIT Date: 2016-08-02 10:00:58 -0500
Max Max/Min Avg Total
Time (sec): 3.595e+01 1.00092 3.595e+01
Objects: 4.240e+02 1.61217 2.655e+02
Flops: 7.348e+07 1.09866 6.699e+07 2.195e+12
Flops/sec: 2.044e+06 1.09875 1.863e+06 6.106e+10
Memory: 1.110e+09 1.00000 3.636e+13
MPI Messages: 5.004e+04 11.27696 4.668e+03 1.530e+08
MPI Message Lengths: 4.805e+06 1.27794 8.088e+02 1.237e+11
MPI Reductions: 2.296e+03 1.48994
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 3.5947e+01 100.0% 2.1951e+12 100.0% 1.530e+08 100.0% 8.088e+02 100.0% 1.551e+03 67.5%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecTDot 30 1.0 1.9905e-01 1.4 1.97e+06 1.0 0.0e+00 0.0e+00 6.0e+01 1 3 0 0 3 1 3 0 0 4 323650
VecNorm 16 1.0 3.9425e-01 3.5 1.05e+06 1.0 0.0e+00 0.0e+00 3.2e+01 1 2 0 0 1 1 2 0 0 2 87152
VecScale 75 1.7 2.3286e-02 2.0 4.52e+04 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 50363
VecCopy 17 1.0 3.8621e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 442 1.7 9.8095e-03 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 60 1.0 3.5868e-02 1.3 3.93e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 6 0 0 0 0 6 0 0 0 3592294
VecAYPX 119 1.3 1.7319e-02 1.3 1.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3728684
VecAssemblyBegin 1 1.0 1.0757e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 1 1.0 2.7490e-04 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 471 1.5 5.8588e-02 3.4 0.00e+00 0.0 1.2e+08 8.1e+02 0.0e+00 0 0 81 81 0 0 0 81 81 0 0
VecScatterEnd 471 1.5 1.2934e+00 6.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMult 135 1.3 2.8880e-01 1.4 2.33e+07 1.0 5.0e+07 1.7e+03 0.0e+00 1 34 32 66 0 1 34 32 66 0 2597254
MatMultAdd 90 1.5 1.1149e-01 2.9 3.85e+06 1.0 1.4e+07 3.2e+02 0.0e+00 0 6 9 4 0 0 6 9 4 0 1114404
MatMultTranspose 111 1.4 3.0435e-01 1.3 4.11e+06 1.0 1.7e+07 2.8e+02 8.0e+01 1 6 11 4 3 1 6 11 4 5 435479
MatSolve 15 0.0 2.0206e-02 0.0 3.26e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 82513
MatSOR 180 1.5 9.9816e-01 1.3 2.32e+07 1.0 3.9e+07 2.4e+02 1.2e+00 2 33 25 8 0 2 33 25 8 0 727846
MatLUFactorNum 1 0.0 2.4225e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 33762
MatILUFactorSym 1 0.0 2.5048e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatConvert 1 0.0 7.5793e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatResidual 90 1.5 3.7126e-01 1.2 1.11e+07 1.0 4.2e+07 8.0e+02 6.0e+01 1 16 27 27 3 1 16 27 27 4 942007
MatAssemblyBegin 33 1.4 7.2762e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 4.8e+01 2 0 0 0 2 2 0 0 0 3 0
MatAssemblyEnd 33 1.4 1.4643e+00 1.1 0.00e+00 0.0 1.1e+07 1.2e+02 2.5e+02 4 0 7 1 11 4 0 7 1 16 0
MatGetRowIJ 1 0.0 1.5974e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetSubMatrice 2 2.0 3.4627e-01 3.7 0.00e+00 0.0 1.6e+05 5.4e+02 6.1e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 0.0 1.9929e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatView 13 2.2 1.0639e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 1 0 0 0 0 1 0
MatPtAP 7 1.4 5.2281e+00 1.0 5.25e+06 1.0 2.4e+07 8.8e+02 2.1e+02 14 8 15 17 9 14 8 15 17 14 31939
MatPtAPSymbolic 7 1.4 4.0818e+00 1.0 0.00e+00 0.0 1.4e+07 1.1e+03 7.5e+01 11 0 9 12 3 11 0 9 12 5 0
MatPtAPNumeric 7 1.4 1.1755e+00 1.0 5.25e+06 1.0 9.6e+06 5.7e+02 1.4e+02 3 8 6 4 6 3 8 6 4 9 142046
MatRedundantMat 1 0.0 1.3647e-02 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.8e-02 0 0 0 0 0 0 0 0 0 0 0
MatMPIConcateSeq 1 0.0 2.7197e-01 0.0 0.00e+00 0.0 2.7e+04 4.0e+01 6.1e-01 0 0 0 0 0 0 0 0 0 0 0
MatGetLocalMat 7 1.4 1.3259e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 7 1.4 6.9566e-02 2.8 0.00e+00 0.0 1.1e+07 1.1e+03 0.0e+00 0 0 7 10 0 0 0 7 10 0 0
MatGetSymTrans 14 1.4 2.2139e-02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMCoarsen 6 1.5 3.3237e-01 1.1 0.00e+00 0.0 1.6e+06 1.7e+02 2.1e+02 1 0 1 0 9 1 0 1 0 13 0
DMCreateInterp 6 1.5 7.6958e-01 1.1 2.57e+05 1.0 2.8e+06 1.6e+02 2.0e+02 2 0 2 0 9 2 0 2 0 13 10763
KSPSetUp 12 2.0 1.1138e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.5e+01 0 0 0 0 2 0 0 0 0 2 0
KSPSolve 1 1.0 1.2628e+01 1.0 7.35e+07 1.1 1.5e+08 8.0e+02 1.4e+03 35100 99 99 59 35100 99 99 87 173826
PCSetUp 3 3.0 9.2140e+00 1.1 7.10e+06 1.3 2.9e+07 7.4e+02 7.9e+02 23 8 19 18 34 23 8 19 18 51 19110
PCSetUpOnBlocks 15 0.0 2.8822e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 28377
PCApply 15 1.0 3.5384e+00 1.0 5.58e+07 1.1 1.2e+08 6.3e+02 3.7e+02 10 74 79 62 16 10 74 79 62 24 457052
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 197 197 4396000 0.
Vector Scatter 27 27 333392 0.
Matrix 66 66 14132608 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 8 8 40832 0.
Star Forest Bipartite Graph 16 16 13568 0.
Discrete System 8 8 7008 0.
Index Set 60 60 341672 0.
IS L to G Mapping 8 8 195776 0.
Krylov Solver 12 12 14760 0.
DMKSP interface 6 6 3888 0.
Preconditioner 12 12 11928 0.
Viewer 3 2 1664 0.
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 0.000146198
Average time for zero size MPI_Send(): 3.66852e-06