[petsc-users] Performance of the Telescope Multigrid Preconditioner
frank
hengjiew at uci.edu
Tue Oct 4 14:09:27 CDT 2016
Hi,
On 10/04/2016 11:24 AM, Matthew Knepley wrote:
> On Tue, Oct 4, 2016 at 1:13 PM, frank <hengjiew at uci.edu
> <mailto:hengjiew at uci.edu>> wrote:
>
> Hi,
>
> This question is follow-up of the thread "Question about memory
> usage in Multigrid preconditioner".
> I used to get the "Out of Memory" (OOM) error when using the
> CG+Telescope MG solver with 32768 cores. Adding the options
> "-matrap 0 -matptap_scalable" did solve that problem.
>
> Then I tested the scalability by solving a 3D Poisson equation for
> one step. I used one sub-communicator in all the tests. The
> differences between the petsc options in those tests are: (1) the
> pc_telescope_reduction_factor; (2) the number of multigrid levels in
> the up/down solver. The function "ksp_solve" is timed. It is rather
> slow and doesn't scale at all.
>
>
> 1) The number of levels cannot be different in the up/down smoothers.
> Why are you using a / ?
I didn't mean the "up/down smoothers". I meant "-pc_mg_levels" and
"-mg_coarse_telescope_pc_mg_levels".
>
> 2) We need to see what solver you actually constructed, so give us the
> output of -ksp_view
>
> 3) For any performance questions, we need the output of -log_view
I attached the log_view output for all eight runs.
Each file is named by the grid size and the number of cores. E.g.,
log_512_4096.txt is the log_view output from the case using 512^3 grid
points and 4096 cores.
I attached only two ksp_view outputs, in case too many files become
messy. The ksp_view output for the other tests is quite similar; the
only difference is the number of MG levels.
>
> 4) It looks like you are fixing the number of levels as you scale up.
> This makes the coarse problem much bigger, and is not a scalable way
> to proceed.
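> (Concretely: with a fixed number of levels L, the coarsest grid of the
> outer MG has (N/2^(L-1))^3 points no matter how many cores are used.
> The attached ksp_view outputs show this: with 5 outer levels, the 512^3
> case hands telescope a 32^3 coarse problem, while the 1024^3 case hands
> it a 64^3 one, i.e. 8 times more coarse-grid work.)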
> Have you looked at the ratio of coarse grid time to level time?
How can I find the ratio?
>
> 5) Did you look at the options in this paper:
> https://arxiv.org/abs/1604.07163
I am going to look at it now.
Thank you.
Frank
>
> Thanks,
>
> Matt
>
> Test1: 512^3 grid points
> Core#    telescope_reduction_factor    MG levels (up/down)    Time for KSPSolve (s)
> 512      8                             4 / 3                  6.2466
> 4096     64                            5 / 3                  0.9361
> 32768    64                            4 / 3                  4.8914
>
> Test2: 1024^3 grid points
> Core#    telescope_reduction_factor    MG levels (up/down)    Time for KSPSolve (s)
> 4096     64                            5 / 4                  3.4139
> 8192     128                           5 / 4                  2.4196
> 16384    32                            5 / 3                  5.4150
> 32768    64                            5 / 3                  5.6067
> 65536    128                           5 / 3                  6.5219
>
> I guess I didn't set the MG levels properly. What would be an
> efficient way to arrange the MG levels?
> Also, which preconditioner should I use at the coarse mesh of the 2nd
> communicator to improve the performance?
>
> I attached the test code and the petsc options file for the 1024^3
> cube with 32768 cores.
>
> Thank you.
>
> Regards,
> Frank
>
>
>
>
>
>
> On 09/15/2016 03:35 AM, Dave May wrote:
>> Hi all,
>>
>> The only unexpected memory usage I can see is associated with
>> the call to MatPtAP().
>> Here is something you can try immediately.
>> Run your code with the additional options
>> -matrap 0 -matptap_scalable
>>
>> I didn't realize this before, but the default behaviour of
>> MatPtAP in parallel is actually to explicitly form the
>> transpose of P (i.e. assemble R = P^T) and then compute R.A.P.
>> You don't want to do this. The option -matrap 0 resolves this issue.
>>
>> The implementation of P^T.A.P has two variants.
>> The scalable implementation (with respect to memory usage) is
>> selected via the second option -matptap_scalable.
>>
>> Try it out - I see a significant memory reduction using these
>> options for particular mesh sizes / partitions.
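>>
>> For example, something like this (the launcher, process count and
>> executable name are placeholders for your actual job):
>> mpiexec -n <nprocs> ./your_app -matrap 0 -matptap_scalable <other options>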
>>
>> I've attached a cleaned up version of the code you sent me.
>> There were a number of memory leaks and other issues.
>> The main points being
>> * You should call DMDAVecGetArrayF90() before
>> VecAssembly{Begin,End}
>> * You should call PetscFinalize(), otherwise the option
>> -log_summary (-log_view) will not display anything once the
>> program has completed.
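>>
>> In sketch form (this is not your actual code; da, x and xa are
>> placeholder names for the DMDA, the Vec and the array):
>>
>>   call DMDAVecGetArrayF90(da, x, xa, ierr)
>>   ! ... fill xa(i,j,k) here ...
>>   call DMDAVecRestoreArrayF90(da, x, xa, ierr)
>>   call VecAssemblyBegin(x, ierr)
>>   call VecAssemblyEnd(x, ierr)
>>   ! ... setup and solve ...
>>   call PetscFinalize(ierr) ! without this, -log_summary (-log_view)
>>                            ! prints nothing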
>>
>>
>> Thanks,
>> Dave
>>
>>
>> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew at uci.edu
>> <mailto:hengjiew at uci.edu>> wrote:
>>
>> Hi Dave,
>>
>> Sorry, I should have put more comments in to explain the code.
>> The number of processes in each dimension is the same: Px = Py = Pz
>> = P. So is the domain size.
>> So if you want to run the code for 512^3 grid points on 16^3 cores,
>> you need to set "-N 512 -P 16" on the command line.
>> I added more comments and also fixed an error in the attached code.
>> (The error only affects the accuracy of the solution, not the memory
>> usage.)
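>>
>> For example, a 512^3 grid on 16^3 = 4096 cores would be launched with
>> something like (the launcher is illustrative):
>> mpiexec -n 4096 ./test_ksp.exe -N 512 -P 16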
>>
>> Thank you.
>> Frank
>>
>>
>> On 9/14/2016 9:05 PM, Dave May wrote:
>>>
>>>
>>> On Thursday, 15 September 2016, Dave May
>>> <dave.mayhem23 at gmail.com <mailto:dave.mayhem23 at gmail.com>>
>>> wrote:
>>>
>>>
>>>
>>> On Thursday, 15 September 2016, frank <hengjiew at uci.edu>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I wrote a simple code to reproduce the error. I hope this can
>>> help to diagnose the problem.
>>> The code just solves a 3D Poisson equation.
>>>
>>>
>>> Why is the stencil width a runtime parameter?? And why
>>> is the default value 2? For a 7-point FD Laplacian, you only
>>> need a stencil width of 1.
>>>
>>> Was this choice made to mimic something in the
>>> real application code?
>>>
>>>
>>> Please ignore - I misunderstood your usage of the param set
>>> by -P
>>>
>>>
>>> I ran the code on a 1024^3 mesh. The process partition is
>>> 32 * 32 * 32. That's when I reproduced the OOM error. Each core
>>> has about 2G memory.
>>> I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes.
>>> The ksp solver works fine.
>>> I attached the code, ksp_view_pre's output and my
>>> petsc option file.
>>>
>>> Thank you.
>>> Frank
>>>
>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>> Hi Barry,
>>>>
>>>> I checked. On the supercomputer, I had the option
>>>> "-ksp_view_pre" but it is not in the file I sent you. I
>>>> am sorry for the confusion.
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>> On Friday, September 9, 2016, Barry Smith
>>>> <bsmith at mcs.anl.gov> wrote:
>>>>
>>>>
>>>> > On Sep 9, 2016, at 3:11 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >
>>>> > Hi Barry,
>>>> >
>>>> > I think the first KSP view output is from -ksp_view_pre.
>>>> > Before I submitted the test, I was not sure whether there
>>>> > would be an OOM error or not. So I added both -ksp_view_pre
>>>> > and -ksp_view.
>>>>
>>>> But the options file you sent specifically
>>>> does NOT list the -ksp_view_pre so how could it
>>>> be from that?
>>>>
>>>> Sorry to be pedantic but I've spent too much
>>>> time in the past trying to debug from incorrect
>>>> information and want to make sure that the
>>>> information I have is correct before thinking.
>>>> Please recheck exactly what happened. Rerun
>>>> with the exact input file you emailed if that
>>>> is needed.
>>>>
>>>> Barry
>>>>
>>>> >
>>>> > Frank
>>>> >
>>>> >
>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>> >> Why does ksp_view2.txt have two KSP views in it while
>>>> >> ksp_view1.txt has only one KSPView in it? Did you run two
>>>> >> different solves in the 2nd case but not the 1st?
>>>> >>
>>>> >> Barry
>>>> >>
>>>> >>
>>>> >>
>>>> >>> On Sep 9, 2016, at 10:56 AM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I want to continue digging into the memory problem here.
>>>> >>> I did find a workaround in the past, which is to use fewer
>>>> >>> cores per node so that each core has 8G memory. However this
>>>> >>> is inefficient and expensive. I hope to locate the place that
>>>> >>> uses the most memory.
>>>> >>>
>>>> >>> Here is a brief summary of the tests I did in the past:
>>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>> >>> Maximum (over computational time) process memory: total 7.0727e+08
>>>> >>> Current process memory: total 7.0727e+08
>>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>> >>> Current space PetscMalloc()ed: total 1.8275e+09
>>>> >>>
>>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>> >>> Maximum (over computational time) process memory: total 5.9431e+09
>>>> >>> Current process memory: total 5.9431e+09
>>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>> >>> Current space PetscMalloc()ed: total 5.4844e+09
>>>> >>>
>>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>> >>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>>> >>>
>>>> >>> I attached the output of ksp_view (the third test's output is
>>>> >>> from ksp_view_pre), memory_view, and also the petsc options.
>>>> >>>
>>>> >>> In all the tests, each core can access about 2G memory. In
>>>> >>> test3, there are 4223139840 non-zeros in the matrix. This will
>>>> >>> consume about 1.74M per process, using double precision.
>>>> >>> Considering some extra memory used to store integer indices,
>>>> >>> 2G memory should still be more than enough.
>>>> >>>
>>>> >>> Is there a way to find out which part of
>>>> KSPSolve uses the most memory?
>>>> >>> Thank you so much.
>>>> >>>
>>>> >>> BTW, there are 4 options that remain unused and I don't
>>>> >>> understand why they are omitted:
>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>> >>>
>>>> >>>
>>>> >>> Regards,
>>>> >>> Frank
>>>> >>>
>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>> >>>>
>>>> >>>> On 14 July 2016 at 01:07, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>> Hi Dave,
>>>> >>>>
>>>> >>>> Sorry for the late reply.
>>>> >>>> Thank you so much for your detailed reply.
>>>> >>>>
>>>> >>>> I have a question about the estimation of the memory usage.
>>>> >>>> There are 4223139840 allocated non-zeros and 18432 MPI
>>>> >>>> processes. Double precision is used. So the memory per
>>>> >>>> process is:
>>>> >>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
>>>> >>>> Did I do something wrong here? Because this seems too small.
>>>> >>>>
>>>> >>>> No - I totally f***ed it up. You are
>>>> correct. That'll teach me for fumbling around
>>>> with my iphone calculator and not using my
>>>> brain. (Note that to convert to MB just divide
>>>> by 1e6, not 1024^2 - although I apparently
>>>> cannot convert between units correctly....)
>>>> >>>>
>>>> >>>> From the PETSc objects associated with the solver, it looks
>>>> >>>> like it _should_ run with 2GB per MPI rank. Sorry for my
>>>> >>>> mistake. Possibilities are: somewhere in your usage of PETSc
>>>> >>>> you've introduced a memory leak; PETSc is doing a huge
>>>> >>>> over-allocation (e.g. as per our discussion of MatPtAP); or
>>>> >>>> in your application code there are other objects you have
>>>> >>>> forgotten to log the memory for.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> I am running this job on Blue Waters.
>>>> >>>> I am using the 7-point FD stencil in 3D.
>>>> >>>>
>>>> >>>> I thought so on both counts.
>>>> >>>>
>>>> >>>> I apologize that I made a stupid mistake in computing the
>>>> >>>> memory per core. My settings meant each core could access
>>>> >>>> only 2G memory on average, instead of the 8G which I
>>>> >>>> mentioned in the previous email. I re-ran the job with 8G
>>>> >>>> memory per core on average and there is no "Out Of Memory"
>>>> >>>> error. I will do more tests to see if there is still a
>>>> >>>> memory issue.
>>>> >>>>
>>>> >>>> Ok. I'd still like to know where the
>>>> memory was being used since my estimates were off.
>>>> >>>>
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Dave
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Frank
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>> >>>>> Hi Frank,
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 11 July 2016 at 19:14, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>> Hi Dave,
>>>> >>>>>
>>>> >>>>> I re-ran the test using bjacobi as the preconditioner on
>>>> >>>>> the coarse mesh of telescope. The grid is 3072*256*768 and
>>>> >>>>> the process mesh is 96*8*24. The petsc option file is
>>>> >>>>> attached.
>>>> >>>>> I still got the "Out Of Memory" error. The error occurred
>>>> >>>>> before the linear solver finished one step, so I don't have
>>>> >>>>> the full info from ksp_view. The info from ksp_view_pre is
>>>> >>>>> attached.
>>>> >>>>>
>>>> >>>>> Okay - that is essentially useless (sorry)
>>>> >>>>>
>>>> >>>>> It seems to me that the error occurred
>>>> when the decomposition was going to be changed.
>>>> >>>>>
>>>> >>>>> Based on what information?
>>>> >>>>> Running with -info would give us more
>>>> clues, but will create a ton of output.
>>>> >>>>> Please try running the case which failed
>>>> with -info
>>>> >>>>> I had another test with a grid of
>>>> 1536*128*384 and the same process mesh as
>>>> above. There was no error. The ksp_view info is
>>>> attached for comparison.
>>>> >>>>> Thank you.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> [3] Here is my crude estimate of your
>>>> memory usage.
>>>> >>>>> I'll target the biggest memory hogs only
>>>> to get an order of magnitude estimate
>>>> >>>>>
>>>> >>>>> * The Fine grid operator contains
>>>> 4223139840 non-zeros --> 1.8 GB per MPI rank
>>>> assuming double precision.
>>>> >>>>> The indices for the AIJ could amount to
>>>> another 0.3 GB (assuming 32 bit integers)
>>>> >>>>>
>>>> >>>>> * You use 5 levels of coarsening, so the
>>>> other operators should represent (collectively)
>>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~
>>>> 300 MB per MPI rank on the communicator with
>>>> 18432 ranks.
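>>>> >>>>> (Checking that sum: 2100/8 + 2100/64 + 2100/512 + 2100/4096
>>>> >>>>> = 262.5 + 32.8 + 4.1 + 0.5 MB, i.e. ~ 300 MB.)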
>>>> >>>>> The coarse grid should consume ~ 0.5 MB
>>>> per MPI rank on the communicator with 18432 ranks.
>>>> >>>>>
>>>> >>>>> * You use a reduction factor of 64,
>>>> making the new communicator with 288 MPI ranks.
>>>> >>>>> PCTelescope will first gather a temporary
>>>> matrix associated with your coarse level
>>>> operator assuming a comm size of 288 living on
>>>> the comm with size 18432.
>>>> >>>>> This matrix will require approximately
>>>> 0.5 * 64 = 32 MB per core on the 288 ranks.
>>>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the
>>>> >>>>> subcomm, thus requiring another 32 MB per rank.
>>>> >>>>> The temporary matrix is now destroyed.
>>>> >>>>>
>>>> >>>>> * Because a DMDA is detected, a
>>>> permutation matrix is assembled.
>>>> >>>>> This requires 2 doubles per point in the
>>>> DMDA.
>>>> >>>>> Your coarse DMDA contains 192 x 16 x 48 points.
>>>> >>>>> Thus the permutation matrix will require
>>>> < 1 MB per MPI rank on the sub-comm.
>>>> >>>>>
>>>> >>>>> * Lastly, the matrix is permuted. This
>>>> uses MatPtAP(), but the resulting operator will
>>>> have the same memory footprint as the
>>>> unpermuted matrix (32 MB). At any stage in
>>>> PCTelescope, only 2 operators of size 32 MB are
>>>> held in memory when the DMDA is provided.
>>>> >>>>>
>>>> >>>>> From my rough estimates, the worst case
>>>> memory foot print for any given core, given
>>>> your options is approximately
>>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB
>>>> = 2465 MB
>>>> >>>>> This is way below 8 GB.
>>>> >>>>>
>>>> >>>>> Note this estimate completely ignores:
>>>> >>>>> (1) the memory required for the
>>>> restriction operator,
>>>> >>>>> (2) the potential growth in the number of non-zeros per row
>>>> >>>>> due to Galerkin coarsening (I wish -ksp_view_pre reported
>>>> >>>>> the output from MatView so we could see the number of
>>>> >>>>> non-zeros required by the coarse level operators)
>>>> >>>>> (3) all temporary vectors required by the
>>>> CG solver, and those required by the smoothers.
>>>> >>>>> (4) internal memory allocated by MatPtAP
>>>> >>>>> (5) memory associated with IS's used
>>>> within PCTelescope
>>>> >>>>>
>>>> >>>>> So either I am completely off in my estimates, or you have
>>>> >>>>> not carefully estimated the memory usage of your application
>>>> >>>>> code. Hopefully others might examine/correct my rough
>>>> >>>>> estimates.
>>>> >>>>>
>>>> >>>>> Since I don't have your code I cannot assess the latter.
>>>> >>>>> Since I don't have access to the same
>>>> machine you are running on, I think we need to
>>>> take a step back.
>>>> >>>>>
>>>> >>>>> [1] What machine are you running on? Send me a URL if it's
>>>> >>>>> available.
>>>> >>>>>
>>>> >>>>> [2] What discretization are you using? (I
>>>> am guessing a scalar 7 point FD stencil)
>>>> >>>>> If it's a 7 point FD stencil, we should
>>>> be able to examine the memory usage of your
>>>> solver configuration using a standard, light
>>>> weight existing PETSc example, run on your
>>>> machine at the same scale.
>>>> >>>>> This would hopefully enable us to
>>>> correctly evaluate the actual memory usage
>>>> required by the solver configuration you are using.
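>>>> >>>>> (For instance - I believe the 3D FD Laplacian tutorial,
>>>> >>>>> src/ksp/ksp/examples/tutorials/ex45.c, is such an example,
>>>> >>>>> but please double-check the name. Run it at the same scale
>>>> >>>>> with your exact solver options plus -memory_view and
>>>> >>>>> -log_view, and compare its memory numbers against your
>>>> >>>>> application's.)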
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Dave
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Frank
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>> >>>>>>
>>>> >>>>>> On Saturday, 9 July 2016, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>> Hi Barry and Dave,
>>>> >>>>>>
>>>> >>>>>> Thank both of you for the advice.
>>>> >>>>>>
>>>> >>>>>> @Barry
>>>> >>>>>> I made a mistake in the file names in the last email. I
>>>> >>>>>> attached the correct files this time.
>>>> >>>>>> For all three tests, 'Telescope' is used as the coarse
>>>> >>>>>> preconditioner.
>>>> >>>>>>
>>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>>> >>>>>> Part of the memory usage: Vector 125 124 3971904 0.
>>>> >>>>>>                           Matrix 101 101 9462372 0.
>>>> >>>>>>
>>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>>> >>>>>> Part of the memory usage: Vector 125 124 681672 0.
>>>> >>>>>>                           Matrix 101 101 1462180 0.
>>>> >>>>>>
>>>> >>>>>> In theory, the memory usage in Test1 should be 8 times that
>>>> >>>>>> of Test2. In my case, it is about 6 times.
>>>> >>>>>>
>>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24.
>>>> >>>>>> Sub-domain per process: 32*32*32
>>>> >>>>>> Here I get the out of memory error.
>>>> >>>>>>
>>>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need
>>>> >>>>>> to set -mg_coarse_ksp_type and -mg_coarse_pc_type
>>>> >>>>>> explicitly, right?
>>>> >>>>>> The linear solver didn't work in this case. PETSc output
>>>> >>>>>> some errors.
>>>> >>>>>>
>>>> >>>>>> @Dave
>>>> >>>>>> In test3, I use only one instance of 'Telescope'. On the
>>>> >>>>>> coarse mesh of 'Telescope', I used LU as the preconditioner
>>>> >>>>>> instead of SVD.
>>>> >>>>>> If I set the levels correctly, then on the last coarse mesh
>>>> >>>>>> of MG where it calls 'Telescope', the sub-domain per process
>>>> >>>>>> is 2*2*2.
>>>> >>>>>> On the last coarse mesh of 'Telescope', there is only one
>>>> >>>>>> grid point per process.
>>>> >>>>>> I still got the OOM error. The detailed petsc option file
>>>> >>>>>> is attached.
>>>> >>>>>>
>>>> >>>>>> Do you understand the expected memory
>>>> usage for the particular parallel LU
>>>> implementation you are using? I don't
>>>> (seriously). Replace LU with bjacobi and re-run
>>>> this test. My point about solver debugging is
>>>> still valid.
>>>> >>>>>>
>>>> >>>>>> And please send the result of KSPView so we can see what is
>>>> >>>>>> actually used in the computations.
>>>> >>>>>>
>>>> >>>>>> Thanks
>>>> >>>>>> Dave
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi Barry,
>>>> >>>>>>
>>>> >>>>>> Thank you for you advice.
>>>> >>>>>> I tried three tests. In the 1st test, the grid is
>>>> >>>>>> 3072*256*768 and the process mesh is 96*8*24.
>>>> >>>>>> The linear solver is 'cg', the preconditioner is 'mg' and
>>>> >>>>>> 'telescope' is used as the preconditioner at the coarse mesh.
>>>> >>>>>> The system gives me the "Out of Memory" error before the
>>>> >>>>>> linear system is completely solved.
>>>> >>>>>> The info from '-ksp_view_pre' is attached. It seems to me
>>>> >>>>>> that the error occurs when it reaches the coarse mesh.
>>>> >>>>>>
>>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and the process
>>>> >>>>>> mesh is 96*8*24. The 3rd test uses the same grid but a
>>>> >>>>>> different process mesh, 48*4*12.
>>>> >>>>>> Are you sure this is right? The total matrix and vector
>>>> >>>>>> memory usage goes from the 2nd test
>>>> >>>>>> Vector 384 383 8,193,712 0.
>>>> >>>>>> Matrix 103 103 11,508,688 0.
>>>> >>>>>> to the 3rd test
>>>> >>>>>> Vector 384 383 1,590,520 0.
>>>> >>>>>> Matrix 103 103 3,508,664 0.
>>>> >>>>>> That is, the memory usage got smaller, but if you have only
>>>> >>>>>> 1/8th the processes and the same grid it should have gotten
>>>> >>>>>> about 8 times bigger. Did you maybe cut the grid by a factor
>>>> >>>>>> of 8 also? If so, that still doesn't explain it, because the
>>>> >>>>>> memory usage changed by a factor of 5 something for the
>>>> >>>>>> vectors and 3 something for the matrices.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> The linear solver and petsc options in the 2nd and 3rd
>>>> >>>>>> tests are the same as in the 1st test. The linear solver
>>>> >>>>>> works fine in both tests.
>>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The
>>>> >>>>>> memory info is from the option '-log_summary'. I tried to
>>>> >>>>>> use '-memory_info' as you suggested, but in my case petsc
>>>> >>>>>> treated it as an unused option. It output nothing about the
>>>> >>>>>> memory. Do I need to add something to my code so I can use
>>>> >>>>>> '-memory_info'?
>>>> >>>>>> Sorry, my mistake: the option is -memory_view.
>>>> >>>>>>
>>>> >>>>>> Can you run the one case with
>>>> -memory_view and -mg_coarse jacobi -ksp_max_it
>>>> 1 (just so it doesn't iterate forever) to see
>>>> how much memory is used without the telescope?
>>>> Also run case 2 the same way.
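>>>> >>>>>> (Spelled out as options, that would be something like
>>>> >>>>>> -memory_view -mg_coarse_pc_type jacobi -ksp_max_it 1
>>>> >>>>>> added to the existing options file.)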
>>>> >>>>>>
>>>> >>>>>> Barry
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> In both tests the memory usage is not large.
>>>> >>>>>>
>>>> >>>>>> It seems to me that it might be the 'telescope'
>>>> >>>>>> preconditioner that allocated a lot of memory and caused the
>>>> >>>>>> error in the 1st test.
>>>> >>>>>> Is there a way to show how much memory it allocated?
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>>
>>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>> >>>>>> Frank,
>>>> >>>>>>
>>>> >>>>>> You can run with -ksp_view_pre to
>>>> have it "view" the KSP before the solve so
>>>> hopefully it gets that far.
>>>> >>>>>>
>>>> >>>>>> Please run the problem that does fit with -memory_info;
>>>> >>>>>> when the problem completes it will show the "high water
>>>> >>>>>> mark" for PETSc allocated memory and total memory used. We
>>>> >>>>>> first want to look at these numbers to see if it is using
>>>> >>>>>> more memory than you expect. You could also run with, say,
>>>> >>>>>> half the grid spacing to see how the memory usage scales
>>>> >>>>>> with the increase in grid points. Make the runs also with
>>>> >>>>>> -log_view and send all the output from these options.
>>>> >>>>>>
>>>> >>>>>> Barry
>>>> >>>>>>
>>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> I am using the CG ksp solver and a multigrid preconditioner
>>>> >>>>>> to solve a linear system in parallel.
>>>> >>>>>> I chose to use 'Telescope' as the preconditioner on the
>>>> >>>>>> coarse mesh for its good performance.
>>>> >>>>>> The petsc options file is attached.
>>>> >>>>>>
>>>> >>>>>> The domain is a 3d box.
>>>> >>>>>> It works well when the grid is 1536*128*384 and the process
>>>> >>>>>> mesh is 96*8*24. When I double the size of the grid and keep
>>>> >>>>>> the same process mesh and petsc options, I get an "out of
>>>> >>>>>> memory" error from the super-cluster I am using.
>>>> >>>>>> Each process has access to at least 8G memory, which should
>>>> >>>>>> be more than enough for my application. I am sure that all
>>>> >>>>>> the other parts of my code (except the linear solver) do not
>>>> >>>>>> use much memory. So I suspect there is something wrong with
>>>> >>>>>> the linear solver.
>>>> >>>>>> The error occurs before the linear system is completely
>>>> >>>>>> solved, so I don't have the info from ksp view. I am not
>>>> >>>>>> able to reproduce the error with a smaller problem either.
>>>> >>>>>> In addition, I tried to use block jacobi as the
>>>> >>>>>> preconditioner with the same grid and the same
>>>> >>>>>> decomposition. The linear solver runs extremely slowly but
>>>> >>>>>> there is no memory error.
>>>> >>>>>>
>>>> >>>>>> How can I diagnose what exactly causes the error?
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>> <petsc_options.txt>
>>>> >>>>>>
>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>>> >
>>>>
>>>
>>
>>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
-------------- next part --------------
Linear solve converged due to CONVERGED_RTOL iterations 7
KSP Object: 4096 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 4096 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=5 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 4096 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 4096 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 4096 , subcomm_size = 64
Telescope: DMDA detected
DMDA Object: (repart_) 64 MPI processes
M 32 N 32 P 32 m 4 n 4 p 4 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 64 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: redundant
Redundant preconditioner: First (color=0) of 64 PCs follows
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: lu
LU: out-of-place factorization
tolerance for zero pivot 2.22045e-14
using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
matrix ordering: nd
factor fill ratio given 5., needed 8.69575
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
package used to perform factorization: petsc
total: nonzeros=120210, allocated nonzeros=120210
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=2097152, cols=2097152
total: nonzeros=56623104, allocated nonzeros=56623104
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=16777216, cols=16777216
total: nonzeros=452984832, allocated nonzeros=452984832
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4 -------------------------------
KSP Object: (mg_levels_4_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_4_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=939524096, allocated nonzeros=939524096
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=939524096, allocated nonzeros=939524096
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
Linear solve converged due to CONVERGED_RTOL iterations 8
KSP Object: 8192 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 8192 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=5 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 8192 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 8192 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 128
Telescope: comm_size = 8192 , subcomm_size = 64
Telescope: DMDA detected
DMDA Object: (repart_) 64 MPI processes
M 64 N 64 P 64 m 4 n 4 p 4 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 64 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=4 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: redundant
Redundant preconditioner: First (color=0) of 64 PCs follows
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_3_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_3_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: lu
LU: out-of-place factorization
tolerance for zero pivot 2.22045e-14
using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
matrix ordering: nd
factor fill ratio given 5., needed 8.69575
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
package used to perform factorization: petsc
total: nonzeros=120210, allocated nonzeros=120210
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 16 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 8192 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 8192 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=2097152, cols=2097152
total: nonzeros=56623104, allocated nonzeros=56623104
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 8192 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 8192 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=16777216, cols=16777216
total: nonzeros=452984832, allocated nonzeros=452984832
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 8192 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 8192 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=3623878656, allocated nonzeros=3623878656
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4 -------------------------------
KSP Object: (mg_levels_4_) 8192 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_4_) 8192 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 8192 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
Linear solve converged due to CONVERGED_RTOL iterations 7
1 step time: 6.2466299533843994
norm1 error: 1.2135791829058829E-005
norm inf error: 1.0512737852365958E-002
Summary of Memory Usage in PETSc
Maximum (over computational time) process memory: total 8.0407e+07 max 1.9696e+05 min 1.5078e+05
Current process memory: total 8.0407e+07 max 1.9696e+05 min 1.5078e+05
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./test_ksp.exe on a gnu-opt named . with 512 processors, by wang11 Tue Oct 4 05:04:05 2016
Using Petsc Development GIT revision: v3.6.3-2059-geab7831 GIT Date: 2016-01-20 10:58:35 -0600
Max Max/Min Avg Total
Time (sec): 7.128e+00 1.00215 7.121e+00
Objects: 3.330e+02 1.72539 2.105e+02
Flops: 2.508e+09 9.15893 5.530e+08 2.832e+11
Flops/sec: 3.521e+08 9.16346 7.765e+07 3.976e+10
MPI Messages: 3.918e+03 2.07713 2.157e+03 1.104e+06
MPI Message Lengths: 1.003e+07 1.17554 4.064e+03 4.488e+09
MPI Reductions: 4.310e+02 1.60223
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 7.1208e+00 100.0% 2.8316e+11 100.0% 1.104e+06 100.0% 4.064e+03 100.0% 2.882e+02 66.9%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSidedF 1 1.0 2.5056e-0217.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecTDot 14 1.0 6.0542e-02 1.6 7.34e+06 1.0 0.0e+00 0.0e+00 1.4e+01 1 1 0 0 3 1 1 0 0 5 62074
VecNorm 8 1.0 3.5572e-02 3.1 4.19e+06 1.0 0.0e+00 0.0e+00 8.0e+00 0 1 0 0 2 0 1 0 0 3 60370
VecScale 28 2.0 2.1243e-04 1.8 7.35e+04 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 144250
VecCopy 9 1.0 3.8947e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 193 1.8 1.6343e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 28 1.0 1.0030e-01 1.1 1.47e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 3 0 0 0 1 3 0 0 0 74940
VecAYPX 48 1.4 6.3155e-02 1.6 7.11e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 57380
VecAssemblyBegin 1 1.0 2.5080e-0217.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 1 1.0 2.2888e-0512.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 194 1.6 3.9131e-02 1.6 0.00e+00 0.0 7.2e+05 4.1e+03 0.0e+00 0 0 65 65 0 0 0 65 65 0 0
VecScatterEnd 194 1.6 3.4133e+0068.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 42 0 0 0 0 42 0 0 0 0 0
MatMult 56 1.3 5.0448e-01 1.2 8.70e+07 1.0 2.9e+05 8.2e+03 0.0e+00 6 15 26 53 0 6 15 26 53 0 86737
MatMultAdd 35 1.7 8.0332e-02 1.2 1.43e+07 1.0 8.2e+04 1.5e+03 0.0e+00 1 3 7 3 0 1 3 7 3 0 90220
MatMultTranspose 47 1.5 1.1686e-01 1.4 1.64e+07 1.0 1.1e+05 1.4e+03 0.0e+00 1 3 10 3 0 1 3 10 3 0 70913
MatSolve 7 0.0 5.4884e-02 0.0 4.38e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 51106
MatSOR 70 1.7 7.4662e-01 1.1 8.85e+07 1.0 2.1e+05 1.2e+03 1.8e+00 10 15 19 5 0 10 15 19 5 1 58271
MatLUFactorSym 1 0.0 1.3002e-01 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 1 0.0 3.0343e+00 0.0 2.18e+09 0.0 0.0e+00 0.0e+00 0.0e+00 5 49 0 0 0 5 49 0 0 0 46035
MatConvert 1 0.0 1.4801e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatResidual 35 1.7 2.5246e-01 1.3 4.14e+07 1.0 2.3e+05 4.1e+03 0.0e+00 3 7 21 21 0 3 7 21 21 0 80802
MatAssemblyBegin 29 1.5 6.2687e-02 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01 1 0 0 0 5 1 0 0 0 7 0
MatAssemblyEnd 29 1.5 2.8406e-01 1.0 0.00e+00 0.0 1.5e+05 5.4e+02 7.7e+01 4 0 14 2 18 4 0 14 2 27 0
MatGetRowIJ 1 0.0 1.1208e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetSubMatrice 2 2.0 4.1284e-02 9.3 0.00e+00 0.0 2.2e+03 3.4e+04 3.5e+00 0 0 0 2 1 0 0 0 2 1 0
MatGetOrdering 1 0.0 7.9041e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatPtAP 6 1.5 1.0306e+00 1.0 4.18e+07 1.0 3.1e+05 4.4e+03 7.2e+01 14 7 28 30 17 14 7 28 30 25 20208
MatPtAPSymbolic 6 1.5 4.9107e-01 1.0 0.00e+00 0.0 1.8e+05 5.3e+03 3.0e+01 7 0 16 21 7 7 0 16 21 10 0
MatPtAPNumeric 6 1.5 5.3958e-01 1.0 4.18e+07 1.0 1.3e+05 3.0e+03 4.2e+01 7 7 11 9 10 7 7 11 9 15 38597
MatRedundantMat 1 0.0 2.7650e-02 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e-01 0 0 0 0 0 0 0 0 0 0 0
MatMPIConcateSeq 1 0.0 1.6951e-02 0.0 0.00e+00 0.0 3.3e+03 1.4e+02 1.9e+00 0 0 0 0 0 0 0 0 0 1 0
MatGetLocalMat 6 1.5 4.7763e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
MatGetBrAoCol 6 1.5 4.1229e-02 1.2 0.00e+00 0.0 1.4e+05 5.5e+03 0.0e+00 1 0 13 17 0 1 0 13 17 0 0
MatGetSymTrans 12 1.5 1.4412e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMCoarsen 5 1.7 8.8470e-03 1.4 0.00e+00 0.0 2.0e+04 8.4e+02 3.6e+01 0 0 2 0 8 0 0 2 0 12 0
DMCreateInterpolation 5 1.7 2.1848e-01 1.0 2.05e+06 1.0 3.5e+04 7.5e+02 5.2e+01 3 0 3 1 12 3 0 3 1 18 4739
KSPSetUp 10 2.0 1.9465e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 3 0 0 0 0 4 0
KSPSolve 1 1.0 6.2467e+00 1.0 2.51e+09 9.2 1.1e+06 4.0e+03 2.6e+02 88100 99 98 60 88100 99 98 90 45330
PCSetUp 2 2.0 4.5211e+00 3.6 2.23e+0952.3 3.8e+05 3.8e+03 2.1e+02 23 57 35 33 48 23 57 35 33 72 35732
PCApply 7 1.0 4.6845e+00 1.0 2.42e+0913.0 7.2e+05 3.1e+03 3.0e+01 66 84 65 50 7 66 84 65 50 11 50783
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 133 133 29053936 0.
Vector Scatter 24 24 2464384 0.
Matrix 58 58 118369764 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 7 7 34944 0.
Star Forest Bipartite Graph 14 14 11872 0.
Discrete System 7 7 5992 0.
Index Set 54 54 1628276 0.
IS L to G Mapping 7 7 1367088 0.
Krylov Solver 11 11 13640 0.
DMKSP interface 5 5 3240 0.
Preconditioner 11 11 11008 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
Average time for MPI_Barrier(): 1.87874e-05
Average time for zero size MPI_Send(): 1.10432e-05
#PETSc Option Table entries:
-ksp_converged_reason
-ksp_initial_guess_nonzero yes
-ksp_norm_type unpreconditioned
-ksp_rtol 1e-7
-ksp_type cg
-log_view
-matptap_scalable
-matrap 0
-memory_view
-mg_coarse_ksp_type preonly
-mg_coarse_pc_telescope_reduction_factor 8
-mg_coarse_pc_type telescope
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type redundant
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_pc_type mg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_type richardson
-N 512
-options_left 1
-pc_mg_galerkin
-pc_mg_levels 4
-pc_type mg
-ppe_max_iter 20
-px 8
-py 8
-pz 8
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --known-level1-dcache-size=16384 --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=4 --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1 --known-mpi-c-double-complex=1 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-has-attribute-aligned=1 --with-batch="1 " --known-mpi-shared="0 " --known-mpi-shared-libraries=0 --known-memcmp-ok --with-blas-lapack-lib=/opt/acml/5.3.1/gfortran64/lib/libacml.a --COPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --FOPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --CXXOPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --with-x="0 " --with-debugging="0 " --with-clib-autodetect="0 " --with-cxxlib-autodetect="0 " --with-fortranlib-autodetect="0 " --with-shared-libraries="0 " --with-mpi-compilers="1 " --with-cc="cc " --with-cxx="CC " --with-fc="ftn " --download-hypre="1 " --download-blacs="1 " --download-scalapack="1 " --download-superlu_dist="1 " --download-metis="1 " --download-parmetis="1 " PETSC_ARCH=gnu-opt
-----------------------------------------
Libraries compiled on Tue Feb 16 12:57:46 2016 on h2ologin3
Machine characteristics: Linux-3.0.101-0.46-default-x86_64-with-SuSE-11-x86_64
Using PETSc directory: /mnt/a/u/sciteam/wang11/Sftw/petsc
Using PETSc arch: gnu-opt
-----------------------------------------
Using C compiler: cc -march=bdver1 -O3 -ffast-math -fPIC ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: ftn -march=bdver1 -O3 -ffast-math -fPIC ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/include
-----------------------------------------
Using C linker: cc
Using Fortran linker: ftn
Using libraries: -Wl,-rpath,/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -L/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -lpetsc -Wl,-rpath,/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -L/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -lsuperlu_dist_4.3 -lHYPRE -lscalapack -Wl,-rpath,/opt/acml/5.3.1/gfortran64/lib -L/opt/acml/5.3.1/gfortran64/lib -lacml -lparmetis -lmetis -lssl -lcrypto -ldl
-----------------------------------------
#PETSc Option Table entries:
-ksp_converged_reason
-ksp_initial_guess_nonzero yes
-ksp_norm_type unpreconditioned
-ksp_rtol 1e-7
-ksp_type cg
-log_view
-matptap_scalable
-matrap 0
-memory_view
-mg_coarse_ksp_type preonly
-mg_coarse_pc_telescope_reduction_factor 8
-mg_coarse_pc_type telescope
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type redundant
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_pc_type mg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_type richardson
-N 512
-options_left 1
-pc_mg_galerkin
-pc_mg_levels 4
-pc_type mg
-ppe_max_iter 20
-px 8
-py 8
-pz 8
#End of PETSc Option Table entries
There is one unused database option. It is:
Option left: name:-ppe_max_iter value: 20
Application 48712763 resources: utime ~3749s, stime ~789s, Rss ~196960, inblocks ~781565, outblocks ~505751
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log_512_4096.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0006.txt>
-------------- next part --------------
Linear solve converged due to CONVERGED_RTOL iterations 7
1 step time: 4.8914160728454590
norm1 error: 8.6827845637092041E-008
norm inf error: 4.1127664509280201E-003
Summary of Memory Usage in PETSc
Maximum (over computational time) process memory: total 1.9679e+09 max 1.1249e+05 min 4.1456e+04
Current process memory: total 1.9679e+09 max 1.1249e+05 min 4.1456e+04
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./test_ksp.exe on a gnu-opt named . with 32768 processors, by wang11 Tue Oct 4 03:50:16 2016
Using Petsc Development GIT revision: v3.6.3-2059-geab7831 GIT Date: 2016-01-20 10:58:35 -0600
Max Max/Min Avg Total
Time (sec): 5.221e+00 1.00192 5.215e+00
Objects: 3.330e+02 1.72539 1.952e+02
Flops: 2.232e+09 531.65406 3.900e+07 1.278e+12
Flops/sec: 4.277e+08 531.89802 7.473e+06 2.449e+11
MPI Messages: 8.594e+03 4.55579 2.011e+03 6.589e+07
MPI Message Lengths: 1.078e+06 1.95814 2.782e+02 1.833e+10
MPI Reductions: 4.310e+02 1.60223
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 5.2149e+00 100.0% 1.2779e+12 100.0% 6.589e+07 100.0% 2.782e+02 100.0% 2.705e+02 62.8%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSidedF 1 1.0 6.2082e-02 6.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecTDot 14 1.0 1.5901e-02 2.1 1.15e+05 1.0 0.0e+00 0.0e+00 1.4e+01 0 0 0 0 3 0 0 0 0 5 236313
VecNorm 8 1.0 8.2795e-02 99.5 6.55e+04 1.0 0.0e+00 0.0e+00 8.0e+00 1 0 0 0 2 1 0 0 0 3 25937
VecScale 28 2.0 4.6015e-04 17.9 8.96e+03 2.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 324014
VecCopy 9 1.0 2.4486e-04 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 193 1.8 5.3072e-04 4.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 28 1.0 6.1011e-04 2.5 2.29e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 12319342
VecAYPX 48 1.4 4.3058e-04 2.8 1.15e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 8416119
VecAssemblyBegin 1 1.0 6.2096e-02 6.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAssemblyEnd 1 1.0 6.3896e-05 67.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 194 1.6 2.2339e-02 8.0 0.00e+00 0.0 4.3e+07 2.8e+02 0.0e+00 0 0 65 66 0 0 0 65 66 0 0
VecScatterEnd 194 1.6 3.7815e+00 39.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 71 0 0 0 0 71 0 0 0 0 0
MatMult 56 1.3 7.7610e-02 7.5 1.55e+06 1.2 1.7e+07 5.6e+02 0.0e+00 0 3 26 53 0 0 3 26 53 0 563808
MatMultAdd 35 1.7 1.1928e-02 9.2 2.48e+05 1.1 4.9e+06 1.1e+02 0.0e+00 0 1 7 3 0 0 1 7 3 0 607627
MatMultTranspose 47 1.5 2.6726e-02 13.3 2.84e+05 1.1 6.5e+06 9.9e+01 0.0e+00 0 1 10 3 0 0 1 10 3 0 310054
MatSolve 7 0.0 5.5102e-02 0.0 4.38e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 407368
MatSOR 70 1.7 2.0535e-02 3.7 1.70e+06 1.4 1.2e+07 9.8e+01 2.2e-01 0 3 18 7 0 0 3 18 7 0 1976428
MatLUFactorSym 1 0.0 1.4304e-01 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 1 0.0 3.0453e+00 0.0 2.18e+09 0.0 0.0e+00 0.0e+00 0.0e+00 1 87 0 0 0 1 87 0 0 0 366959
MatConvert 1 0.0 1.3890e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatResidual 35 1.7 7.3063e-02 11.3 8.37e+05 1.4 1.3e+07 3.0e+02 0.0e+00 0 2 20 22 0 0 2 20 22 0 279200
MatAssemblyBegin 29 1.5 1.1239e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01 2 0 0 0 5 2 0 0 0 7 0
MatAssemblyEnd 29 1.5 3.6328e-01 1.1 0.00e+00 0.0 8.9e+06 4.1e+01 7.3e+01 6 0 14 2 17 6 0 14 2 27 0
MatGetRowIJ 1 0.0 1.1570e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetSubMatrice 2 2.0 1.0665e-01 4.9 0.00e+00 0.0 1.6e+05 5.4e+02 3.1e+00 1 0 0 0 1 1 0 0 0 1 0
MatGetOrdering 1 0.0 8.1892e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatPtAP 6 1.5 4.1852e-01 1.0 7.98e+05 1.2 1.9e+07 3.0e+02 6.9e+01 8 2 28 30 16 8 2 28 30 25 50373
MatPtAPSymbolic 6 1.5 2.2612e-01 1.0 0.00e+00 0.0 1.1e+07 3.7e+02 2.8e+01 4 0 16 22 7 4 0 16 22 10 0
MatPtAPNumeric 6 1.5 1.9413e-01 1.0 7.98e+05 1.2 7.7e+06 2.0e+02 4.0e+01 4 2 12 8 9 4 2 12 8 15 108597
MatRedundantMat 1 0.0 2.9847e-02 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.2e-02 0 0 0 0 0 0 0 0 0 0 0
MatMPIConcateSeq 1 0.0 7.8937e-02 0.0 0.00e+00 0.0 2.7e+04 4.0e+01 2.3e-01 0 0 0 0 0 0 0 0 0 0 0
MatGetLocalMat 6 1.5 7.7701e-04 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 6 1.5 1.9681e-02 3.1 0.00e+00 0.0 8.3e+06 3.9e+02 0.0e+00 0 0 13 18 0 0 0 13 18 0 0
MatGetSymTrans 12 1.5 2.0599e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMCoarsen 5 1.7 9.4588e-02 1.0 0.00e+00 0.0 1.2e+06 5.8e+01 3.3e+01 2 0 2 0 8 2 0 2 0 12 0
DMCreateInterpolation 5 1.7 2.1863e-01 1.0 3.54e+04 1.1 2.1e+06 5.8e+01 4.8e+01 4 0 3 1 11 4 0 3 1 18 4736
KSPSetUp 10 2.0 2.9837e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 1 0 0 0 3 1 0 0 0 4 0
KSPSolve 1 1.0 4.8916e+00 1.0 2.23e+09 531.7 6.5e+07 2.8e+02 2.4e+02 94 100 99 98 56 94 100 99 98 89 261253
PCSetUp 2 2.0 4.6506e+00 4.8 2.18e+09 3247.5 2.3e+07 2.5e+02 1.9e+02 20 89 35 32 44 20 89 35 32 71 245045
PCApply 7 1.0 3.7972e+00 1.0 2.23e+09 794.1 4.2e+07 2.2e+02 1.6e+01 73 96 63 51 4 73 96 63 51 6 324561
------------------------------------------------------------------------------------------------------------------------
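On the question of coarse-grid time versus level time, a rough reading of the table above (times are max over processes, so columns do not sum exactly): KSPSolve takes 4.89 s in total, and setup (PCSetUp, 4.65 s max with a max/min ratio of 4.8) is dominated by the redundant coarse factorization, MatLUFactorNum at 3.05 s and 2.18e9 flops, i.e. about 87% of all flops in the run. By contrast, the level work is cheap: MatSOR 0.02 s, MatMult 0.08 s. VecScatterEnd at 3.78 s (max/min ratio near 40) suggests most processes spend the solve waiting, consistent with the coarse problem being handled by a small subset of ranks. In short, the coarse solve, not the smoothing, appears to dominate this run.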
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 133 133 850544 0.
Vector Scatter 24 24 68032 0.
Matrix 58 58 42186948 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 7 7 34944 0.
Star Forest Bipartite Graph 14 14 11872 0.
Discrete System 7 7 5992 0.
Index Set 54 54 152244 0.
IS L to G Mapping 7 7 37936 0.
Krylov Solver 11 11 13640 0.
DMKSP interface 5 5 3240 0.
Preconditioner 11 11 11008 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
Average time for MPI_Barrier(): 6.00338e-05
Average time for zero size MPI_Send(): 1.25148e-05
#PETSc Option Table entries:
-ksp_converged_reason
-ksp_initial_guess_nonzero yes
-ksp_norm_type unpreconditioned
-ksp_rtol 1e-7
-ksp_type cg
-log_view
-matptap_scalable
-matrap 0
-memory_view
-mg_coarse_ksp_type preonly
-mg_coarse_pc_telescope_reduction_factor 64
-mg_coarse_pc_type telescope
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type redundant
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_pc_type mg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_type richardson
-N 512
-options_left 1
-pc_mg_galerkin
-pc_mg_levels 4
-pc_type mg
-ppe_max_iter 20
-px 32
-py 32
-pz 32
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --known-level1-dcache-size=16384 --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=4 --known-sizeof-char=1 --known-sizeof-void-p=8 --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8 --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-memcmp-ok=1 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --known-mpi-int64_t=1 --known-mpi-c-double-complex=1 --known-sdot-returns-double=0 --known-snrm2-returns-double=0 --known-has-attribute-aligned=1 --with-batch="1 " --known-mpi-shared="0 " --known-mpi-shared-libraries=0 --known-memcmp-ok --with-blas-lapack-lib=/opt/acml/5.3.1/gfortran64/lib/libacml.a --COPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --FOPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --CXXOPTFLAGS="-march=bdver1 -O3 -ffast-math -fPIC " --with-x="0 " --with-debugging="0 " --with-clib-autodetect="0 " --with-cxxlib-autodetect="0 " --with-fortranlib-autodetect="0 " --with-shared-libraries="0 " --with-mpi-compilers="1 " --with-cc="cc " --with-cxx="CC " --with-fc="ftn " --download-hypre="1 " --download-blacs="1 " --download-scalapack="1 " --download-superlu_dist="1 " --download-metis="1 " --download-parmetis="1 " PETSC_ARCH=gnu-opt
-----------------------------------------
Libraries compiled on Tue Feb 16 12:57:46 2016 on h2ologin3
Machine characteristics: Linux-3.0.101-0.46-default-x86_64-with-SuSE-11-x86_64
Using PETSc directory: /mnt/a/u/sciteam/wang11/Sftw/petsc
Using PETSc arch: gnu-opt
-----------------------------------------
Using C compiler: cc -march=bdver1 -O3 -ffast-math -fPIC ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: ftn -march=bdver1 -O3 -ffast-math -fPIC ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/include -I/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/include
-----------------------------------------
Using C linker: cc
Using Fortran linker: ftn
Using libraries: -Wl,-rpath,/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -L/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -lpetsc -Wl,-rpath,/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -L/mnt/a/u/sciteam/wang11/Sftw/petsc/gnu-opt/lib -lsuperlu_dist_4.3 -lHYPRE -lscalapack -Wl,-rpath,/opt/acml/5.3.1/gfortran64/lib -L/opt/acml/5.3.1/gfortran64/lib -lacml -lparmetis -lmetis -lssl -lcrypto -ldl
-----------------------------------------
#PETSc Option Table entries:
-ksp_converged_reason
-ksp_initial_guess_nonzero yes
-ksp_norm_type unpreconditioned
-ksp_rtol 1e-7
-ksp_type cg
-log_view
-matptap_scalable
-matrap 0
-memory_view
-mg_coarse_ksp_type preonly
-mg_coarse_pc_telescope_reduction_factor 64
-mg_coarse_pc_type telescope
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type redundant
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_pc_type mg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_type richardson
-N 512
-options_left 1
-pc_mg_galerkin
-pc_mg_levels 4
-pc_type mg
-ppe_max_iter 20
-px 32
-py 32
-pz 32
#End of PETSc Option Table entries
There is one unused database option. It is:
Option left: name:-ppe_max_iter value: 20
Application 48712514 resources: utime ~274648s, stime ~36467s, Rss ~112492, inblocks ~29956998, outblocks ~32114238
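For completeness, a run like the one above could presumably be reproduced with something like the following (a sketch only: aprun is the Cray/ALPS launcher implied by the "Application ... resources" lines, petsc_options.txt is a hypothetical file holding the option table above, including -N 512 and -px/-py/-pz 32, and -options_file is the standard PETSc mechanism for reading such a file):

    aprun -n 32768 ./test_ksp.exe -options_file petsc_options.txt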
-------------- next part --------------
Attachment: log_1024_4096.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0007.txt>
Attachment: log_1024_8192.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0008.txt>
Attachment: log_1024_16384.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0009.txt>
Attachment: log_1024_32768.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0010.txt>
Attachment: log_1024_65536.txt
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161004/197a8ded/attachment-0011.txt>