[petsc-users] Question about memory usage in Multigrid preconditioner
Hengjie Wang
hengjiew at uci.edu
Fri Sep 16 12:53:26 CDT 2016
Hi Dave,
I added both options and tested them by solving the Poisson equation on a
1024^3 grid with 32^3 cores. This test used to give the OOM error; now it
runs well.
I have attached the ksp_view and log_view output in case you want to take a look.
I also tested my original code with those PETSc options by simulating
decaying turbulence on a 1024^3 grid. It works as well. I am going to test
the code at a larger scale; if any problem shows up then, I will let you
know.
This really helps a lot. Thank you so much.
Regards,
Frank
On 9/15/2016 3:35 AM, Dave May wrote:
> Hi all,
>
> The only unexpected memory usage I can see is associated with the
> call to MatPtAP().
> Here is something you can try immediately.
> Run your code with the additional options
> -matrap 0 -matptap_scalable
>
> I didn't realize this before, but the default behaviour of MatPtAP in
> parallel is actually to explicitly form the transpose of P (i.e.
> assemble R = P^T) and then compute R.A.P.
> You don't want to do this. The option -matrap 0 resolves this issue.
>
> The implementation of P^T.A.P has two variants.
> The scalable implementation (with respect to memory usage) is selected
> via the second option -matptap_scalable.
>
> Try it out - I see a significant memory reduction using these options
> for particular mesh sizes / partitions.
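>
> For reference, here is a minimal, self-contained sketch (PETSc C API,
> v3.7-era signatures) of the triple product in question; it is roughly
> what PCMG does internally when Galerkin coarse operators are requested.
> The 17^3 grid size is an arbitrary small example:
>
>   #include <petscdmda.h>
>
>   int main(int argc, char **argv)
>   {
>     DM             daf, dac;
>     Mat            A, P, Ac;
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>     ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
>                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 17, 17, 17,
>                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
>                         1, 1, NULL, NULL, NULL, &daf);CHKERRQ(ierr);
>     ierr = DMCoarsen(daf, MPI_COMM_NULL, &dac);CHKERRQ(ierr);
>     ierr = DMCreateInterpolation(dac, daf, &P, NULL);CHKERRQ(ierr);
>     /* fine-level operator; the 7-point stencil values are omitted here */
>     ierr = DMCreateMatrix(daf, &A);CHKERRQ(ierr);
>     /* the Galerkin product Ac = P^T A P; with -matrap 0 the transpose
>        R = P^T is never assembled explicitly, and -matptap_scalable
>        selects the memory-scalable variant of this product */
>     ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, &Ac);CHKERRQ(ierr);
>     ierr = MatDestroy(&Ac);CHKERRQ(ierr);
>     ierr = MatDestroy(&P);CHKERRQ(ierr);
>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>     ierr = DMDestroy(&dac);CHKERRQ(ierr);
>     ierr = DMDestroy(&daf);CHKERRQ(ierr);
>     ierr = PetscFinalize();
>     return 0;
>   }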
>
> I've attached a cleaned-up version of the code you sent me.
> There were a number of memory leaks and other issues.
> The main points being (see the sketch below):
> * You should call DMDAVecRestoreArrayF90() before VecAssembly{Begin,End}()
> * You should call PetscFinalize(), otherwise the option -log_summary
> (-log_view) will not display anything once the program has completed.
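>
> For concreteness, a minimal sketch of that call sequence, written with
> the C names (a Fortran code would use the ...F90 variants); the 64^3
> grid and the variable names are placeholders:
>
>   #include <petscdmda.h>
>
>   int main(int argc, char **argv)
>   {
>     DM             da;
>     Vec            x;
>     PetscScalar ***arr;
>     PetscInt       i, j, k, xs, ys, zs, xm, ym, zm;
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>     ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
>                         DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 64, 64, 64,
>                         PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
>                         1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
>     ierr = DMCreateGlobalVector(da, &x);CHKERRQ(ierr);
>     ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);CHKERRQ(ierr);
>     ierr = DMDAVecGetArray(da, x, &arr);CHKERRQ(ierr);
>     for (k = zs; k < zs + zm; k++)
>       for (j = ys; j < ys + ym; j++)
>         for (i = xs; i < xs + xm; i++) arr[k][j][i] = 1.0;
>     /* restore the array BEFORE any assembly or other use of the vector */
>     ierr = DMDAVecRestoreArray(da, x, &arr);CHKERRQ(ierr);
>     ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
>     ierr = VecAssemblyEnd(x);CHKERRQ(ierr);
>     /* ... set up and run the solve ... */
>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>     ierr = DMDestroy(&da);CHKERRQ(ierr);
>     /* without PetscFinalize(), -log_summary/-log_view print nothing */
>     ierr = PetscFinalize();
>     return 0;
>   }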
>
>
> Thanks,
> Dave
>
>
> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew at uci.edu
> <mailto:hengjiew at uci.edu>> wrote:
>
> Hi Dave,
>
> Sorry, I should have put more comments in to explain the code.
> The number of processes in each dimension is the same: Px = Py = Pz = P.
> So is the domain size.
> So if you want to run the code on a 512^3 grid with
> 16^3 cores, you need to set "-N 512 -P 16" on the command line.
> I added more comments and also fixed an error in the attached code.
> (The error only affects the accuracy of the solution, not the memory
> usage.)
>
> Thank you.
> Frank
>
>
> On 9/14/2016 9:05 PM, Dave May wrote:
>>
>>
>> On Thursday, 15 September 2016, Dave May <dave.mayhem23 at gmail.com
>> <mailto:dave.mayhem23 at gmail.com>> wrote:
>>
>>
>>
>> On Thursday, 15 September 2016, frank <hengjiew at uci.edu> wrote:
>>
>> Hi,
>>
>> I wrote a simple code to reproduce the error. I hope
>> this can help to diagnose the problem.
>> The code just solves a 3D Poisson equation.
>>
>>
>> Why is the stencil width a runtime parameter? And why is the
>> default value 2? For the 7-point FD Laplacian, you only need
>> a stencil width of 1.
>>
>> Was this choice made to mimic something in the
>> real application code?
>>
>>
>> Please ignore - I misunderstood your usage of the param set by -P
>>
>>
>> I ran the code on a 1024^3 mesh. The process partition is
>> 32 * 32 * 32. That is when I reproduce the OOM error.
>> Each core has about 2 GB of memory.
>> I also ran the code on a 512^3 mesh with 16 * 16 * 16
>> processes, and the KSP solver works fine.
>> I attached the code, the ksp_view_pre output, and my PETSc
>> options file.
>>
>> Thank you.
>> Frank
>>
>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>> Hi Barry,
>>>
>>> I checked. On the supercomputer, I had the option
>>> "-ksp_view_pre" but it is not in the file I sent you. I am
>>> sorry for the confusion.
>>>
>>> Regards,
>>> Frank
>>>
>>> On Friday, September 9, 2016, Barry Smith
>>> <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>> > On Sep 9, 2016, at 3:11 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >
>>> > Hi Barry,
>>> >
>>> > I think the first KSP view output is from
>>> -ksp_view_pre. Before I submitted the test, I was
>>> not sure whether there would be an OOM error or not, so
>>> I added both -ksp_view_pre and -ksp_view.
>>>
>>> But the options file you sent specifically does
>>> NOT list the -ksp_view_pre so how could it be from that?
>>>
>>> Sorry to be pedantic but I've spent too much time
>>> in the past trying to debug from incorrect
>>> information and want to make sure that the
>>> information I have is correct before thinking.
>>> Please recheck exactly what happened. Rerun with the
>>> exact input file you emailed if that is needed.
>>>
>>> Barry
>>>
>>> >
>>> > Frank
>>> >
>>> >
>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>> >> Why does ksp_view2.txt have two KSP views in it
>>> while ksp_view1.txt has only one KSPView in it? Did
>>> you run two different solves in the second case but not
>>> in the first?
>>> >>
>>> >> Barry
>>> >>
>>> >>
>>> >>
>>> >>> On Sep 9, 2016, at 10:56 AM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I want to continue digging into the memory
>>> problem here.
>>> >>> I did find a workaround in the past, which is
>>> to use fewer cores per node so that each core has 8 GB of
>>> memory. However, this is inefficient and expensive. I
>>> hope to locate the place that uses the most memory.
>>> >>>
>>> >>> Here is a brief summary of the tests I did in the past:
>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>> >>> Maximum (over computational time) process memory: total 7.0727e+08
>>> >>> Current process memory: total 7.0727e+08
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>> >>> Current space PetscMalloc()ed: total 1.8275e+09
>>> >>>
>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>> >>> Maximum (over computational time) process memory: total 5.9431e+09
>>> >>> Current process memory: total 5.9431e+09
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>> >>> Current space PetscMalloc()ed: total 5.4844e+09
>>> >>>
>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>> >>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>> >>>
>>> >>> I attached the output of ksp_view (the third
>>> test's output is from ksp_view_pre), memory_view,
>>> and also the PETSc options.
>>> >>>
>>> >>> In all the tests, each core can access about 2 GB of
>>> memory. In test3, there are 4223139840 non-zeros in
>>> the matrix. This will consume about 1.74 MB per process,
>>> using double precision. Considering some extra memory used
>>> to store the integer indices, 2 GB of memory should still be
>>> more than enough.
>>> >>>
>>> >>> Is there a way to find out which part of
>>> KSPSolve uses the most memory?
>>> >>> Thank you so much.
>>> >>>
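>>> >>> One way to get at this programmatically is a sketch along these
>>> >>> lines, using PETSc's memory-query routines; bracketing individual
>>> >>> phases (PCSetUp, each KSPSolve) with the Get calls narrows down
>>> >>> which part allocates the most:
>>> >>>
>>> >>>   #include <petscsys.h>
>>> >>>
>>> >>>   int main(int argc, char **argv)
>>> >>>   {
>>> >>>     PetscLogDouble rss, mal;
>>> >>>     PetscErrorCode ierr;
>>> >>>
>>> >>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
>>> >>>     /* must be called early so the maxima are tracked */
>>> >>>     ierr = PetscMemorySetGetMaximumUsage();CHKERRQ(ierr);
>>> >>>     /* ... the phase of interest (e.g. KSPSolve) runs here ... */
>>> >>>     ierr = PetscMemoryGetMaximumUsage(&rss);CHKERRQ(ierr); /* peak resident size */
>>> >>>     ierr = PetscMallocGetMaximumUsage(&mal);CHKERRQ(ierr); /* peak PetscMalloc()ed */
>>> >>>     ierr = PetscPrintf(PETSC_COMM_WORLD,
>>> >>>                        "max process memory %g bytes, max PetscMalloc %g bytes\n",
>>> >>>                        rss, mal);CHKERRQ(ierr);
>>> >>>     ierr = PetscFinalize();
>>> >>>     return 0;
>>> >>>   }
>>> >>>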
>>> >>> BTW, there are 4 options that remain unused and I
>>> don't understand why they are ignored:
>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value:
>>> preonly
>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value:
>>> bjacobi
>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value:
>>> richardson
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>> Frank
>>> >>>
>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>> >>>>
>>> >>>> On 14 July 2016 at 01:07, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>> Hi Dave,
>>> >>>>
>>> >>>> Sorry for the late reply.
>>> >>>> Thank you so much for your detailed reply.
>>> >>>>
>>> >>>> I have a question about the estimation of the
>>> memory usage. There are 4223139840 allocated
>>> non-zeros and 18432 MPI processes. Double precision
>>> is used. So the memory per process is:
>>> >>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
>>> >>>> Did I do something wrong here? Because this seems too
>>> small.
>>> >>>>
>>> >>>> No - I totally f***ed it up. You are correct.
>>> That'll teach me for fumbling around with my iphone
>>> calculator and not using my brain. (Note that to
>>> convert to MB just divide by 1e6, not 1024^2 -
>>> although I apparently cannot convert between units
>>> correctly....)
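>>> >>>>
>>> >>>> Writing that conversion out in both unit conventions:
>>> >>>> 4223139840 * 8 / 18432 = 1832960 bytes per rank,
>>> >>>> i.e. 1832960 / 1e6 ~ 1.83 MB, or 1832960 / 1024^2 ~ 1.75 MiB.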
>>> >>>>
>>> >>>> From the PETSc objects associated with the
>>> solver, it looks like it _should_ run with 2 GB per
>>> MPI rank. Sorry for my mistake. Possibilities are:
>>> somewhere in your usage of PETSc you've introduced a
>>> memory leak; PETSc is doing a huge over allocation
>>> (e.g. as per our discussion of MatPtAP); or in your
>>> application code there are other objects you have
>>> forgotten to log the memory for.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> I am running this job on Blue Waters.
>>> >>>> I am using the 7-point FD stencil in 3D.
>>> >>>>
>>> >>>> I thought so on both counts.
>>> >>>>
>>> >>>> I apologize that I made a stupid mistake in
>>> computing the memory per core. My settings meant
>>> each core could access only 2 GB of memory on average,
>>> instead of the 8 GB I mentioned in a previous email. I
>>> re-ran the job with 8 GB of memory per core on average
>>> and there is no "Out Of Memory" error. I will do
>>> more tests to see if there is still some memory issue.
>>> >>>>
>>> >>>> Ok. I'd still like to know where the memory was
>>> being used since my estimates were off.
>>> >>>>
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Dave
>>> >>>>
>>> >>>> Regards,
>>> >>>> Frank
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>> >>>>> Hi Frank,
>>> >>>>>
>>> >>>>>
>>> >>>>> On 11 July 2016 at 19:14, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>> Hi Dave,
>>> >>>>>
>>> >>>>> I re-ran the test using bjacobi as the
>>> preconditioner on the coarse mesh of telescope. The
>>> grid is 3072*256*768 and the process mesh is 96*8*24.
>>> The PETSc options file is attached.
>>> >>>>> I still got the "Out Of Memory" error. The
>>> error occurred before the linear solver finished one
>>> step, so I don't have the full info from ksp_view.
>>> The info from ksp_view_pre is attached.
>>> >>>>>
>>> >>>>> Okay - that is essentially useless (sorry)
>>> >>>>>
>>> >>>>> It seems to me that the error occurred when
>>> the decomposition was going to be changed.
>>> >>>>>
>>> >>>>> Based on what information?
>>> >>>>> Running with -info would give us more clues,
>>> but will create a ton of output.
>>> >>>>> Please try running the case that failed with
>>> -info.
>>> >>>>> I ran another test with a grid of
>>> 1536*128*384 and the same process mesh as above.
>>> There was no error. The ksp_view info is attached
>>> for comparison.
>>> >>>>> Thank you.
>>> >>>>>
>>> >>>>>
>>> >>>>> [3] Here is my crude estimate of your memory
>>> usage.
>>> >>>>> I'll target the biggest memory hogs only to
>>> get an order of magnitude estimate
>>> >>>>>
>>> >>>>> * The Fine grid operator contains 4223139840
>>> non-zeros --> 1.8 GB per MPI rank assuming double
>>> precision.
>>> >>>>> The indices for the AIJ could amount to
>>> another 0.3 GB (assuming 32 bit integers)
>>> >>>>>
>>> >>>>> * You use 5 levels of coarsening, so the other
>>> operators should represent (collectively)
>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300
>>> MB per MPI rank on the communicator with 18432 ranks.
>>> >>>>> The coarse grid should consume ~ 0.5 MB per
>>> MPI rank on the communicator with 18432 ranks.
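>>> >>>>> Summed explicitly (each coarsening level shrinks the grid
>>> >>>>> by 2^3 = 8), that geometric series is
>>> >>>>> 2.1 GB * (1/8 + 1/64 + 1/512 + 1/4096) ~ 2.1 GB * 0.143 ~ 0.30 GB,
>>> >>>>> consistent with the ~300 MB figure.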
>>> >>>>>
>>> >>>>> * You use a reduction factor of 64, making the
>>> new communicator with 288 MPI ranks.
>>> >>>>> PCTelescope will first gather a temporary
>>> matrix associated with your coarse level operator
>>> assuming a comm size of 288 living on the comm with
>>> size 18432.
>>> >>>>> This matrix will require approximately 0.5 *
>>> 64 = 32 MB per core on the 288 ranks.
>>> >>>>> This matrix is then used to form a new MPIAIJ
>>> matrix on the subcomm, thus requiring another 32 MB
>>> per rank.
>>> >>>>> The temporary matrix is now destroyed.
>>> >>>>>
>>> >>>>> * Because a DMDA is detected, a permutation
>>> matrix is assembled.
>>> >>>>> This requires 2 doubles per point in the DMDA.
>>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>> >>>>> Thus the permutation matrix will require < 1
>>> MB per MPI rank on the sub-comm.
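>>> >>>>> Checking that count: 92 x 16 x 48 = 70656 points, and
>>> >>>>> 70656 * 2 * 8 bytes ~ 1.1 MB in total across the sub-comm,
>>> >>>>> so well under 1 MB per rank.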
>>> >>>>>
>>> >>>>> * Lastly, the matrix is permuted. This uses
>>> MatPtAP(), but the resulting operator will have the
>>> same memory footprint as the unpermuted matrix (32
>>> MB). At any stage in PCTelescope, only 2 operators
>>> of size 32 MB are held in memory when the DMDA is
>>> provided.
>>> >>>>>
>>> >>>>> From my rough estimates, the worst-case memory
>>> footprint for any given core, given your options, is
>>> approximately
>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>> >>>>> This is way below 8 GB.
>>> >>>>>
>>> >>>>> Note this estimate completely ignores:
>>> >>>>> (1) the memory required for the restriction
>>> operator,
>>> >>>>> (2) the potential growth in the number of
>>> non-zeros per row due to Galerkin coarsening (I
>>> wish -ksp_view_pre reported the output from
>>> MatView so we could see the number of non-zeros
>>> required by the coarse-level operators)
>>> >>>>> (3) all temporary vectors required by the CG
>>> solver, and those required by the smoothers.
>>> >>>>> (4) internal memory allocated by MatPtAP
>>> >>>>> (5) memory associated with IS's used within
>>> PCTelescope
>>> >>>>>
>>> >>>>> So either I am completely off in my estimates,
>>> or you have not carefully estimated the memory usage
>>> of your application code. Hopefully others might
>>> examine/correct my rough estimates
>>> >>>>>
>>> >>>>> Since I don't have your code I cannot assess
>>> the latter.
>>> >>>>> Since I don't have access to the same machine
>>> you are running on, I think we need to take a step back.
>>> >>>>>
>>> >>>>> [1] What machine are you running on? Send me a
>>> URL if it's available.
>>> >>>>>
>>> >>>>> [2] What discretization are you using? (I am
>>> guessing a scalar 7-point FD stencil.)
>>> >>>>> If it's a 7-point FD stencil, we should be
>>> able to examine the memory usage of your solver
>>> configuration using a standard, lightweight,
>>> existing PETSc example run on your machine at the
>>> same scale.
>>> >>>>> This would hopefully enable us to correctly
>>> evaluate the actual memory usage required by the
>>> solver configuration you are using.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Dave
>>> >>>>>
>>> >>>>>
>>> >>>>> Frank
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>> >>>>>>
>>> >>>>>> On Saturday, 9 July 2016, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>> Hi Barry and Dave,
>>> >>>>>>
>>> >>>>>> Thank both of you for the advice.
>>> >>>>>>
>>> >>>>>> @Barry
>>> >>>>>> I made a mistake in the file names in the last
>>> email. I attached the correct files this time.
>>> >>>>>> For all three tests, 'Telescope' is used
>>> as the coarse preconditioner.
>>> >>>>>>
>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>> >>>>>> Part of the memory usage: Vector 125 124 3971904 0.
>>> >>>>>>                           Matrix 101 101 9462372 0.
>>> >>>>>>
>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>> >>>>>> Part of the memory usage: Vector 125 124 681672 0.
>>> >>>>>>                           Matrix 101 101 1462180 0.
>>> >>>>>>
>>> >>>>>> In theory, the per-process memory usage in Test1 should
>>> be 8 times that of Test2. In my case, it is about 6 times.
>>> >>>>>>
>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh:
>>> 96*8*24. Sub-domain per process: 32*32*32
>>> >>>>>> Here I get the out-of-memory error.
>>> >>>>>>
>>> >>>>>> I tried to use -mg_coarse jacobi. In this
>>> way, I don't need to set -mg_coarse_ksp_type and
>>> -mg_coarse_pc_type explicitly, right?
>>> >>>>>> The linear solver didn't work in this case;
>>> PETSc output some errors.
>>> >>>>>>
>>> >>>>>> @Dave
>>> >>>>>> In test3, I use only one instance of
>>> 'Telescope'. On the coarse mesh of 'Telescope', I
>>> used LU as the preconditioner instead of SVD.
>>> >>>>>> If I set the levels correctly, then on the
>>> last coarse mesh of MG, where it calls 'Telescope',
>>> the sub-domain per process is 2*2*2.
>>> >>>>>> On the last coarse mesh of 'Telescope', there
>>> is only one grid point per process.
>>> >>>>>> I still got the OOM error. The detailed PETSc
>>> options file is attached.
>>> >>>>>>
>>> >>>>>> Do you understand the expected memory usage
>>> for the particular parallel LU implementation you
>>> are using? I don't (seriously). Replace LU with
>>> bjacobi and re-run this test. My point about solver
>>> debugging is still valid.
>>> >>>>>>
>>> >>>>>> And please send the result of KSPView so we
>>> can see what is actually used in the computations
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Dave
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi Barry,
>>> >>>>>>
>>> >>>>>> Thank you for your advice.
>>> >>>>>> I tried three tests. In the 1st test, the grid
>>> is 3072*256*768 and the process mesh is 96*8*24.
>>> >>>>>> The linear solver is 'cg', the preconditioner
>>> is 'mg', and 'telescope' is used as the
>>> preconditioner on the coarse mesh.
>>> >>>>>> The system gives me the "Out of Memory" error
>>> before the linear system is completely solved.
>>> >>>>>> The info from '-ksp_view_pre' is attached. It
>>> seems to me that the error occurs when it reaches
>>> the coarse mesh.
>>> >>>>>>
>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and
>>> a process mesh of 96*8*24. The 3rd test uses the same
>>> grid but a different process mesh, 48*4*12.
>>> >>>>>> Are you sure this is right? The total
>>> matrix and vector memory usage goes from the 2nd test
>>> >>>>>> Vector 384 383 8,193,712 0.
>>> >>>>>> Matrix 103 103 11,508,688 0.
>>> >>>>>> to the 3rd test
>>> >>>>>> Vector 384 383 1,590,520 0.
>>> >>>>>> Matrix 103 103 3,508,664 0.
>>> >>>>>> That is, the memory usage got smaller, but if
>>> you have only 1/8th the processes and the same grid
>>> it should have gotten about 8 times bigger. Did you
>>> maybe cut the grid by a factor of 8 also? If so, that
>>> still doesn't explain it, because the memory usage
>>> changed by a factor of 5-something for the vectors
>>> and 3-something for the matrices.
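>>> >>>>>> (Explicitly: 8,193,712 / 1,590,520 ~ 5.2 for the
>>> vectors and 11,508,688 / 3,508,664 ~ 3.3 for the matrices.)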
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> The linear solver and PETSc options in the 2nd
>>> and 3rd tests are the same as in the 1st test. The linear
>>> solver works fine in both tests.
>>> >>>>>> I attached the memory usage of the 2nd and
>>> 3rd tests. The memory info is from the option
>>> '-log_summary'. I tried to use '-memory_info' as you
>>> suggested, but in my case PETSc treated it as an
>>> unused option. It output nothing about the memory.
>>> Do I need to add something to my code so I can use
>>> '-memory_info'?
>>> >>>>>> Sorry, my mistake, the option is -memory_view.
>>> >>>>>>
>>> >>>>>> Can you run the one case with -memory_view
>>> and -mg_coarse jacobi -ksp_max_it 1 (just so it
>>> doesn't iterate forever) to see how much memory is
>>> used without the telescope? Also run case 2 the same
>>> way.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> In both tests the memory usage is not large.
>>> >>>>>>
>>> >>>>>> It seems to me that it might be the
>>> 'telescope' preconditioner that allocated a lot of
>>> memory and caused the error in the 1st test.
>>> >>>>>> Is there a way to show how much memory it
>>> allocated?
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>> >>>>>> Frank,
>>> >>>>>>
>>> >>>>>> You can run with -ksp_view_pre to have
>>> it "view" the KSP before the solve so hopefully it
>>> gets that far.
>>> >>>>>>
>>> >>>>>> Please run the problem that does fit
>>> with -memory_info; when the problem completes it will
>>> show the "high water mark" for PETSc-allocated
>>> memory and total memory used. We first want to look
>>> at these numbers to see if it is using more memory
>>> than you expect. You could also run with, say, half
>>> the grid spacing to see how the memory usage scales
>>> with the increase in grid points. Make the runs also
>>> with -log_view and send all the output from these
>>> options.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>> <hengjiew at uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> I am using the CG KSP solver and a multigrid
>>> preconditioner to solve a linear system in parallel.
>>> >>>>>> I chose 'Telescope' as the
>>> preconditioner on the coarse mesh for its good
>>> performance.
>>> >>>>>> The PETSc options file is attached.
>>> >>>>>>
>>> >>>>>> The domain is a 3D box.
>>> >>>>>> It works well when the grid is 1536*128*384
>>> and the process mesh is 96*8*24. When I double the
>>> size of the grid and keep
>>> the same process mesh and PETSc options, I get an
>>> "out of memory" error from the supercomputer I am using.
>>> >>>>>> Each process has access to at least 8 GB of
>>> memory, which should be more than enough for my
>>> application. I am sure that all the other parts of
>>> my code (except the linear solver) do not use much
>>> memory, so I suspect there is something wrong with
>>> the linear solver.
>>> >>>>>> The error occurs before the linear system is
>>> completely solved, so I don't have the info from
>>> ksp_view. I am not able to reproduce the error with a
>>> smaller problem either.
>>> >>>>>> In addition, I tried to use block Jacobi
>>> as the preconditioner with the same grid and the same
>>> decomposition. The linear solver runs extremely slowly,
>>> but there is no memory error.
>>> >>>>>>
>>> >>>>>> How can I diagnose what exactly causes the error?
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>> <petsc_options.txt>
>>> >>>>>>
>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>> >
>>>
>>
>
>
-------------- next part --------------
KSP Object: 32768 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 32768 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=5 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 32768 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 32768 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 32768 , subcomm_size = 512
Telescope: subcomm type: interlaced
Telescope: DMDA detected
DMDA Object: (mg_coarse_telescope_repart_) 512 MPI processes
M 64 N 64 P 64 m 8 n 8 p 8 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 512 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 512 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
type: redundant
Redundant preconditioner: First (color=0) of 512 PCs follows
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 512 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following KSP and PC objects:
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
package used to perform factorization: petsc
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=2097152, cols=2097152
total: nonzeros=56623104, allocated nonzeros=56623104
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=16777216, cols=16777216
total: nonzeros=452984832, allocated nonzeros=452984832
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=3623878656, allocated nonzeros=3623878656
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4 -------------------------------
KSP Object: (mg_levels_4_) 32768 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_4_) 32768 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 32768 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
32768 processors, by hengjie Fri Sep 16 04:29:10 2016
Using Petsc Development GIT revision: v3.7.3-1056-geeb1ceb GIT Date: 2016-08-02 10:00:58 -0500
Max Max/Min Avg Total
Time (sec): 3.595e+01 1.00092 3.595e+01
Objects: 4.240e+02 1.61217 2.655e+02
Flops: 7.348e+07 1.09866 6.699e+07 2.195e+12
Flops/sec: 2.044e+06 1.09875 1.863e+06 6.106e+10
Memory: 1.110e+09 1.00000 3.636e+13
MPI Messages: 5.004e+04 11.27696 4.668e+03 1.530e+08
MPI Message Lengths: 4.805e+06 1.27794 8.088e+02 1.237e+11
MPI Reductions: 2.296e+03 1.48994
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 3.5947e+01 100.0% 2.1951e+12 100.0% 1.530e+08 100.0% 8.088e+02 100.0% 1.551e+03 67.5%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecTDot 30 1.0 1.9905e-01 1.4 1.97e+06 1.0 0.0e+00 0.0e+00 6.0e+01 1 3 0 0 3 1 3 0 0 4 323650
VecNorm 16 1.0 3.9425e-01 3.5 1.05e+06 1.0 0.0e+00 0.0e+00 3.2e+01 1 2 0 0 1 1 2 0 0 2 87152
VecScale 75 1.7 2.3286e-02 2.0 4.52e+04 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 50363
VecCopy 17 1.0 3.8621e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 442 1.7 9.8095e-03 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 60 1.0 3.5868e-02 1.3 3.93e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 6 0 0 0 0 6 0 0 0 3592294
VecAYPX 119 1.3 1.7319e-02 1.3 1.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 3728684
VecAssemblyBegin 1 1.0 1.0757e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 1 1.0 2.7490e-04 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 471 1.5 5.8588e-02 3.4 0.00e+00 0.0 1.2e+08 8.1e+02 0.0e+00 0 0 81 81 0 0 0 81 81 0 0
VecScatterEnd 471 1.5 1.2934e+00 6.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatMult 135 1.3 2.8880e-01 1.4 2.33e+07 1.0 5.0e+07 1.7e+03 0.0e+00 1 34 32 66 0 1 34 32 66 0 2597254
MatMultAdd 90 1.5 1.1149e-01 2.9 3.85e+06 1.0 1.4e+07 3.2e+02 0.0e+00 0 6 9 4 0 0 6 9 4 0 1114404
MatMultTranspose 111 1.4 3.0435e-01 1.3 4.11e+06 1.0 1.7e+07 2.8e+02 8.0e+01 1 6 11 4 3 1 6 11 4 5 435479
MatSolve 15 0.0 2.0206e-02 0.0 3.26e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 82513
MatSOR 180 1.5 9.9816e-01 1.3 2.32e+07 1.0 3.9e+07 2.4e+02 1.2e+00 2 33 25 8 0 2 33 25 8 0 727846
MatLUFactorNum 1 0.0 2.4225e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 33762
MatILUFactorSym 1 0.0 2.5048e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatConvert 1 0.0 7.5793e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatResidual 90 1.5 3.7126e-01 1.2 1.11e+07 1.0 4.2e+07 8.0e+02 6.0e+01 1 16 27 27 3 1 16 27 27 4 942007
MatAssemblyBegin 33 1.4 7.2762e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 4.8e+01 2 0 0 0 2 2 0 0 0 3 0
MatAssemblyEnd 33 1.4 1.4643e+00 1.1 0.00e+00 0.0 1.1e+07 1.2e+02 2.5e+02 4 0 7 1 11 4 0 7 1 16 0
MatGetRowIJ 1 0.0 1.5974e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetSubMatrice 2 2.0 3.4627e-01 3.7 0.00e+00 0.0 1.6e+05 5.4e+02 6.1e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 0.0 1.9929e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatView 13 2.2 1.0639e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 1 0 0 0 0 1 0
MatPtAP 7 1.4 5.2281e+00 1.0 5.25e+06 1.0 2.4e+07 8.8e+02 2.1e+02 14 8 15 17 9 14 8 15 17 14 31939
MatPtAPSymbolic 7 1.4 4.0818e+00 1.0 0.00e+00 0.0 1.4e+07 1.1e+03 7.5e+01 11 0 9 12 3 11 0 9 12 5 0
MatPtAPNumeric 7 1.4 1.1755e+00 1.0 5.25e+06 1.0 9.6e+06 5.7e+02 1.4e+02 3 8 6 4 6 3 8 6 4 9 142046
MatRedundantMat 1 0.0 1.3647e-02 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.8e-02 0 0 0 0 0 0 0 0 0 0 0
MatMPIConcateSeq 1 0.0 2.7197e-01 0.0 0.00e+00 0.0 2.7e+04 4.0e+01 6.1e-01 0 0 0 0 0 0 0 0 0 0 0
MatGetLocalMat 7 1.4 1.3259e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 7 1.4 6.9566e-02 2.8 0.00e+00 0.0 1.1e+07 1.1e+03 0.0e+00 0 0 7 10 0 0 0 7 10 0 0
MatGetSymTrans 14 1.4 2.2139e-02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMCoarsen 6 1.5 3.3237e-01 1.1 0.00e+00 0.0 1.6e+06 1.7e+02 2.1e+02 1 0 1 0 9 1 0 1 0 13 0
DMCreateInterp 6 1.5 7.6958e-01 1.1 2.57e+05 1.0 2.8e+06 1.6e+02 2.0e+02 2 0 2 0 9 2 0 2 0 13 10763
KSPSetUp 12 2.0 1.1138e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.5e+01 0 0 0 0 2 0 0 0 0 2 0
KSPSolve 1 1.0 1.2628e+01 1.0 7.35e+07 1.1 1.5e+08 8.0e+02 1.4e+03 35100 99 99 59 35100 99 99 87 173826
PCSetUp 3 3.0 9.2140e+00 1.1 7.10e+06 1.3 2.9e+07 7.4e+02 7.9e+02 23 8 19 18 34 23 8 19 18 51 19110
PCSetUpOnBlocks 15 0.0 2.8822e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 28377
PCApply 15 1.0 3.5384e+00 1.0 5.58e+07 1.1 1.2e+08 6.3e+02 3.7e+02 10 74 79 62 16 10 74 79 62 24 457052
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 197 197 4396000 0.
Vector Scatter 27 27 333392 0.
Matrix 66 66 14132608 0.
Matrix Null Space 1 1 592 0.
Distributed Mesh 8 8 40832 0.
Star Forest Bipartite Graph 16 16 13568 0.
Discrete System 8 8 7008 0.
Index Set 60 60 341672 0.
IS L to G Mapping 8 8 195776 0.
Krylov Solver 12 12 14760 0.
DMKSP interface 6 6 3888 0.
Preconditioner 12 12 11928 0.
Viewer 3 2 1664 0.
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 0.000146198
Average time for zero size MPI_Send(): 3.66852e-06