[petsc-users] Performance of the Telescope Multigrid Preconditioner
frank
hengjiew at uci.edu
Thu Oct 6 19:33:16 CDT 2016
Dear Dave,
Following your advice, I solved the identical equation twice and timed the
two solves separately. The results are below:
Test: 1024^3 grid points
Cores#   Reduction factor   MG levels#    Time of 1st solve   Time of 2nd solve
4096     64                 6 + 3         3.85                1.75
8192     128                5 + 3         5.52                0.91
16384    256                5 + 3         5.37                0.52
32768    512                5 + 4         3.03                0.36
32768    64 | 8             4 | 3 | 3     2.80                0.43
65536    1024               5 + 4         3.38                0.59
65536    32 | 32            4 | 4 | 3     2.14                0.22
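As a quick check on the table above, the size of the final sub-communicator can be worked out by dividing the core count by each reduction factor in turn. The sketch below assumes that the "a | b" entries are reduction factors applied one after another; the helper name `subcomm_sizes` is made up for illustration. The first row agrees with the attached ksp_view, which reports comm_size = 4096 and subcomm_size = 64.

```python
def subcomm_sizes(cores, reduction_factors):
    """Return the communicator size after each successive reduction."""
    sizes = []
    size = cores
    for factor in reduction_factors:
        if size % factor != 0:
            raise ValueError(f"{size} ranks cannot be reduced evenly by {factor}")
        size //= factor
        sizes.append(size)
    return sizes


# (cores, reduction factors) for each row of the table.
runs = [
    (4096, [64]),
    (8192, [128]),
    (16384, [256]),
    (32768, [512]),
    (32768, [64, 8]),
    (65536, [1024]),
    (65536, [32, 32]),
]

for cores, factors in runs:
    print(cores, factors, subcomm_sizes(cores, factors))
```

Under that assumption, every run in the table ends on a sub-communicator of 64 ranks; the runs differ in how the levels are split across the communicators.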
I also attached the log_view output from all the runs. The files are named
by cores# + reduction factor.
The ksp_view and petsc_options for the 1st run are also included.
The others are similar. The only differences are the reduction factor and
the number of MG levels.
** The time for the 1st solve is generally much larger. Is this because
the ksp solver on the sub-communicator is set up during the 1st solve?
** The time for the 1st solve does not scale.
In practice, I am solving a variable-coefficient Poisson equation.
I need to rebuild the matrix every time step. Therefore, each step is
similar to the 1st solve, which does not scale. Is there a way I can
improve the performance?
** The 2nd solve scales, but not very well beyond 16384 cores.
It seems to me that the performance depends on the tuning of the MG
levels on the sub-communicator(s).
Are there any general strategies for distributing the
levels, or for deciding when to use multiple sub-communicators?
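To put a number on "scales, but not very well", here is a rough sketch that computes the strong-scaling efficiency of the 2nd solve relative to the 4096-core run, using only the times from the table. The variable and function names are illustrative, and the choice of 4096 cores as the baseline is an assumption.

```python
# 2nd-solve times from the table, runs with a single sub-communicator.
second_solve = {4096: 1.75, 8192: 0.91, 16384: 0.52, 32768: 0.36, 65536: 0.59}
# 2nd-solve times for the runs with two reduction factors.
second_solve_nested = {32768: 0.43, 65536: 0.22}

BASE_CORES = 4096
BASE_TIME = second_solve[BASE_CORES]


def efficiency(cores, time):
    """Strong-scaling efficiency relative to the 4096-core run (1.0 is ideal)."""
    speedup = BASE_TIME / time
    ideal_speedup = cores / BASE_CORES
    return speedup / ideal_speedup


for cores, t in sorted(second_solve.items()):
    print(f"{cores:6d} cores: efficiency {efficiency(cores, t):.2f}")
for cores, t in sorted(second_solve_nested.items()):
    print(f"{cores:6d} cores (two reductions): efficiency {efficiency(cores, t):.2f}")
```

An efficiency of 1.0 means the time halves every time the core count doubles; values well below 1.0 at the largest core counts are what the question above is about.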
Thank you.
Regards,
Frank
On 10/04/2016 12:56 PM, Dave May wrote:
>
>
> On Tuesday, 4 October 2016, frank <hengjiew at uci.edu> wrote:
>
> Hi,
>
> This question is a follow-up to the thread "Question about memory
> usage in Multigrid preconditioner".
> I used to hit the "Out of Memory (OOM)" problem when using the
> CG+Telescope MG solver with 32768 cores. Adding the options "-matrap 0
> -matptap_scalable" did solve that problem.
>
> Then I tested the scalability by solving a 3d Poisson equation for 1
> step. I used one sub-communicator in all the tests. The differences
> between the petsc options in those tests are: (1) the
> pc_telescope_reduction_factor; (2) the number of multigrid levels in
> the up/down solver. The function "ksp_solve" is timed. It is kind
> of slow and doesn't scale at all.
>
> Test1: 512^3 grid points
> Core#   telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)
> 512     8                            4 / 3                           6.2466
> 4096    64                           5 / 3                           0.9361
> 32768   64                           4 / 3                           4.8914
>
> Test2: 1024^3 grid points
> Core#   telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)
> 4096    64                           5 / 4                           3.4139
> 8192    128                          5 / 4                           2.4196
> 16384   32                           5 / 3                           5.4150
> 32768   64                           5 / 3                           5.6067
> 65536   128                          5 / 3                           6.5219
>
>
> You have to be very careful how you interpret these numbers. Your
> solver contains nested calls to KSPSolve, and unfortunately as a
> result the numbers you report include setup time. This will remain
> true even if you call KSPSetUp on the outermost KSP.
>
> Your email concerns scalability of the solver application, so let's
> focus on that issue.
>
> The only way to clearly separate setup from solve time is to perform
> two identical solves. The second solve will not require any setup. You
> should monitor the second solve via a new PetscStage.
>
> This was what I did in the telescope paper. It was the only way to
> understand the setup cost (and scaling) cf the solve time (and scaling).
>
> Thanks
> Dave
>
> I guess I didn't set the MG levels properly. What would be the
> efficient way to arrange the MG levels?
> Also, which preconditioner at the coarse mesh of the 2nd
> communicator should I use to improve the performance?
>
> I attached the test code and the petsc options file for the 1024^3
> cube with 32768 cores.
>
> Thank you.
>
> Regards,
> Frank
>
>
>
>
>
>
> On 09/15/2016 03:35 AM, Dave May wrote:
>> Hi all,
>>
>> The only unexpected memory usage I can see is associated with
>> the call to MatPtAP().
>> Here is something you can try immediately.
>> Run your code with the additional options
>> -matrap 0 -matptap_scalable
>>
>> I didn't realize this before, but the default behaviour of
>> MatPtAP in parallel is actually to explicitly form the
>> transpose of P (e.g. assemble R = P^T) and then compute R.A.P.
>> You don't want to do this. The option -matrap 0 resolves this issue.
>>
>> The implementation of P^T.A.P has two variants.
>> The scalable implementation (with respect to memory usage) is
>> selected via the second option -matptap_scalable.
>>
>> Try it out - I see a significant memory reduction using these
>> options for particular mesh sizes / partitions.
>>
>> I've attached a cleaned up version of the code you sent me.
>> There were a number of memory leaks and other issues.
>> The main points being
>> * You should call DMDAVecGetArrayF90() before
>> VecAssembly{Begin,End}
>> * You should call PetscFinalize(), otherwise the option
>> -log_summary (-log_view) will not display anything once the
>> program has completed.
>>
>>
>> Thanks,
>> Dave
>>
>>
>> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew at uci.edu> wrote:
>>
>> Hi Dave,
>>
>> Sorry, I should have put in more comments to explain the code.
>> The number of processes in each dimension is the same: Px =
>> Py=Pz=P. So is the domain size.
>> So if you want to run the code for 512^3 grid points
>> on 16^3 cores, you need to set "-N 512 -P 16" on the command
>> line.
>> I added more comments and also fixed an error in the attached
>> code. (The error only affects the accuracy of the solution but
>> not the memory usage.)
>>
>> Thank you.
>> Frank
>>
>>
>> On 9/14/2016 9:05 PM, Dave May wrote:
>>>
>>>
>>> On Thursday, 15 September 2016, Dave May
>>> <dave.mayhem23 at gmail.com> wrote:
>>>
>>>
>>>
>>> On Thursday, 15 September 2016, frank <hengjiew at uci.edu>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I write a simple code to re-produce the error. I
>>> hope this can help to diagnose the problem.
>>> The code just solves a 3d poisson equation.
>>>
>>>
>>> Why is the stencil width a runtime parameter?? And why
>>> is the default value 2? For 7-pnt FD Laplace, you only
>>> need a stencil width of 1.
>>>
>>> Was this choice made to mimic something in the
>>> real application code?
>>>
>>>
>>> Please ignore - I misunderstood your usage of the param set
>>> by -P
>>>
>>>
>>> I ran the code on a 1024^3 mesh. The process
>>> partition is 32 * 32 * 32. That's when I reproduced
>>> the OOM error. Each core has about 2G memory.
>>> I also ran the code on a 512^3 mesh with 16 * 16 *
>>> 16 processes. The ksp solver works fine.
>>> I attached the code, ksp_view_pre's output and my
>>> petsc option file.
>>>
>>> Thank you.
>>> Frank
>>>
>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>> Hi Barry,
>>>>
>>>> I checked. On the supercomputer, I had the option
>>>> "-ksp_view_pre" but it is not in file I sent you. I
>>>> am sorry for the confusion.
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>> On Friday, September 9, 2016, Barry Smith
>>>> <bsmith at mcs.anl.gov> wrote:
>>>>
>>>>
>>>> > On Sep 9, 2016, at 3:11 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >
>>>> > Hi Barry,
>>>> >
>>>> > I think the first KSP view output is from
>>>> -ksp_view_pre. Before I submitted the test, I
>>>> was not sure whether there would be OOM error
>>>> or not. So I added both -ksp_view_pre and
>>>> -ksp_view.
>>>>
>>>> But the options file you sent specifically
>>>> does NOT list the -ksp_view_pre so how could it
>>>> be from that?
>>>>
>>>> Sorry to be pedantic but I've spent too much
>>>> time in the past trying to debug from incorrect
>>>> information and want to make sure that the
>>>> information I have is correct before thinking.
>>>> Please recheck exactly what happened. Rerun
>>>> with the exact input file you emailed if that
>>>> is needed.
>>>>
>>>> Barry
>>>>
>>>> >
>>>> > Frank
>>>> >
>>>> >
>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>> >> Why does ksp_view2.txt have two KSP views
>>>> in it while ksp_view1.txt has only one KSPView
>>>> in it? Did you run two different solves in the
>>>> 2 case but not the one?
>>>> >>
>>>> >> Barry
>>>> >>
>>>> >>
>>>> >>
>>>> >>> On Sep 9, 2016, at 10:56 AM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I want to continue digging into the memory
>>>> problem here.
>>>> >>> I did find a workaround in the past, which
>>>> is to use fewer cores per node so that each core
>>>> has 8G memory. However this is inefficient and
>>>> expensive. I hope to locate the place that uses
>>>> the most memory.
>>>> >>>
>>>> >>> Here is a brief summary of the tests I did
>>>> in past:
>>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh
>>>> 48*4*12
>>>> >>> Maximum (over computational time) process
>>>> memory: total 7.0727e+08
>>>> >>> Current process memory: total 7.0727e+08
>>>> >>> Maximum (over computational time) space
>>>> PetscMalloc()ed: total 6.3908e+11
>>>> >>> Current space PetscMalloc()ed:
>>>> total 1.8275e+09
>>>> >>>
>>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh
>>>> 96*8*24
>>>> >>> Maximum (over computational time) process
>>>> memory: total 5.9431e+09
>>>> >>> Current process memory: total 5.9431e+09
>>>> >>> Maximum (over computational time) space
>>>> PetscMalloc()ed: total 5.3202e+12
>>>> >>> Current space PetscMalloc()ed:
>>>> total 5.4844e+09
>>>> >>>
>>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh
>>>> 96*8*24
>>>> >>> OOM( Out Of Memory ) killer of the
>>>> supercomputer terminated the job during "KSPSolve".
>>>> >>>
>>>> >>> I attached the output of ksp_view( the
>>>> third test's output is from ksp_view_pre ),
>>>> memory_view and also the petsc options.
>>>> >>>
>>>> >>> In all the tests, each core can access
>>>> about 2G memory. In test3, there are 4223139840
>>>> non-zeros in the matrix. This will consume
>>>> about 1.74M, using double precision.
>>>> Considering some extra memory used to store
>>>> integer index, 2G memory should still be way
>>>> enough.
>>>> >>>
>>>> >>> Is there a way to find out which part of
>>>> KSPSolve uses the most memory?
>>>> >>> Thank you so much.
>>>> >>>
>>>> >>> BTW, there are 4 options that remain unused and
>>>> I don't understand why they are omitted:
>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type
>>>> value: preonly
>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type
>>>> value: bjacobi
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it
>>>> value: 1
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type
>>>> value: richardson
>>>> >>>
>>>> >>>
>>>> >>> Regards,
>>>> >>> Frank
>>>> >>>
>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>> >>>>
>>>> >>>> On 14 July 2016 at 01:07, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>> Hi Dave,
>>>> >>>>
>>>> >>>> Sorry for the late reply.
>>>> >>>> Thank you so much for your detailed reply.
>>>> >>>>
>>>> >>>> I have a question about the estimation of
>>>> the memory usage. There are 4223139840
>>>> allocated non-zeros and 18432 MPI processes.
>>>> Double precision is used. So the memory per
>>>> process is:
>>>> >>>> 4223139840 * 8bytes / 18432 / 1024 / 1024
>>>> = 1.74M ?
>>>> >>>> Did I do sth wrong here? Because this
>>>> seems too small.
>>>> >>>>
>>>> >>>> No - I totally f***ed it up. You are
>>>> correct. That'll teach me for fumbling around
>>>> with my iphone calculator and not using my
>>>> brain. (Note that to convert to MB just divide
>>>> by 1e6, not 1024^2 - although I apparently
>>>> cannot convert between units correctly....)
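The per-process estimate discussed above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation quoted in the message (non-zeros times 8 bytes, divided by the number of MPI processes); it counts matrix values only and ignores index storage. The variable names are illustrative.

```python
# Figures quoted in the message.
allocated_nonzeros = 4223139840
mpi_processes = 18432
bytes_per_double = 8

# Bytes of matrix values held by each process, assuming an even distribution.
bytes_per_process = allocated_nonzeros * bytes_per_double // mpi_processes

mib_per_process = bytes_per_process / 1024**2   # dividing by 1024^2, as in the message
mb_per_process = bytes_per_process / 1e6        # dividing by 1e6, as suggested in the reply

print(f"{bytes_per_process} bytes per process")
print(f"{mib_per_process:.2f} MiB, {mb_per_process:.2f} MB")
```

Either way the matrix values come to under 2 MB per process, which is consistent with the 1.74M figure in the message rather than anything on the order of gigabytes.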
>>>> >>>>
>>>> >>>> From the PETSc objects associated with the
>>>> solver, It looks like it _should_ run with 2GB
>>>> per MPI rank. Sorry for my mistake.
>>>> Possibilities are: somewhere in your usage of
>>>> PETSc you've introduced a memory leak; PETSc is
>>>> doing a huge over allocation (e.g. as per our
>>>> discussion of MatPtAP); or in your application
>>>> code there are other objects you have forgotten
>>>> to log the memory for.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> I am running this job on Bluewater
>>>> >>>> I am using the 7 points FD stencil in 3D.
>>>> >>>>
>>>> >>>> I thought so on both counts.
>>>> >>>>
>>>> >>>> I apologize that I made a stupid mistake
>>>> in computing the memory per core. With my settings,
>>>> each core can access only 2G memory on
>>>> average instead of the 8G which I mentioned in the
>>>> previous email. I re-ran the job with 8G memory
>>>> per core on average and there was no "Out Of
>>>> Memory" error. I will do more tests to see if
>>>> there is still some memory issue.
>>>> >>>>
>>>> >>>> Ok. I'd still like to know where the
>>>> memory was being used since my estimates were off.
>>>> >>>>
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Dave
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Frank
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>> >>>>> Hi Frank,
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 11 July 2016 at 19:14, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>> Hi Dave,
>>>> >>>>>
>>>> >>>>> I re-run the test using bjacobi as the
>>>> preconditioner on the coarse mesh of telescope.
>>>> The Grid is 3072*256*768 and process mesh is
>>>> 96*8*24. The petsc option file is attached.
>>>> >>>>> I still got the "Out Of Memory" error.
>>>> The error occurred before the linear solver
>>>> finished one step. So I don't have the full
>>>> info from ksp_view. The info from ksp_view_pre
>>>> is attached.
>>>> >>>>>
>>>> >>>>> Okay - that is essentially useless (sorry)
>>>> >>>>>
>>>> >>>>> It seems to me that the error occurred
>>>> when the decomposition was going to be changed.
>>>> >>>>>
>>>> >>>>> Based on what information?
>>>> >>>>> Running with -info would give us more
>>>> clues, but will create a ton of output.
>>>> >>>>> Please try running the case which failed
>>>> with -info
>>>> >>>>> I had another test with a grid of
>>>> 1536*128*384 and the same process mesh as
>>>> above. There was no error. The ksp_view info is
>>>> attached for comparison.
>>>> >>>>> Thank you.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> [3] Here is my crude estimate of your
>>>> memory usage.
>>>> >>>>> I'll target the biggest memory hogs only
>>>> to get an order of magnitude estimate
>>>> >>>>>
>>>> >>>>> * The Fine grid operator contains
>>>> 4223139840 non-zeros --> 1.8 GB per MPI rank
>>>> assuming double precision.
>>>> >>>>> The indices for the AIJ could amount to
>>>> another 0.3 GB (assuming 32 bit integers)
>>>> >>>>>
>>>> >>>>> * You use 5 levels of coarsening, so the
>>>> other operators should represent (collectively)
>>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~
>>>> 300 MB per MPI rank on the communicator with
>>>> 18432 ranks.
>>>> >>>>> The coarse grid should consume ~ 0.5 MB
>>>> per MPI rank on the communicator with 18432 ranks.
>>>> >>>>>
>>>> >>>>> * You use a reduction factor of 64,
>>>> making the new communicator with 288 MPI ranks.
>>>> >>>>> PCTelescope will first gather a temporary
>>>> matrix associated with your coarse level
>>>> operator assuming a comm size of 288 living on
>>>> the comm with size 18432.
>>>> >>>>> This matrix will require approximately
>>>> 0.5 * 64 = 32 MB per core on the 288 ranks.
>>>> >>>>> This matrix is then used to form a new
>>>> MPIAIJ matrix on the subcomm, thus require
>>>> another 32 MB per rank.
>>>> >>>>> The temporary matrix is now destroyed.
>>>> >>>>>
>>>> >>>>> * Because a DMDA is detected, a
>>>> permutation matrix is assembled.
>>>> >>>>> This requires 2 doubles per point in the
>>>> DMDA.
>>>> >>>>> Your coarse DMDA contains 92 x 16 x 48
>>>> points.
>>>> >>>>> Thus the permutation matrix will require
>>>> < 1 MB per MPI rank on the sub-comm.
>>>> >>>>>
>>>> >>>>> * Lastly, the matrix is permuted. This
>>>> uses MatPtAP(), but the resulting operator will
>>>> have the same memory footprint as the
>>>> unpermuted matrix (32 MB). At any stage in
>>>> PCTelescope, only 2 operators of size 32 MB are
>>>> held in memory when the DMDA is provided.
>>>> >>>>>
>>>> >>>>> From my rough estimates, the worst case
>>>> memory foot print for any given core, given
>>>> your options is approximately
>>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB
>>>> = 2465 MB
>>>> >>>>> This is way below 8 GB.
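For readers who want to follow the sum, the sketch below replays the estimate exactly as written in the message: the per-rank figures are taken at face value (a reply elsewhere in this thread revises the fine-grid number), and the variable names are illustrative.

```python
# All inputs are the figures quoted in the message, in MB per MPI rank.
fine_operator_mb = 1800 + 300   # matrix values + 32-bit indices, as stated

# Four coarser levels, each 1/8 the size of the previous one in 3D.
coarser_levels_mb = sum(fine_operator_mb / 8**k for k in range(1, 5))

coarse_grid_mb = 0.5            # coarse grid operator on the 18432-rank communicator
reduction_factor = 64
gathered_mb = coarse_grid_mb * reduction_factor   # temporary gathered matrix
subcomm_mb = gathered_mb                          # new MPIAIJ matrix on the sub-communicator
permutation_mb = 1                                # upper bound quoted for the permutation matrix

total_mb = (fine_operator_mb + round(coarser_levels_mb)
            + gathered_mb + subcomm_mb + permutation_mb)

print(f"coarser levels: {coarser_levels_mb:.1f} MB")
print(f"worst-case total: {total_mb:.0f} MB")
```

The geometric sum for the coarser levels rounds to the 300 MB quoted, and the total matches the 2465 MB in the message.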
>>>> >>>>>
>>>> >>>>> Note this estimate completely ignores:
>>>> >>>>> (1) the memory required for the
>>>> restriction operator,
>>>> >>>>> (2) the potential growth in the number of
>>>> non-zeros per row due to Galerkin coarsening (I
>>>> wished -ksp_view_pre reported the output from
>>>> MatView so we could see the number of non-zeros
>>>> required by the coarse level operators)
>>>> >>>>> (3) all temporary vectors required by the
>>>> CG solver, and those required by the smoothers.
>>>> >>>>> (4) internal memory allocated by MatPtAP
>>>> >>>>> (5) memory associated with IS's used
>>>> within PCTelescope
>>>> >>>>>
>>>> >>>>> So either I am completely off in my
>>>> estimates, or you have not carefully estimated
>>>> the memory usage of your application code.
>>>> Hopefully others might examine/correct my rough
>>>> estimates
>>>> >>>>>
>>>> >>>>> Since I don't have your code I cannot
>>>> access the latter.
>>>> >>>>> Since I don't have access to the same
>>>> machine you are running on, I think we need to
>>>> take a step back.
>>>> >>>>>
>>>> >>>>> [1] What machine are you running on? Send
>>>> me a URL if its available
>>>> >>>>>
>>>> >>>>> [2] What discretization are you using? (I
>>>> am guessing a scalar 7 point FD stencil)
>>>> >>>>> If it's a 7 point FD stencil, we should
>>>> be able to examine the memory usage of your
>>>> solver configuration using a standard, light
>>>> weight existing PETSc example, run on your
>>>> machine at the same scale.
>>>> >>>>> This would hopefully enable us to
>>>> correctly evaluate the actual memory usage
>>>> required by the solver configuration you are using.
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Dave
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Frank
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>> >>>>>>
>>>> >>>>>> On Saturday, 9 July 2016, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>> Hi Barry and Dave,
>>>> >>>>>>
>>>> >>>>>> Thank both of you for the advice.
>>>> >>>>>>
>>>> >>>>>> @Barry
>>>> >>>>>> I made a mistake in the file names in
>>>> last email. I attached the correct files this time.
>>>> >>>>>> For all the three tests, 'Telescope' is
>>>> used as the coarse preconditioner.
>>>> >>>>>>
>>>> >>>>>> == Test1: Grid: 1536*128*384,
>>>> Process Mesh: 48*4*12
>>>> >>>>>> Part of the memory usage: Vector 125
>>>> 124 3971904 0.
>>>> >>>>>> Matrix 101 101
>>>> 9462372 0
>>>> >>>>>>
>>>> >>>>>> == Test2: Grid: 1536*128*384, Process
>>>> Mesh: 96*8*24
>>>> >>>>>> Part of the memory usage: Vector 125
>>>> 124 681672 0.
>>>> >>>>>> Matrix 101 101
>>>> 1462180 0.
>>>> >>>>>>
>>>> >>>>>> In theory, the memory usage in Test1
>>>> should be 8 times of Test2. In my case, it is
>>>> about 6 times.
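The "about 6 times" figure can be checked directly from the memory values listed for the two tests. This is a small sketch using only those numbers; the expected factor of 8 assumes memory scales inversely with the number of processes for a fixed grid.

```python
# Memory values reported for Vector and Matrix objects in the two tests.
test1 = {"Vector": 3971904, "Matrix": 9462372}   # process mesh 48*4*12
test2 = {"Vector": 681672, "Matrix": 1462180}    # process mesh 96*8*24

# Test2 uses 8 times as many processes on the same grid.
process_ratio = (96 * 8 * 24) // (48 * 4 * 12)

ratios = {name: test1[name] / test2[name] for name in test1}

print(f"expected ratio: {process_ratio}")
for name, ratio in ratios.items():
    print(f"{name}: observed ratio {ratio:.2f}")
```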
>>>> >>>>>>
>>>> >>>>>> == Test3: Grid: 3072*256*768, Process
>>>> Mesh: 96*8*24. Sub-domain per process: 32*32*32
>>>> >>>>>> Here I get the out of memory error.
>>>> >>>>>>
>>>> >>>>>> I tried to use -mg_coarse jacobi. In
>>>> this way, I don't need to set
>>>> -mg_coarse_ksp_type and -mg_coarse_pc_type
>>>> explicitly, right?
>>>> >>>>>> The linear solver didn't work in this
>>>> case. Petsc output some errors.
>>>> >>>>>>
>>>> >>>>>> @Dave
>>>> >>>>>> In test3, I use only one instance of
>>>> 'Telescope'. On the coarse mesh of 'Telescope',
>>>> I used LU as the preconditioner instead of SVD.
>>>> >>>>>> If I set the levels correctly, then on
>>>> the last coarse mesh of MG where it calls
>>>> 'Telescope', the sub-domain per process is 2*2*2.
>>>> >>>>>> On the last coarse mesh of 'Telescope',
>>>> there is only one grid point per process.
>>>> >>>>>> I still got the OOM error. The detailed
>>>> petsc option file is attached.
>>>> >>>>>>
>>>> >>>>>> Do you understand the expected memory
>>>> usage for the particular parallel LU
>>>> implementation you are using? I don't
>>>> (seriously). Replace LU with bjacobi and re-run
>>>> this test. My point about solver debugging is
>>>> still valid.
>>>> >>>>>>
>>>> >>>>>> And please send the result of KSPView so
>>>> we can see what is actually used in the
>>>> computations
>>>> >>>>>>
>>>> >>>>>> Thanks
>>>> >>>>>> Dave
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi Barry,
>>>> >>>>>>
>>>> >>>>>> Thank you for your advice.
>>>> >>>>>> I tried three tests. In the 1st test, the
>>>> grid is 3072*256*768 and the process mesh is
>>>> 96*8*24.
>>>> >>>>>> The linear solver is 'cg' the
>>>> preconditioner is 'mg' and 'telescope' is used
>>>> as the preconditioner at the coarse mesh.
>>>> >>>>>> The system gives me the "Out of Memory"
>>>> error before the linear system is completely
>>>> solved.
>>>> >>>>>> The info from '-ksp_view_pre' is
>>>> attached. It seems to me that the error occurs
>>>> when it reaches the coarse mesh.
>>>> >>>>>>
>>>> >>>>>> The 2nd test uses a grid of 1536*128*384
>>>> and process mesh is 96*8*24. The 3rd
>>>> test uses the same grid but a different
>>>> process mesh 48*4*12.
>>>> >>>>>> Are you sure this is right? The total
>>>> matrix and vector memory usage goes from 2nd test
>>>> >>>>>> Vector 384 383
>>>> 8,193,712 0.
>>>> >>>>>> Matrix 103 103
>>>> 11,508,688 0.
>>>> >>>>>> to 3rd test
>>>> >>>>>> Vector 384 383
>>>> 1,590,520 0.
>>>> >>>>>> Matrix 103 103
>>>> 3,508,664 0.
>>>> >>>>>> that is the memory usage got smaller but
>>>> if you have only 1/8th the processes and the
>>>> same grid it should have gotten about 8 times
>>>> bigger. Did you maybe cut the grid by a factor
>>>> of 8 also? If so that still doesn't explain it
>>>> because the memory usage changed by a factor of
>>>> 5 something for the vectors and 3 something for
>>>> the matrices.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> The linear solver and petsc options in the
>>>> 2nd and 3rd tests are the same as in the 1st test. The
>>>> linear solver works fine in both tests.
>>>> >>>>>> I attached the memory usage of the 2nd
>>>> and 3rd tests. The memory info is from the
>>>> option '-log_summary'. I tried to use
>>>> '-memory_info' as you suggested, but in my case
>>>> petsc treated it as an unused option. It output
>>>> nothing about the memory. Do I need to add something
>>>> to my code so I can use '-memory_info'?
>>>> >>>>>> Sorry, my mistake the option is
>>>> -memory_view
>>>> >>>>>>
>>>> >>>>>> Can you run the one case with
>>>> -memory_view and -mg_coarse jacobi -ksp_max_it
>>>> 1 (just so it doesn't iterate forever) to see
>>>> how much memory is used without the telescope?
>>>> Also run case 2 the same way.
>>>> >>>>>>
>>>> >>>>>> Barry
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> In both tests the memory usage is not large.
>>>> >>>>>>
>>>> >>>>>> It seems to me that it might be the
>>>> 'telescope' preconditioner that allocated a lot
>>>> of memory and caused the error in the 1st test.
>>>> >>>>>> Is there a way to show how much
>>>> memory it allocated?
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>>
>>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>> >>>>>> Frank,
>>>> >>>>>>
>>>> >>>>>> You can run with -ksp_view_pre to have
>>>> it "view" the KSP before the solve so hopefully
>>>> it gets that far.
>>>> >>>>>>
>>>> >>>>>> Please run the problem that does fit
>>>> with -memory_info when the problem completes it
>>>> will show the "high water mark" for PETSc
>>>> allocated memory and total memory used. We
>>>> first want to look at these numbers to see if
>>>> it is using more memory than you expect. You
>>>> could also run with say half the grid spacing
>>>> to see how the memory usage scaled with the
>>>> increase in grid points. Make the runs also
>>>> with -log_view and send all the output from
>>>> these options.
>>>> >>>>>>
>>>> >>>>>> Barry
>>>> >>>>>>
>>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>>> <hengjiew at uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> I am using the CG ksp solver and
>>>> Multigrid preconditioner to solve a linear
>>>> system in parallel.
>>>> >>>>>> I chose to use the 'Telescope' as the
>>>> preconditioner on the coarse mesh for its good
>>>> performance.
>>>> >>>>>> The petsc options file is attached.
>>>> >>>>>>
>>>> >>>>>> The domain is a 3d box.
>>>> >>>>>> It works well when the grid is
>>>> 1536*128*384 and the process mesh is 96*8*24.
>>>> When I double the size of grid and
>>>> keep the same process mesh and petsc
>>>> options, I get an "out of memory" error from
>>>> the super-cluster I am using.
>>>> >>>>>> Each process has access to at least 8G
>>>> memory, which should be more than enough for my
>>>> application. I am sure that all the other parts
>>>> of my code( except the linear solver ) do not
>>>> use much memory. So I doubt if there is
>>>> something wrong with the linear solver.
>>>> >>>>>> The error occurs before the linear
>>>> system is completely solved so I don't have the
>>>> info from ksp view. I am not able to re-produce
>>>> the error with a smaller problem either.
>>>> >>>>>> In addition, I tried to use the block
>>>> jacobi as the preconditioner with the same grid
>>>> and same decomposition. The linear solver runs
>>>> extremely slow but there is no memory error.
>>>> >>>>>>
>>>> >>>>>> How can I diagnose what exactly causes
>>>> the error?
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>> <petsc_options.txt>
>>>> >>>>>>
>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>>> >
>>>>
>>>
>>
>>
>
-------------- next part --------------
Linear solve converged due to CONVERGED_RTOL iterations 7
KSP Object: 4096 MPI processes
type: cg
maximum iterations=10000
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using UNPRECONDITIONED norm type for convergence test
PC Object: 4096 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=6 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_) 4096 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_) 4096 MPI processes
type: telescope
Telescope: parent comm size reduction factor = 64
Telescope: comm_size = 4096 , subcomm_size = 64
Telescope: subcomm type: interlaced
Telescope: DMDA detected
DMDA Object: (mg_coarse_telescope_repart_) 64 MPI processes
M 32 N 32 P 32 m 4 n 4 p 4 dof 1 overlap 1
KSP Object: (mg_coarse_telescope_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_) 64 MPI processes
type: mg
MG: type is MULTIPLICATIVE, levels=3 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
Coarse grid solver -- level -------------------------------
KSP Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_) 64 MPI processes
type: redundant
Redundant preconditioner: First (color=0) of 64 PCs follows
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_1_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=4096, cols=4096
total: nonzeros=110592, allocated nonzeros=110592
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_levels_2_) 64 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 64 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
type: lu
out-of-place factorization
tolerance for zero pivot 2.22045e-14
using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
matrix ordering: nd
factor fill ratio given 5., needed 8.69575
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
package used to perform factorization: petsc
total: nonzeros=120210, allocated nonzeros=120210
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=512, cols=512
total: nonzeros=13824, allocated nonzeros=13824
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=32768, cols=32768
total: nonzeros=884736, allocated nonzeros=884736
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1 -------------------------------
KSP Object: (mg_levels_1_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_1_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=262144, cols=262144
total: nonzeros=7077888, allocated nonzeros=7077888
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2 -------------------------------
KSP Object: (mg_levels_2_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_2_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=2097152, cols=2097152
total: nonzeros=56623104, allocated nonzeros=56623104
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3 -------------------------------
KSP Object: (mg_levels_3_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_3_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=16777216, cols=16777216
total: nonzeros=452984832, allocated nonzeros=452984832
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4 -------------------------------
KSP Object: (mg_levels_4_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_4_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=134217728, cols=134217728
total: nonzeros=3623878656, allocated nonzeros=3623878656
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 5 -------------------------------
KSP Object: (mg_levels_5_) 4096 MPI processes
type: richardson
Richardson: damping factor=1.
maximum iterations=1
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (mg_levels_5_) 4096 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix = precond matrix:
Mat Object: 4096 MPI processes
type: mpiaij
rows=1073741824, cols=1073741824
total: nonzeros=7516192768, allocated nonzeros=7516192768
total number of mallocs used during MatSetValues calls =0
has attached null space
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log_view.tar.gz
Type: application/gzip
Size: 25187 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20161006/a198348a/attachment-0001.bin>
-------------- next part --------------
-ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_rtol 1e-7
-options_left
-ksp_initial_guess_nonzero yes
-ksp_converged_reason
-ppe_max_iter 20
-pc_type mg
-pc_mg_galerkin
-pc_mg_levels 6
-mg_levels_ksp_type richardson
-mg_levels_ksp_max_it 1
-mg_coarse_ksp_type preonly
-mg_coarse_pc_type telescope
-mg_coarse_pc_telescope_reduction_factor 64
-matrap 0
-matptap_scalable
-memory_view
-log_view
-options_left 1
# Setting dmdarepart on subcomm
-mg_coarse_telescope_ksp_type preonly
-mg_coarse_telescope_pc_type mg
-mg_coarse_telescope_pc_mg_galerkin
-mg_coarse_telescope_pc_mg_levels 3
-mg_coarse_telescope_mg_levels_ksp_max_it 1
-mg_coarse_telescope_mg_levels_ksp_type richardson
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type redundant
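
As a sanity check on the options above, the level sizes reported in the ksp_view output can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes a 1024^3 grid coarsened uniformly by a factor of 2 in each direction per level, and the helper name mg_hierarchy is hypothetical, not part of PETSc.

```python
def mg_hierarchy(fine_points_per_dim, levels, coarsen=2, dim=3):
    """Rows per multigrid level, coarsest first.

    Assumes uniform coarsening by `coarsen` in each of `dim` directions
    (an assumption for illustration, not something read from PETSc).
    """
    sizes = []
    n = fine_points_per_dim
    for _ in range(levels):
        sizes.append(n ** dim)
        n //= coarsen
    return sizes[::-1]


# Outer hierarchy on 4096 processes: -pc_mg_levels 6 on a 1024^3 grid.
outer = mg_hierarchy(1024, 6)

# -mg_coarse_pc_telescope_reduction_factor 64 moves the coarse problem
# from 4096 processes onto 4096 / 64 processes.
sub_procs = 4096 // 64

# Inner hierarchy on the sub-communicator: -mg_coarse_telescope_pc_mg_levels 3,
# starting from the 32^3 coarse grid of the outer hierarchy.
inner = mg_hierarchy(32, 3)

print(outer)      # rows on outer levels 0..5
print(sub_procs)  # processes in the telescope sub-communicator
print(inner)      # rows on the sub-communicator levels 0..2
```

Under these assumptions the outer levels run from 32768 rows up to 1073741824 rows, the sub-communicator has 64 processes, and the coarsest inner level has 512 rows, which is the size of the matrix factored by the redundant LU solve in the ksp_view output.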