[petsc-users] Question about memory usage in Multigrid preconditioner

Wed Sep 14 20:44:03 CDT 2016

Hi Frank,

On Thursday, 15 September 2016, frank <hengjiew at uci.edu> wrote:

> Hi,
>
> I write a simple code to re-produce the error. I hope this can help to
> diagnose the problem.
> The code just solves a 3d poisson equation.
> I run the code on a 1024^3 mesh. The process partition is 32 * 32 * 32.
> That's when I re-produce the OOM error. Each core has about 2G memory.
> I also run the code on a 512^3 mesh with 16 * 16 * 16 processes. The ksp
> solver works fine.
>

Perfect! That's very helpful, I can use this to track down here the issue
is coming from. Give me some time to figure this out.

Thanks,
  Dave

>
> I attached the code, ksp_view_pre's output and my petsc option file.
>
> Thank you.
> Frank
>
> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>
> Hi Barry,
>
> I checked. On the supercomputer, I had the option "-ksp_view_pre" but it
> is not in file I sent you. I am sorry for the confusion.
>
> Regards,
> Frank
>
> On Friday, September 9, 2016, Barry Smith <bsmith at mcs.anl.gov
> <javascript:_e(%7B%7D,'cvml','bsmith at mcs.anl.gov');>> wrote:
>
>>
>> > On Sep 9, 2016, at 3:11 PM, frank <hengjiew at uci.edu> wrote:
>> >
>> > Hi Barry,
>> >
>> > I think the first KSP view output is from -ksp_view_pre. Before I
>> submitted the test, I was not sure whether there would be OOM error or not.
>> So I added both -ksp_view_pre and -ksp_view.
>>
>>   But the options file you sent specifically does NOT list the
>> -ksp_view_pre so how could it be from that?
>>
>>    Sorry to be pedantic but I've spent too much time in the past trying
>> to debug from incorrect information and want to make sure that the
>> information I have is correct before thinking. Please recheck exactly what
>> happened. Rerun with the exact input file you emailed if that is needed.
>>
>>    Barry
>>
>> >
>> > Frank
>> >
>> >
>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>> >>   Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt
>> has only one KSPView in it? Did you run two different solves in the 2 case
>> but not the one?
>> >>
>> >>   Barry
>> >>
>> >>
>> >>
>> >>> On Sep 9, 2016, at 10:56 AM, frank <hengjiew at uci.edu> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I want to continue digging into the memory problem here.
>> >>> I did find a work around in the past, which is to use less cores per
>> node so that each core has 8G memory. However this is deficient and
>> expensive. I hope to locate the place that uses the most memory.
>> >>>
>> >>> Here is a brief summary of the tests I did in past:
>> >>>> Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12
>> >>> Maximum (over computational time) process memory:           total
>> 7.0727e+08
>> >>> Current process memory:
>>            total 7.0727e+08
>> >>> Maximum (over computational time) space PetscMalloc()ed:  total
>> 6.3908e+11
>> >>> Current space PetscMalloc()ed:
>>         total 1.8275e+09
>> >>>
>> >>>> Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24
>> >>> Maximum (over computational time) process memory:           total
>> 5.9431e+09
>> >>> Current process memory:
>>            total 5.9431e+09
>> >>> Maximum (over computational time) space PetscMalloc()ed:  total
>> 5.3202e+12
>> >>> Current space PetscMalloc()ed:
>>          total 5.4844e+09
>> >>>
>> >>>> Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24
>> >>>     OOM( Out Of Memory ) killer of the supercomputer terminated the
>> job during "KSPSolve".
>> >>>
>> >>> I attached the output of ksp_view( the third test's output is from
>> ksp_view_pre ), memory_view and also the petsc options.
>> >>>
>> >>> In all the tests, each core can access about 2G memory. In test3,
>> there are 4223139840 non-zeros in the matrix. This will consume about
>> 1.74M, using double precision. Considering some extra memory used to store
>> integer index, 2G memory should still be way enough.
>> >>>
>> >>> Is there a way to find out which part of KSPSolve uses the most
>> memory?
>> >>> Thank you so much.
>> >>>
>> >>> BTW, there are 4 options remains unused and I don't understand why
>> they are omitted:
>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>> >>>
>> >>>
>> >>> Regards,
>> >>> Frank
>> >>>
>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>> >>>>
>> >>>> On 14 July 2016 at 01:07, frank <hengjiew at uci.edu> wrote:
>> >>>> Hi Dave,
>> >>>>
>> >>>> Sorry for the late reply.
>> >>>> Thank you so much for your detailed reply.
>> >>>>
>> >>>> I have a question about the estimation of the memory usage. There
>> are 4223139840 allocated non-zeros and 18432 MPI processes. Double
>> precision is used. So the memory per process is:
>> >>>>   4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ?
>> >>>> Did I do sth wrong here? Because this seems too small.
>> >>>>
>> >>>> No - I totally f***ed it up. You are correct. That'll teach me for
>> fumbling around with my iphone calculator and not using my brain. (Note
>> that to convert to MB just divide by 1e6, not 1024^2 - although I
>> apparently cannot convert between units correctly....)
>> >>>>
>> >>>> From the PETSc objects associated with the solver, It looks like it
>> _should_ run with 2GB per MPI rank. Sorry for my mistake. Possibilities
>> are: somewhere in your usage of PETSc you've introduced a memory leak;
>> PETSc is doing a huge over allocation (e.g. as per our discussion of
>> MatPtAP); or in your application code there are other objects you have
>> forgotten to log the memory for.
>> >>>>
>> >>>>
>> >>>>
>> >>>> I am running this job on Bluewater
>> >>>> I am using the 7 points FD stencil in 3D.
>> >>>>
>> >>>> I thought so on both counts.
>> >>>>
>> >>>> I apologize that I made a stupid mistake in computing the memory per
>> core. My settings render each core can access only 2G memory on average
>> instead of 8G which I mentioned in previous email. I re-run the job with 8G
>> memory per core on average and there is no "Out Of Memory" error. I would
>> do more test to see if there is still some memory issue.
>> >>>>
>> >>>> Ok. I'd still like to know where the memory was being used since my
>> estimates were off.
>> >>>>
>> >>>>
>> >>>> Thanks,
>> >>>>   Dave
>> >>>>
>> >>>> Regards,
>> >>>> Frank
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>> >>>>> Hi Frank,
>> >>>>>
>> >>>>>
>> >>>>> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu> wrote:
>> >>>>> Hi Dave,
>> >>>>>
>> >>>>> I re-run the test using bjacobi as the preconditioner on the coarse
>> mesh of telescope. The Grid is 3072*256*768 and process mesh is 96*8*24.
>> The petsc option file is attached.
>> >>>>> I still got the "Out Of Memory" error. The error occurred before
>> the linear solver finished one step. So I don't have the full info from
>> ksp_view. The info from ksp_view_pre is attached.
>> >>>>>
>> >>>>> Okay - that is essentially useless (sorry)
>> >>>>>
>> >>>>> It seems to me that the error occurred when the decomposition was
>> going to be changed.
>> >>>>>
>> >>>>> Based on what information?
>> >>>>> Running with -info would give us more clues, but will create a ton
>> of output.
>> >>>>> Please try running the case which failed with -info
>> >>>>>  I had another test with a grid of 1536*128*384 and the same
>> process mesh as above. There was no error. The ksp_view info is attached
>> for comparison.
>> >>>>> Thank you.
>> >>>>>
>> >>>>>
>> >>>>> [3] Here is my crude estimate of your memory usage.
>> >>>>> I'll target the biggest memory hogs only to get an order of
>> magnitude estimate
>> >>>>>
>> >>>>> * The Fine grid operator contains 4223139840 non-zeros --> 1.8 GB
>> per MPI rank assuming double precision.
>> >>>>> The indices for the AIJ could amount to another 0.3 GB (assuming 32
>> bit integers)
>> >>>>>
>> >>>>> * You use 5 levels of coarsening, so the other operators should
>> represent (collectively)
>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per MPI rank on the
>> communicator with 18432 ranks.
>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the
>> communicator with 18432 ranks.
>> >>>>>
>> >>>>> * You use a reduction factor of 64, making the new communicator
>> with 288 MPI ranks.
>> >>>>> PCTelescope will first gather a temporary matrix associated with
>> your coarse level operator assuming a comm size of 288 living on the comm
>> with size 18432.
>> >>>>> This matrix will require approximately 0.5 * 64 = 32 MB per core on
>> the 288 ranks.
>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the
>> subcomm, thus require another 32 MB per rank.
>> >>>>> The temporary matrix is now destroyed.
>> >>>>>
>> >>>>> * Because a DMDA is detected, a permutation matrix is assembled.
>> >>>>> This requires 2 doubles per point in the DMDA.
>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>> >>>>> Thus the permutation matrix will require < 1 MB per MPI rank on the
>> sub-comm.
>> >>>>>
>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the
>> resulting operator will have the same memory footprint as the unpermuted
>> matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB
>> are held in memory when the DMDA is provided.
>> >>>>>
>> >>>>> From my rough estimates, the worst case memory foot print for any
>> given core, given your options is approximately
>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
>> >>>>> This is way below 8 GB.
>> >>>>>
>> >>>>> Note this estimate completely ignores:
>> >>>>> (1) the memory required for the restriction operator,
>> >>>>> (2) the potential growth in the number of non-zeros per row due to
>> Galerkin coarsening (I wished -ksp_view_pre reported the output from
>> MatView so we could see the number of non-zeros required by the coarse
>> level operators)
>> >>>>> (3) all temporary vectors required by the CG solver, and those
>> required by the smoothers.
>> >>>>> (4) internal memory allocated by MatPtAP
>> >>>>> (5) memory associated with IS's used within PCTelescope
>> >>>>>
>> >>>>> So either I am completely off in my estimates, or you have not
>> carefully estimated the memory usage of your application code. Hopefully
>> others might examine/correct my rough estimates
>> >>>>>
>> >>>>> Since I don't have your code I cannot access the latter.
>> >>>>> Since I don't have access to the same machine you are running on, I
>> think we need to take a step back.
>> >>>>>
>> >>>>> [1] What machine are you running on? Send me a URL if its available
>> >>>>>
>> >>>>> [2] What discretization are you using? (I am guessing a scalar 7
>> point FD stencil)
>> >>>>> If it's a 7 point FD stencil, we should be able to examine the
>> memory usage of your solver configuration using a standard, light weight
>> existing PETSc example, run on your machine at the same scale.
>> >>>>> This would hopefully enable us to correctly evaluate the actual
>> memory usage required by the solver configuration you are using.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>   Dave
>> >>>>>
>> >>>>>
>> >>>>> Frank
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>> >>>>>>
>> >>>>>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu> wrote:
>> >>>>>> Hi Barry and Dave,
>> >>>>>>
>> >>>>>> Thank both of you for the advice.
>> >>>>>>
>> >>>>>> @Barry
>> >>>>>> I made a mistake in the file names in last email. I attached the
>> correct files this time.
>> >>>>>> For all the three tests, 'Telescope' is used as the coarse
>> preconditioner.
>> >>>>>>
>> >>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12
>> >>>>>> Part of the memory usage:  Vector   125            124 3971904
>>  0.
>> >>>>>>                                              Matrix   101 101
>> 9462372     0
>> >>>>>>
>> >>>>>> == Test2: Grid: 1536*128*384,   Process Mesh: 96*8*24
>> >>>>>> Part of the memory usage:  Vector   125            124 681672
>>  0.
>> >>>>>>                                              Matrix   101 101
>> 1462180     0.
>> >>>>>>
>> >>>>>> In theory, the memory usage in Test1 should be 8 times of Test2.
>> In my case, it is about 6 times.
>> >>>>>>
>> >>>>>> == Test3: Grid: 3072*256*768,   Process Mesh: 96*8*24. Sub-domain
>> per process: 32*32*32
>> >>>>>> Here I get the out of memory error.
>> >>>>>>
>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set
>> -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>> >>>>>> The linear solver didn't work in this case. Petsc output some
>> errors.
>> >>>>>>
>> >>>>>> @Dave
>> >>>>>> In test3, I use only one instance of 'Telescope'. On the coarse
>> mesh of 'Telescope', I used LU as the preconditioner instead of SVD.
>> >>>>>> If my set the levels correctly, then on the last coarse mesh of MG
>> where it calls 'Telescope', the sub-domain per process is 2*2*2.
>> >>>>>> On the last coarse mesh of 'Telescope', there is only one grid
>> point per process.
>> >>>>>> I still got the OOM error. The detailed petsc option file is
>> attached.
>> >>>>>>
>> >>>>>> Do you understand the expected memory usage for the particular
>> parallel LU implementation you are using? I don't (seriously). Replace LU
>> with bjacobi and re-run this test. My point about solver debugging is still
>> valid.
>> >>>>>>
>> >>>>>> And please send the result of KSPView so we can see what is
>> actually used in the computations
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>   Dave
>> >>>>>>
>> >>>>>>
>> >>>>>> Thank you so much.
>> >>>>>>
>> >>>>>> Frank
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengjiew at uci.edu> wrote:
>> >>>>>>
>> >>>>>> Hi Barry,
>> >>>>>>
>> >>>>>> Thank you for you advice.
>> >>>>>> I tried three test. In the 1st test, the grid is 3072*256*768 and
>> the process mesh is 96*8*24.
>> >>>>>> The linear solver is 'cg' the preconditioner is 'mg' and
>> 'telescope' is used as the preconditioner at the coarse mesh.
>> >>>>>> The system gives me the "Out of Memory" error before the linear
>> system is completely solved.
>> >>>>>> The info from '-ksp_view_pre' is attached. I seems to me that the
>> error occurs when it reaches the coarse mesh.
>> >>>>>>
>> >>>>>> The 2nd test uses a grid of 1536*128*384 and process mesh is
>> 96*8*24. The 3rd                                             test uses the
>> same grid but a different process mesh 48*4*12.
>> >>>>>>     Are you sure this is right? The total matrix and vector memory
>> usage goes from 2nd test
>> >>>>>>                Vector   384            383      8,193,712     0.
>> >>>>>>                Matrix   103            103     11,508,688     0.
>> >>>>>> to 3rd test
>> >>>>>>               Vector   384            383      1,590,520     0.
>> >>>>>>                Matrix   103            103      3,508,664     0.
>> >>>>>> that is the memory usage got smaller but if you have only 1/8th
>> the processes and the same grid it should have gotten about 8 times bigger.
>> Did you maybe cut the grid by a factor of 8 also? If so that still doesn't
>> explain it because the memory usage changed by a factor of 5 something for
>> the vectors and 3 something for the matrices.
>> >>>>>>
>> >>>>>>
>> >>>>>> The linear solver and petsc options in 2nd and 3rd tests are the
>> same in 1st test. The linear solver works fine in both test.
>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory
>> info is from the option '-log_summary'. I tried to use '-momery_info' as
>> you suggested, but in my case petsc treated it as an unused option. It
>> output nothing about the memory. Do I need to add sth to my code so I can
>> use '-memory_info'?
>> >>>>>>     Sorry, my mistake the option is -memory_view
>> >>>>>>
>> >>>>>>    Can you run the one case with -memory_view and -mg_coarse
>> jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much
>> memory is used without the telescope? Also run case 2 the same way.
>> >>>>>>
>> >>>>>>    Barry
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> In both tests the memory usage is not large.
>> >>>>>>
>> >>>>>> It seems to me that it might be the 'telescope'  preconditioner
>> that allocated a lot of memory and caused the error in the 1st test.
>> >>>>>> Is there is a way to show how much memory it allocated?
>> >>>>>>
>> >>>>>> Frank
>> >>>>>>
>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>> >>>>>>    Frank,
>> >>>>>>
>> >>>>>>      You can run with -ksp_view_pre to have it "view" the KSP
>> before the solve so hopefully it gets that far.
>> >>>>>>
>> >>>>>>       Please run the problem that does fit with -memory_info when
>> the problem completes it will show the "high water mark" for PETSc
>> allocated memory and total memory used. We first want to look at these
>> numbers to see if it is using more memory than you expect. You could also
>> run with say half the grid spacing to see how the memory usage scaled with
>> the increase in grid points. Make the runs also with -log_view and send all
>> the output from these options.
>> >>>>>>
>> >>>>>>     Barry
>> >>>>>>
>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengjiew at uci.edu> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I am using the CG ksp solver and Multigrid preconditioner  to
>> solve a linear system in parallel.
>> >>>>>> I chose to use the 'Telescope' as the preconditioner on the coarse
>> mesh for its good performance.
>> >>>>>> The petsc options file is attached.
>> >>>>>>
>> >>>>>> The domain is a 3d box.
>> >>>>>> It works well when the grid is  1536*128*384 and the process mesh
>> is 96*8*24. When I double the size of grid and
>>                    keep the same process mesh and petsc options, I get an
>> "out of memory" error from the super-cluster I am using.
>> >>>>>> Each process has access to at least 8G memory, which should be
>> more than enough for my application. I am sure that all the other parts of
>> my code( except the linear solver ) do not use much memory. So I doubt if
>> there is something wrong with the linear solver.
>> >>>>>> The error occurs before the linear system is completely solved so
>> I don't have the info from ksp view. I am not able to re-produce the error
>> with a smaller problem either.
>> >>>>>> In addition,  I tried to use the block jacobi as the
>> preconditioner with the same grid and same decomposition. The linear solver
>> runs extremely slow but there is no memory error.
>> >>>>>>
>> >>>>>> How can I diagnose what exactly cause the error?
>> >>>>>> Thank you so much.
>> >>>>>>
>> >>>>>> Frank
>> >>>>>> <petsc_options.txt>
>> >>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc
>> _options.txt>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><m
>> emory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>> >
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20160915/3fa11ddc/attachment-0001.html>