[petsc-users] Question about memory usage in Multigrid preconditioner
frank
hengjiew at uci.edu
Wed Jul 13 18:07:51 CDT 2016
Hi Dave,
Sorry for the late reply, and thank you so much for your detailed response.
I have a question about the estimate of the memory usage. There are
4223139840 allocated non-zeros and 18432 MPI processes, and double
precision is used. So the memory per process is:
4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB ?
Did I do something wrong here? This seems too small.
I am running this job on Blue Waters
<https://bluewaters.ncsa.illinois.edu/user-guide>.
I am using a 7-point FD stencil in 3D.
I apologize for a careless mistake in computing the memory per
core. With my settings, each core could access only 2 GB of memory on
average, instead of the 8 GB I mentioned in the previous email. I re-ran
the job with 8 GB of memory per core on average and there was no "Out Of
Memory" error. I will run more tests to see whether there is still a
memory issue.
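To keep an eye on this in those tests, below is a minimal sketch (not taken
from my code) of how I plan to record each rank's memory high-water marks.
ReportMemory is a hypothetical helper name, and my understanding is that
PetscMemoryGetMaximumUsage() only reports a meaningful number if
PetscMemorySetGetMaximumUsage() was called earlier in the run:

  #include <petscsys.h>

  /* Hypothetical helper: print each rank's memory high-water marks.
     Call it after the solve, before PetscFinalize(). */
  PetscErrorCode ReportMemory(MPI_Comm comm)
  {
    PetscLogDouble rss,mal;
    PetscMPIInt    rank;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
    ierr = PetscMemoryGetMaximumUsage(&rss);CHKERRQ(ierr);  /* max resident set size seen so far */
    ierr = PetscMallocGetMaximumUsage(&mal);CHKERRQ(ierr);  /* max memory obtained through PetscMalloc */
    ierr = PetscSynchronizedPrintf(comm,"[%d] max rss %g bytes, max PetscMalloc %g bytes\n",rank,rss,mal);CHKERRQ(ierr);
    ierr = PetscSynchronizedFlush(comm,PETSC_STDOUT);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }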
Regards,
Frank
On 07/11/2016 01:18 PM, Dave May wrote:
> Hi Frank,
>
>
> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu> wrote:
>
> Hi Dave,
>
> I re-ran the test using bjacobi as the preconditioner on the
> coarse mesh of telescope. The grid is 3072*256*768 and the process
> mesh is 96*8*24. The PETSc options file is attached.
> I still got the "Out Of Memory" error. The error occurred before
> the linear solver finished one step, so I don't have the full info
> from ksp_view. The info from ksp_view_pre is attached.
>
>
> Okay - that is essentially useless (sorry)
>
>
> It seems to me that the error occurred when the decomposition was
> going to be changed.
>
>
> Based on what information?
> Running with -info would give us more clues, but will create a ton of
> output.
> Please try running the case which failed with -info.
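> For example, something like this (a sketch; the executable name and the
> options-file name are placeholders for whatever you actually run):
>
>     mpiexec -n 18432 ./your_app -options_file petsc_options.txt -info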
>
> I ran another test with a grid of 1536*128*384 and the same
> process mesh as above. There was no error. The ksp_view info is
> attached for comparison.
> Thank you.
>
>
>
> [3] Here is my crude estimate of your memory usage.
> I'll target the biggest memory hogs only to get an order of magnitude
> estimate
>
> * The Fine grid operator contains 4223139840 non-zeros --> 1.8 GB per
> MPI rank assuming double precision.
> The indices for the AIJ could amount to another 0.3 GB (assuming 32
> bit integers)
>
> * You use 5 levels of coarsening, so the other operators should
> represent (collectively)
> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the
> communicator with 18432 ranks.
> The coarse grid should consume ~ 0.5 MB per MPI rank on the
> communicator with 18432 ranks.
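> (Expanding the geometric sum above, with the ~2.1 GB per-rank fine-level
> operator it assumes:
> 2100/8 + 2100/64 + 2100/512 + 2100/4096 ~ 262 + 33 + 4 + 0.5 ~ 300 MB.)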
>
> * You use a reduction factor of 64, making the new communicator with
> 288 MPI ranks.
> PCTelescope will first gather a temporary matrix associated with your
> coarse level operator assuming a comm size of 288 living on the comm
> with size 18432.
> This matrix will require approximately 0.5 * 64 = 32 MB per core on
> the 288 ranks.
> This matrix is then used to form a new MPIAIJ matrix on the subcomm,
> thus requiring another 32 MB per rank.
> The temporary matrix is now destroyed.
>
> * Because a DMDA is detected, a permutation matrix is assembled.
> This requires 2 doubles per point in the DMDA.
> Your coarse DMDA contains 92 x 16 x 48 points.
> Thus the permutation matrix will require < 1 MB per MPI rank on the
> sub-comm.
>
> * Lastly, the matrix is permuted. This uses MatPtAP(), but the
> resulting operator will have the same memory footprint as the
> unpermuted matrix (32 MB). At any stage in PCTelescope, only 2
> operators of size 32 MB are held in memory when the DMDA is provided.
>
> From my rough estimates, the worst-case memory footprint for any
> given core, given your options, is approximately
> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB.
> This is way below 8 GB.
>
> Note this estimate completely ignores:
> (1) the memory required for the restriction operator,
> (2) the potential growth in the number of non-zeros per row due to
> Galerkin coarsening (I wish -ksp_view_pre reported the output from
> MatView so we could see the number of non-zeros required by the coarse
> level operators)
> (3) all temporary vectors required by the CG solver, and those
> required by the smoothers.
> (4) internal memory allocated by MatPtAP
> (5) memory associated with IS's used within PCTelescope
>
> So either I am completely off in my estimates, or you have not
> carefully estimated the memory usage of your application code.
> Hopefully others might examine/correct my rough estimates.
>
> Since I don't have your code I cannot assess the latter.
> Since I don't have access to the same machine you are running on, I
> think we need to take a step back.
>
> [1] What machine are you running on? Send me a URL if it's available.
>
> [2] What discretization are you using? (I am guessing a scalar 7-point
> FD stencil.)
> If it's a 7-point FD stencil, we should be able to examine the memory
> usage of your solver configuration using a standard, lightweight
> existing PETSc example, run on your machine at the same scale.
> This would hopefully enable us to correctly evaluate the actual memory
> usage required by the solver configuration you are using.
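> As a sketch of what I have in mind (assumptions: the 7-point guess above is
> right, and KSP tutorial ex45, which solves a 3D Laplacian on a DMDA, accepts
> the usual -da_grid_* options; I have not run this at your scale):
>
>     mpiexec -n 18432 ./ex45 -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 \
>         -options_file petsc_options.txt -memory_view -log_view
>
> i.e. your exact solver options applied to a known, lightweight operator.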
>
> Thanks,
> Dave
>
>
>
> Frank
>
>
>
>
> On 07/08/2016 10:38 PM, Dave May wrote:
>>
>>
>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu> wrote:
>>
>> Hi Barry and Dave,
>>
>> Thank both of you for the advice.
>>
>> @Barry
>> I made a mistake in the file names in the last email. I attached
>> the correct files this time.
>> For all three tests, 'Telescope' is used as the coarse
>> preconditioner.
>>
>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>> Part of the memory usage: Vector 125 124 3971904 0.
>> Matrix 101 101 9462372 0
>>
>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>> Part of the memory usage: Vector 125 124 681672 0.
>> Matrix 101 101 1462180 0.
>>
>> In theory, the memory usage in Test1 should be 8 times that of
>> Test2. In my case, it is about 6 times.
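>> As a quick check of that factor from the numbers above:
>>   Vector: 3971904 / 681672 ~ 5.8;  Matrix: 9462372 / 1462180 ~ 6.5.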
>>
>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24.
>> Sub-domain per process: 32*32*32
>> Here I get the out of memory error.
>>
>> I tried to use -mg_coarse jacobi. In this way, I don't need
>> to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly,
>> right?
>> The linear solver didn't work in this case; PETSc printed some
>> errors.
>>
>> @Dave
>> In test3, I use only one instance of 'Telescope'. On the
>> coarse mesh of 'Telescope', I used LU as the preconditioner
>> instead of SVD.
>> If I set the levels correctly, then on the last coarse mesh
>> of MG, where it calls 'Telescope', the sub-domain per process
>> is 2*2*2.
>> On the last coarse mesh of 'Telescope', there is only one
>> grid point per process.
>> I still got the OOM error. The detailed petsc option file is
>> attached.
>>
>>
>> Do you understand the expected memory usage for the
>> particular parallel LU implementation you are using? I don't
>> (seriously). Replace LU with bjacobi and re-run this test. My
>> point about solver debugging is still valid.
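>> As a concrete sketch of that swap (the option names below assume the
>> usual "mg_coarse_" and "telescope_" prefixes in your setup; please
>> double-check them against -help output):
>>
>>     -mg_coarse_pc_type telescope
>>     -mg_coarse_pc_telescope_reduction_factor 64
>>     -mg_coarse_telescope_pc_type bjacobi
>>
>> i.e. keep everything else the same and only replace the lu you currently
>> request inside Telescope.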
>>
>> And please send the result of KSPView so we can see what is
>> actually used in the computations
>>
>> Thanks
>> Dave
>>
>>
>>
>> Thank you so much.
>>
>> Frank
>>
>>
>>
>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>
>> On Jul 6, 2016, at 4:19 PM, frank <hengjiew at uci.edu> wrote:
>>
>> Hi Barry,
>>
>> Thank you for your advice.
>> I tried three tests. In the 1st test, the grid is
>> 3072*256*768 and the process mesh is 96*8*24.
>> The linear solver is 'cg', the preconditioner is 'mg',
>> and 'telescope' is used as the preconditioner on the
>> coarse mesh.
>> The system gives me the "Out of Memory" error before
>> the linear system is completely solved.
>> The info from '-ksp_view_pre' is attached. It seems to
>> me that the error occurs when it reaches the coarse mesh.
>>
>> The 2nd test uses a grid of 1536*128*384 and a process
>> mesh of 96*8*24. The 3rd test uses the same grid but
>> a different process mesh, 48*4*12.
>>
>> Are you sure this is right? The total matrix and
>> vector memory usage goes from 2nd test
>> Vector 384 383 8,193,712 0.
>> Matrix 103 103 11,508,688 0.
>> to 3rd test
>> Vector 384 383 1,590,520 0.
>> Matrix 103 103 3,508,664 0.
>> that is, the memory usage got smaller, but if you have only
>> 1/8th the processes and the same grid it should have
>> gotten about 8 times bigger. Did you maybe cut the grid
>> by a factor of 8 also? If so, that still doesn't explain
>> it, because the memory usage changed by a factor of about 5
>> for the vectors and about 3 for the matrices.
>>
>>
>> The linear solver and PETSc options in the 2nd and 3rd
>> tests are the same as in the 1st test. The linear solver
>> works fine in both tests.
>> I attached the memory usage of the 2nd and 3rd tests.
>> The memory info is from the option '-log_summary'. I
>> tried to use '-memory_info' as you suggested, but in
>> my case PETSc treated it as an unused option. It
>> output nothing about the memory. Do I need to add something
>> to my code so I can use '-memory_info'?
>>
>> Sorry, my mistake; the option is -memory_view.
>>
>> Can you run the one case with -memory_view and
>> -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't
>> iterate forever) to see how much memory is used without
>> the telescope? Also run case 2 the same way.
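>> A sketch of such a run (the executable and options-file names are
>> placeholders, and by "-mg_coarse jacobi" I mean -mg_coarse_pc_type jacobi):
>>
>>     mpiexec -n <nranks> ./your_app -options_file petsc_options.txt \
>>         -memory_view -mg_coarse_pc_type jacobi -ksp_max_it 1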
>>
>> Barry
>>
>>
>>
>> In both tests the memory usage is not large.
>>
>> It seems to me that it might be the 'telescope'
>> preconditioner that allocated a lot of memory and
>> caused the error in the 1st test.
>> Is there a way to show how much memory it allocated?
>>
>> Frank
>>
>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>
>> Frank,
>>
>> You can run with -ksp_view_pre to have it
>> "view" the KSP before the solve so hopefully it
>> gets that far.
>>
>> Please run the problem that does fit with
>> -memory_info; when the problem completes it will
>> show the "high water mark" for PETSc-allocated
>> memory and total memory used. We first want to
>> look at these numbers to see if it is using more
>> memory than you expect. You could also run with,
>> say, half the grid spacing to see how the memory
>> usage scales with the increase in grid points.
>> Make the runs also with -log_view and send all
>> the output from these options.
>>
>> Barry
>>
>> On Jul 5, 2016, at 5:23 PM, frank <hengjiew at uci.edu> wrote:
>>
>> Hi,
>>
>> I am using the CG ksp solver and Multigrid
>> preconditioner to solve a linear system in
>> parallel.
>> I chose 'Telescope' as the preconditioner
>> on the coarse mesh for its good performance.
>> The petsc options file is attached.
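>> For reference, a sketch of the kind of option set I mean (the attached
>> file is the authoritative version; the values below, e.g. the number of
>> levels and the reduction factor, are illustrative rather than copied
>> from it):
>>
>>     -ksp_type cg
>>     -pc_type mg
>>     -pc_mg_levels 4
>>     -mg_coarse_pc_type telescope
>>     -mg_coarse_pc_telescope_reduction_factor 64
>>     -mg_coarse_telescope_ksp_type preonly
>>     -mg_coarse_telescope_pc_type <coarse solver of my choice>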
>>
>> The domain is a 3D box.
>> It works well when the grid is 1536*128*384
>> and the process mesh is 96*8*24. When I
>> double the size of the grid and keep the same
>> process mesh and PETSc options, I get an "out
>> of memory" error from the super-cluster I am
>> using.
>> Each process has access to at least 8 GB of
>> memory, which should be more than enough for
>> my application. I am sure that all the other
>> parts of my code (except the linear solver)
>> do not use much memory, so I suspect something
>> is wrong with the linear solver.
>> The error occurs before the linear system is
>> completely solved, so I don't have the info
>> from ksp_view. I am not able to reproduce
>> the error with a smaller problem either.
>> In addition, I tried to use block jacobi
>> as the preconditioner with the same grid and
>> the same decomposition. The linear solver runs
>> extremely slowly, but there is no memory error.
>>
>> How can I diagnose what exactly causes the error?
>> Thank you so much.
>>
>> Frank
>> <petsc_options.txt>
>>
>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>
>>
>
>