[petsc-users] Question about memory usage in Multigrid preconditioner

Dave May dave.mayhem23 at gmail.com
Wed Jul 13 19:47:31 CDT 2016


On 14 July 2016 at 01:07, frank <hengjiew at uci.edu> wrote:

> Hi Dave,
>
> Sorry for the late reply.
> Thank you so much for your detailed reply.
>
> I have a question about the estimation of the memory usage. There are
> 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is
> used. So the memory per process is:
>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
> Did I do something wrong here? Because this seems too small.
>

No - I totally f***ed it up. You are correct. That'll teach me for fumbling
around with my iPhone calculator and not using my brain. (Note that to
convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot
convert between units correctly....)
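
For the record, spelled out:

  4223139840 nonzeros * 8 bytes  ~ 33.8 GB of matrix (value) storage in total
  33.8 GB / 18432 ranks          ~ 1.8 MB per rank  (1.74 MiB if you divide by 1024^2)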

From the PETSc objects associated with the solver, it looks like it
_should_ run with 2 GB per MPI rank. Sorry for my mistake. Possibilities
are: somewhere in your usage of PETSc you've introduced a memory leak;
PETSc is doing a huge over-allocation (e.g. as per our discussion of
MatPtAP); or in your application code there are other objects you have
forgotten to log the memory for.
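
If it is the last of these, a quick way to narrow it down is to bracket the
suspect stages of your application with PETSc's memory queries. A minimal,
untested sketch against the 3.7-era C API (the helper name is just for
illustration):

  #include <petscsys.h>

  /* Print the OS resident set size and the bytes currently allocated via
     PetscMalloc at a labelled checkpoint. PetscPrintf reports rank 0 only;
     reduce across ranks if you want the maximum. */
  static PetscErrorCode ReportMemory(const char *label)
  {
    PetscLogDouble rss,mal;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);
    ierr = PetscMallocGetCurrentUsage(&mal);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,"[%s] rss %.1f MB, PetscMalloc'd %.1f MB\n",
                       label,rss/1.0e6,mal/1.0e6);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

Calling that before and after the solver setup/solve, and around your own big
allocations, should show which stage owns the growth.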



> I am running this job on Blue Waters
> <https://bluewaters.ncsa.illinois.edu/user-guide>
>
> I am using the 7-point FD stencil in 3D.
>

I thought so on both counts.


>
> I apologize that I made a stupid mistake in computing the memory per core.
> My settings meant each core could access only 2 GB of memory on average,
> instead of the 8 GB I mentioned in my previous email. I re-ran the job with
> 8 GB of memory per core on average and there was no "Out Of Memory" error.
> I will do more tests to see if there is still a memory issue.
>

Ok. I'd still like to know where the memory was being used since my
estimates were off.
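
When you do re-run, including something like

  -memory_view -log_view -ksp_view

alongside your existing options would give us the PETSc high-water mark, the
per-object memory, and the actual solver configuration to compare against the
estimates below.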


Thanks,
  Dave


>
> Regards,
> Frank
>
>
>
> On 07/11/2016 01:18 PM, Dave May wrote:
>
> Hi Frank,
>
>
> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu> wrote:
>
>> Hi Dave,
>>
>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh
>> of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The
>> petsc option file is attached.
>> I still got the "Out Of Memory" error. The error occurred before the
>> linear solver finished one step, so I don't have the full info from
>> ksp_view. The info from ksp_view_pre is attached.
>>
>
> Okay - that is essentially useless (sorry)
>
>
>>
>> It seems to me that the error occurred when the decomposition was going
>> to be changed.
>>
>
> Based on what information?
> Running with -info would give us more clues, but will create a ton of
> output.
> Please try running the case which failed with -info
>
>
>> I ran another test with a grid of 1536*128*384 and the same process mesh
>> as above. There was no error. The ksp_view info is attached for comparison.
>> Thank you.
>>
>
>
> [3] Here is my crude estimate of your memory usage.
> I'll target only the biggest memory hogs to get an order-of-magnitude
> estimate.
>
> * The Fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI
> rank assuming double precision.
> The indices for the AIJ could amount to another 0.3 GB (assuming 32 bit
> integers)
>
> * You use 5 levels of coarsening, so the other operators should represent
> (collectively)
> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per MPI rank on the
> communicator with 18432 ranks.
> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator
> with 18432 ranks.
>
> * You use a reduction factor of 64, giving a new communicator with 288
> MPI ranks.
> PCTelescope will first gather a temporary matrix associated with your
> coarse level operator assuming a comm size of 288 living on the comm with
> size 18432.
> This matrix will require approximately 0.5 * 64 = 32 MB per core on the
> 288 ranks.
> This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus
> requiring another 32 MB per rank.
> The temporary matrix is now destroyed.
>
> * Because a DMDA is detected, a permutation matrix is assembled.
> This requires 2 doubles per point in the DMDA.
> Your coarse DMDA contains 92 x 16 x 48 points.
> Thus the permutation matrix will require < 1 MB per MPI rank on the
> sub-comm.
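>
> (Spelled out: 92 x 16 x 48 = 70656 points, and 70656 * 2 doubles * 8 bytes is
> roughly 1.1 MB in total across the sub-comm, i.e. only a few KB per rank.)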
>
> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting
> operator will have the same memory footprint as the unpermuted matrix (32
> MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held
> in memory when the DMDA is provided.
>
> From my rough estimates, the worst-case memory footprint for any given
> core, given your options, is approximately
> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
> This is way below 8 GB.
>
> Note this estimate completely ignores:
> (1) the memory required for the restriction operator,
> (2) the potential growth in the number of non-zeros per row due to
> Galerkin coarsening (I wish -ksp_view_pre reported the output from
> MatView so we could see the number of non-zeros required by the coarse
> level operators),
> (3) all temporary vectors required by the CG solver, and those required by
> the smoothers.
> (4) internal memory allocated by MatPtAP
> (5) memory associated with IS's used within PCTelescope
>
> So either I am completely off in my estimates, or you have not carefully
> estimated the memory usage of your application code. Hopefully others might
> examine/correct my rough estimates.
>
> Since I don't have your code I cannot assess the latter.
> Since I don't have access to the same machine you are running on, I think
> we need to take a step back.
>
> [1] What machine are you running on? Send me a URL if it's available.
>
> [2] What discretization are you using? (I am guessing a scalar 7-point FD
> stencil)
> If it's a 7-point FD stencil, we should be able to examine the memory
> usage of your solver configuration using a standard, lightweight existing
> PETSc example, run on your machine at the same scale.
> This would hopefully enable us to correctly evaluate the actual memory
> usage required by the solver configuration you are using.
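>
> To be concrete (treat this as a sketch from memory, not a recipe): I believe
> ex45 from the KSP tutorials solves a 3D Poisson problem with a 7-point stencil
> on a DMDA, so something along the lines of
>
>   mpiexec -n 18432 ./ex45 -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 \
>     -ksp_type cg -pc_type mg -pc_mg_levels <nlevels> \
>     -mg_coarse_pc_type telescope -mg_coarse_pc_telescope_reduction_factor 64 \
>     -memory_view -log_view
>
> together with the rest of your telescope/coarse-level options should exercise
> essentially the same code path and let us measure the solver's memory
> footprint directly on your machine.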
>
> Thanks,
>   Dave
>
>
>>
>>
>> Frank
>>
>>
>>
>>
>> On 07/08/2016 10:38 PM, Dave May wrote:
>>
>>
>>
>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu> wrote:
>>
>>> Hi Barry and Dave,
>>>
>>> Thank both of you for the advice.
>>>
>>> @Barry
>>> I made a mistake in the file names in the last email. I attached the correct
>>> files this time.
>>> For all three tests, 'Telescope' is used as the coarse
>>> preconditioner.
>>>
>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12
>>> Part of the memory usage:  Vector   125   124   3971904   0.
>>>                            Matrix   101   101   9462372   0.
>>>
>>> == Test2:   Grid: 1536*128*384,   Process Mesh: 96*8*24
>>> Part of the memory usage:  Vector   125   124   681672    0.
>>>                            Matrix   101   101   1462180   0.
>>>
>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my
>>> case, it is about 6 times.
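>>> (From the numbers above: 3971904 / 681672 ~ 5.8 for the vectors and
>>> 9462372 / 1462180 ~ 6.5 for the matrices.)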
>>>
>>> == Test3: Grid: 3072*256*768,   Process Mesh: 96*8*24. Sub-domain per
>>> process: 32*32*32
>>> Here I get the out of memory error.
>>>
>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set
>>> -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>> The linear solver didn't work in this case; PETSc output some errors.
>>>
>>> @Dave
>>> In test3, I use only one instance of 'Telescope'. On the coarse mesh of
>>> 'Telescope', I used LU as the preconditioner instead of SVD.
>>> If I set the levels correctly, then on the last coarse mesh of MG where
>>> it calls 'Telescope', the sub-domain per process is 2*2*2.
>>> On the last coarse mesh of 'Telescope', there is only one grid point per
>>> process.
>>> I still got the OOM error. The detailed petsc option file is attached.
>>
>>
>> Do you understand the expected memory usage for the particular parallel
>> LU implementation you are using? I don't (seriously). Replace LU with
>> bjacobi and re-run this test. My point about solver debugging is still
>> valid.
>>
>> And please send the result of KSPView so we can see what is actually used
>> in the computations
>>
>> Thanks
>>   Dave
>>
>>
>>>
>>>
>>> Thank you so much.
>>>
>>> Frank
>>>
>>>
>>>
>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>
>>>> On Jul 6, 2016, at 4:19 PM, frank <hengjiew at uci.edu> wrote:
>>>>>
>>>>> Hi Barry,
>>>>>
>>>>> Thank you for your advice.
>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the
>>>>> process mesh is 96*8*24.
>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope'
>>>>> is used as the preconditioner on the coarse mesh.
>>>>> The system gives me the "Out of Memory" error before the linear system
>>>>> is completely solved.
>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the
>>>>> error occurs when it reaches the coarse mesh.
>>>>>
>>>>> The 2nd test uses a grid of 1536*128*384 and a process mesh of 96*8*24.
>>>>> The 3rd test uses the same grid but a different process mesh, 48*4*12.
>>>>>
>>>>     Are you sure this is right? The total matrix and vector memory
>>>> usage goes from 2nd test
>>>>                Vector   384            383      8,193,712     0.
>>>>                Matrix   103            103     11,508,688     0.
>>>> to 3rd test
>>>>               Vector   384            383      1,590,520     0.
>>>>                Matrix   103            103      3,508,664     0.
>>>> that is, the memory usage got smaller, but if you have only 1/8th the
>>>> processes and the same grid it should have gotten about 8 times bigger. Did
>>>> you maybe cut the grid by a factor of 8 also? If so, that still doesn't
>>>> explain it, because the memory usage changed by a factor of 5-something for
>>>> the vectors and 3-something for the matrices.
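>>>> (Spelled out: 8,193,712 / 1,590,520 ~ 5.2 for the vectors and
>>>> 11,508,688 / 3,508,664 ~ 3.3 for the matrices.)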
>>>>
>>>>
>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same
>>>>> as in the 1st test. The linear solver works fine in both tests.
>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info
>>>>> is from the option '-log_summary'. I tried to use '-memory_info' as you
>>>>> suggested, but in my case petsc treated it as an unused option. It output
>>>>> nothing about the memory. Do I need to add something to my code so I can
>>>>> use '-memory_info'?
>>>>>
>>>>     Sorry, my mistake: the option is -memory_view
>>>>
>>>>    Can you run the one case with -memory_view and -mg_coarse jacobi
>>>> -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory
>>>> is used without the telescope? Also run case 2 the same way.
>>>>
>>>>    Barry
>>>>
>>>>
>>>>
>>>>> In both tests the memory usage is not large.
>>>>>
>>>>> It seems to me that it might be the 'telescope' preconditioner that
>>>>> allocated a lot of memory and caused the error in the 1st test.
>>>>> Is there a way to show how much memory it allocated?
>>>>>
>>>>> Frank
>>>>>
>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>>>
>>>>>>    Frank,
>>>>>>
>>>>>>      You can run with -ksp_view_pre to have it "view" the KSP before
>>>>>> the solve so hopefully it gets that far.
>>>>>>
>>>>>>       Please run the problem that does fit with -memory_info; when the
>>>>>> problem completes it will show the "high water mark" for PETSc allocated
>>>>>> memory and total memory used. We first want to look at these numbers to see
>>>>>> if it is using more memory than you expect. You could also run with, say,
>>>>>> half the grid spacing to see how the memory usage scales with the increase
>>>>>> in grid points. Make the runs also with -log_view and send all the output
>>>>>> from these options.
>>>>>>
>>>>>>     Barry
>>>>>>
>>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengjiew at uci.edu> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using the CG ksp solver and multigrid preconditioner to solve
>>>>>>> a linear system in parallel.
>>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse
>>>>>>> mesh for its good performance.
>>>>>>> The petsc options file is attached.
>>>>>>>
>>>>>>> The domain is a 3d box.
>>>>>>> It works well when the grid is 1536*128*384 and the process mesh is
>>>>>>> 96*8*24. When I double the size of the grid and keep the same process
>>>>>>> mesh and petsc options, I get an "out of memory" error from the
>>>>>>> super-cluster I am using.
>>>>>>> Each process has access to at least 8 GB of memory, which should be more
>>>>>>> than enough for my application. I am sure that all the other parts of my
>>>>>>> code (except the linear solver) do not use much memory. So I suspect that
>>>>>>> there is something wrong with the linear solver.
>>>>>>> The error occurs before the linear system is completely solved, so I
>>>>>>> don't have the info from ksp_view. I am not able to reproduce the error
>>>>>>> with a smaller problem either.
>>>>>>> In addition, I tried to use block jacobi as the preconditioner
>>>>>>> with the same grid and the same decomposition. The linear solver runs
>>>>>>> extremely slowly, but there is no memory error.
>>>>>>>
>>>>>>> How can I diagnose what exactly causes the error?
>>>>>>> Thank you so much.
>>>>>>>
>>>>>>> Frank
>>>>>>> <petsc_options.txt>
>>>>>>>
>>>>>>
>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>>
>>>>
>>>
>>
>
>