[petsc-users] Performance of the Telescope Multigrid Preconditioner

frank hengjiew at uci.edu
Tue Oct 4 15:26:09 CDT 2016


On 10/04/2016 01:20 PM, Matthew Knepley wrote:
> On Tue, Oct 4, 2016 at 3:09 PM, frank <hengjiew at uci.edu> wrote:
>
>     Hi Dave,
>
>     Thank you for the reply.
>     What do you mean by the "nested calls to KSPSolve"?
>
>
> KSPSolve is called again after redistributing the computation.

I am still confused. There is only one KSPSolve in my code.
Do you mean KSPSolve is called again on the sub-communicator? If that's 
the case, even if I put two identical KSPSolve calls in the code, the 
sub-communicator is still going to call KSPSolve, right?

>     I tried to call KSPSolve twice, but the second solve converged
>     in 0 iterations. KSPSolve seems to remember the solution. How can I
>     force both solves to start from the same initial guess?
>
>
> Did you zero the solution vector between solves? VecSet(x, 0.0);
>
>   Matt
>
>     Thank you.
>
>     Frank
>
>
>
>     On 10/04/2016 12:56 PM, Dave May wrote:
>>
>>
>>     On Tuesday, 4 October 2016, frank <hengjiew at uci.edu> wrote:
>>
>>         Hi,
>>
>>         This question is follow-up of the thread "Question about
>>         memory usage in Multigrid preconditioner".
>>         I used to have the "Out Of Memory" (OOM) problem when using
>>         the CG + Telescope MG solver with 32768 cores. Adding the
>>         options "-matrap 0 -matptap_scalable" solved that problem.
>>
>>         Then I tested the scalability by solving a 3D Poisson
>>         equation for one step. I used one sub-communicator in all the
>>         tests. The PETSc options differ between the tests only in: (1)
>>         the pc_telescope_reduction_factor; (2) the number of multigrid
>>         levels in the up/down solver. The function "ksp_solve" is
>>         timed. It is rather slow and doesn't scale at all.
>>
>>         Test1: 512^3 grid points
>>         Core#    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
>>         512        8      4 / 3    6.2466
>>         4096       64     5 / 3    0.9361
>>         32768      64     4 / 3    4.8914
>>
>>         Test2: 1024^3 grid points
>>         Core#    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
>>         4096       64     5 / 4    3.4139
>>         8192       128    5 / 4    2.4196
>>         16384      32     5 / 3    5.4150
>>         32768      64     5 / 3    5.6067
>>         65536      128    5 / 3    6.5219
>>
>>
>>     You have to be very careful how you interpret these numbers. Your
>>     solver contains nested calls to KSPSolve, and unfortunately as a
>>     result the numbers you report include setup time. This will
>>     remain true even if you call KSPSetUp on the outermost KSP.
>>
>>     Your email concerns scalability of the solver application, so
>>     let's focus on that issue.
>>
>>     The only way to clearly separate setup from solve time is
>>     to perform two identical solves. The second solve will not
>>     require any setup. You should monitor the second solve via a new
>>     PetscLogStage.
>>
>>     This was what I did in the telescope paper. It was the only way
>>     to understand the setup cost (and scaling) vs. the solve time (and
>>     scaling).
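>>
>>     As a minimal sketch of that pattern (variable names are
>>     illustrative; ksp, b and x stand for the outer solver, right-hand
>>     side and solution in your code, and ierr is the usual error-code
>>     variable):
>>
>>         PetscLogStage stage;
>>         ierr = PetscLogStageRegister("Second solve", &stage);CHKERRQ(ierr);
>>         ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* pays all setup costs */
>>         ierr = VecSet(x, 0.0);CHKERRQ(ierr);       /* restore the initial guess */
>>         ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>>         ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* setup-free solve */
>>         ierr = PetscLogStagePop();CHKERRQ(ierr);
>>
>>     The second solve then gets its own stage in the -log_summary
>>     (-log_view) output.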
>>
>>     Thanks
>>       Dave
>>
>>         I guess I didn't set the MG levels properly. What would be
>>         an efficient way to arrange the MG levels?
>>         Also, which preconditioner should I use at the coarse mesh of
>>         the 2nd communicator to improve the performance?
>>
>>         I attached the test code and the petsc options file for the
>>         1024^3 cube with 32768 cores.
>>
>>         Thank you.
>>
>>         Regards,
>>         Frank
>>
>>
>>
>>
>>
>>
>>         On 09/15/2016 03:35 AM, Dave May wrote:
>>>         Hi all,
>>>
>>>         The only unexpected memory usage I can see is associated
>>>         with the call to MatPtAP().
>>>         Here is something you can try immediately.
>>>         Run your code with the additional options
>>>           -matrap 0 -matptap_scalable
>>>
>>>         I didn't realize this before, but the default behaviour of
>>>         MatPtAP in parallel is actually to explicitly form the
>>>         transpose of P (i.e. assemble R = P^T) and then compute R.A.P.
>>>         You don't want to do this. The option -matrap 0 resolves
>>>         this issue.
>>>
>>>         The implementation of P^T.A.P has two variants.
>>>         The scalable implementation (with respect to memory usage)
>>>         is selected via the second option -matptap_scalable.
>>>
>>>         Try it out - I see a significant memory reduction using
>>>         these options for particular mesh sizes / partitions.
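>>>
>>>         For example (a sketch; the executable name and core count
>>>         are placeholders), the run would be launched as:
>>>
>>>           mpiexec -n 32768 ./your_app -matrap 0 -matptap_scalable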
>>>
>>>         I've attached a cleaned up version of the code you sent me.
>>>         There were a number of memory leaks and other issues.
>>>         The main points being
>>>           * You should call DMDAVecGetArrayF90() before
>>>         VecAssembly{Begin,End}
>>>           * You should call PetscFinalize(), otherwise the option
>>>         -log_summary (-log_view) will not display anything once the
>>>         program has completed.
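>>>
>>>         As a minimal sketch of that pattern in C (the attached code
>>>         is Fortran and uses DMDAVecGetArrayF90() analogously; the
>>>         grid size here is illustrative, PETSc 3.7-era API assumed):
>>>
>>>           #include <petscdmda.h>
>>>           int main(int argc, char **argv)
>>>           {
>>>             DM             da;
>>>             Vec            x;
>>>             PetscScalar ***a;
>>>             PetscInt       i, j, k, xs, ys, zs, xm, ym, zm;
>>>             PetscErrorCode ierr;
>>>
>>>             ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>             ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE,
>>>                      DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DMDA_STENCIL_STAR,
>>>                      64, 64, 64, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
>>>                      1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
>>>             ierr = DMCreateGlobalVector(da, &x);CHKERRQ(ierr);
>>>             /* access the array BEFORE VecAssembly{Begin,End} */
>>>             ierr = DMDAVecGetArray(da, x, &a);CHKERRQ(ierr);
>>>             ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);CHKERRQ(ierr);
>>>             for (k = zs; k < zs + zm; k++)
>>>               for (j = ys; j < ys + ym; j++)
>>>                 for (i = xs; i < xs + xm; i++) a[k][j][i] = 1.0;
>>>             ierr = DMDAVecRestoreArray(da, x, &a);CHKERRQ(ierr);
>>>             ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
>>>             ierr = VecAssemblyEnd(x);CHKERRQ(ierr);
>>>             ierr = VecDestroy(&x);CHKERRQ(ierr);
>>>             ierr = DMDestroy(&da);CHKERRQ(ierr);
>>>             /* without this, -log_summary (-log_view) prints nothing */
>>>             ierr = PetscFinalize();
>>>             return ierr;
>>>           }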
>>>
>>>
>>>         Thanks,
>>>           Dave
>>>
>>>
>>>         On 15 September 2016 at 08:03, Hengjie Wang
>>>         <hengjiew at uci.edu> wrote:
>>>
>>>             Hi Dave,
>>>
>>>             Sorry, I should have added more comments to explain the
>>>             code. The number of processes in each dimension is the
>>>             same: Px = Py = Pz = P. So is the domain size.
>>>             So if you want to run the code for 512^3 grid points on
>>>             16^3 cores, you need to set "-N 512 -P 16" on the command
>>>             line.
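>>>
>>>             For example (the executable name here is a placeholder),
>>>             that run would be launched as
>>>
>>>                 mpiexec -n 4096 ./test -N 512 -P 16
>>>
>>>             since 16^3 = 4096 cores.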
>>>             I added more comments and also fixed an error in the
>>>             attached code. (The error only affects the accuracy of
>>>             the solution, not the memory usage.)
>>>
>>>             Thank you.
>>>             Frank
>>>
>>>
>>>             On 9/14/2016 9:05 PM, Dave May wrote:
>>>>
>>>>
>>>>             On Thursday, 15 September 2016, Dave May
>>>>             <dave.mayhem23 at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>                 On Thursday, 15 September 2016, frank
>>>>                 <hengjiew at uci.edu> wrote:
>>>>
>>>>                     Hi,
>>>>
>>>>                     I wrote a simple code to reproduce the error.
>>>>                     I hope this can help to diagnose the problem.
>>>>                     The code just solves a 3D Poisson equation.
>>>>
>>>>
>>>>                 Why is the stencil width a runtime parameter? And
>>>>                 why is the default value 2? For a 7-point FD
>>>>                 Laplacian, you only need a stencil width of 1.
>>>>
>>>>                 Was this choice made to mimic something in the
>>>>                 real application code?
>>>>
>>>>
>>>>             Please ignore - I misunderstood your usage of the param
>>>>             set by -P
>>>>
>>>>
>>>>                     I ran the code on a 1024^3 mesh. The process
>>>>                     partition is 32 * 32 * 32. That's when I
>>>>                     reproduce the OOM error. Each core has about
>>>>                     2G memory.
>>>>                     I also ran the code on a 512^3 mesh with 16 *
>>>>                     16 * 16 processes. The KSP solver works fine.
>>>>                     I attached the code, ksp_view_pre's output and
>>>>                     my PETSc options file.
>>>>
>>>>                     Thank you.
>>>>                     Frank
>>>>
>>>>                     On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>>>                     Hi Barry,
>>>>>
>>>>>                     I checked. On the supercomputer, I had the
>>>>>                     option "-ksp_view_pre", but it is not in the
>>>>>                     file I sent you. I am sorry for the confusion.
>>>>>
>>>>>                     Regards,
>>>>>                     Frank
>>>>>
>>>>>                     On Friday, September 9, 2016, Barry Smith
>>>>>                     <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>
>>>>>                         > On Sep 9, 2016, at 3:11 PM, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >
>>>>>                         > Hi Barry,
>>>>>                         >
>>>>>                         > I think the first KSP view output is
>>>>>                         from -ksp_view_pre. Before I submitted the
>>>>>                         test, I was not sure whether there would
>>>>>                         be an OOM error or not. So I added both
>>>>>                         -ksp_view_pre and -ksp_view.
>>>>>
>>>>>                           But the options file you sent
>>>>>                         specifically does NOT list the
>>>>>                         -ksp_view_pre so how could it be from that?
>>>>>
>>>>>                            Sorry to be pedantic but I've spent too
>>>>>                         much time in the past trying to debug from
>>>>>                         incorrect information and want to make
>>>>>                         sure that the information I have is
>>>>>                         correct before thinking. Please recheck
>>>>>                         exactly what happened. Rerun with the
>>>>>                         exact input file you emailed if that is
>>>>>                         needed.
>>>>>
>>>>>                            Barry
>>>>>
>>>>>                         >
>>>>>                         > Frank
>>>>>                         >
>>>>>                         >
>>>>>                         > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>>>                         >>   Why does ksp_view2.txt have two KSP
>>>>>                         views in it while ksp_view1.txt has only
>>>>>                         one KSP view in it? Did you run two
>>>>>                         different solves in the second case but
>>>>>                         not the first?
>>>>>                         >>
>>>>>                         >>  Barry
>>>>>                         >>
>>>>>                         >>
>>>>>                         >>
>>>>>                         >>> On Sep 9, 2016, at 10:56 AM, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>
>>>>>                         >>> Hi,
>>>>>                         >>>
>>>>>                         >>> I want to continue digging into the
>>>>>                         memory problem here.
>>>>>                         >>> I did find a workaround in the past,
>>>>>                         which is to use fewer cores per node so
>>>>>                         that each core has 8G memory. However, this
>>>>>                         is inefficient and expensive. I hope to
>>>>>                         locate the place that uses the most memory.
>>>>>                         >>>
>>>>>                         >>> Here is a brief summary of the tests I
>>>>>                         did in the past:
>>>>>                         >>>
>>>>>                         >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>>>                         >>> Maximum (over computational time) process memory:        total 7.0727e+08
>>>>>                         >>> Current process memory:                                  total 7.0727e+08
>>>>>                         >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>>>                         >>> Current space PetscMalloc()ed:                           total 1.8275e+09
>>>>>                         >>>
>>>>>                         >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>>>                         >>> Maximum (over computational time) process memory:        total 5.9431e+09
>>>>>                         >>> Current process memory:                                  total 5.9431e+09
>>>>>                         >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>>>                         >>> Current space PetscMalloc()ed:                           total 5.4844e+09
>>>>>                         >>>
>>>>>                         >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>>>                         >>> The OOM (Out Of Memory) killer of the
>>>>>                         supercomputer terminated the job during
>>>>>                         "KSPSolve".
>>>>>                         >>>
>>>>>                         >>> I attached the output of ksp_view( the
>>>>>                         third test's output is from ksp_view_pre
>>>>>                         ), memory_view and also the petsc options.
>>>>>                         >>>
>>>>>                         >>> In all the tests, each core can access
>>>>>                         about 2G memory. In Test3, there are
>>>>>                         4223139840 non-zeros in the matrix. Using
>>>>>                         double precision, these will consume about
>>>>>                         1.74 MB per process. Even considering some
>>>>>                         extra memory used to store the integer
>>>>>                         indices, 2G memory should still be more
>>>>>                         than enough.
>>>>>                         >>>
>>>>>                         >>> Is there a way to find out which part
>>>>>                         of KSPSolve uses the most memory?
>>>>>                         >>> Thank you so much.
>>>>>                         >>>
>>>>>                         >>> BTW, there are 4 options that remain
>>>>>                         unused and I don't understand why they are
>>>>>                         omitted:
>>>>>                         >>>
>>>>>                         -mg_coarse_telescope_mg_coarse_ksp_type
>>>>>                         value: preonly
>>>>>                         >>> -mg_coarse_telescope_mg_coarse_pc_type
>>>>>                         value: bjacobi
>>>>>                         >>>
>>>>>                         -mg_coarse_telescope_mg_levels_ksp_max_it
>>>>>                         value: 1
>>>>>                         >>>
>>>>>                         -mg_coarse_telescope_mg_levels_ksp_type
>>>>>                         value: richardson
>>>>>                         >>>
>>>>>                         >>>
>>>>>                         >>> Regards,
>>>>>                         >>> Frank
>>>>>                         >>>
>>>>>                         >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>>                         >>>>
>>>>>                         >>>> On 14 July 2016 at 01:07, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>> Hi Dave,
>>>>>                         >>>>
>>>>>                         >>>> Sorry for the late reply.
>>>>>                         >>>> Thank you so much for your detailed
>>>>>                         reply.
>>>>>                         >>>>
>>>>>                         >>>> I have a question about the
>>>>>                         estimation of the memory usage. There are
>>>>>                         4223139840 allocated non-zeros and 18432
>>>>>                         MPI processes. Double precision is used.
>>>>>                         So the memory per process is:
>>>>>                         >>>>   4223139840 * 8 bytes / 18432 / 1024
>>>>>                         / 1024 = 1.74 MB?
>>>>>                         >>>> Did I do something wrong here? Because
>>>>>                         this seems too small.
>>>>>                         >>>>
>>>>>                         >>>> No - I totally f***ed it up. You are
>>>>>                         correct. That'll teach me for fumbling
>>>>>                         around with my iphone calculator and not
>>>>>                         using my brain. (Note that to convert to
>>>>>                         MB just divide by 1e6, not 1024^2 -
>>>>>                         although I apparently cannot convert
>>>>>                         between units correctly....)
>>>>>                         >>>>
>>>>>                         >>>> From the PETSc objects associated
>>>>>                         with the solver, it looks like it _should_
>>>>>                         run with 2GB per MPI rank. Sorry for my
>>>>>                         mistake. Possibilities are: somewhere in
>>>>>                         your usage of PETSc you've introduced a
>>>>>                         memory leak; PETSc is doing a huge over
>>>>>                         allocation (e.g. as per our discussion of
>>>>>                         MatPtAP); or in your application code
>>>>>                         there are other objects you have forgotten
>>>>>                         to log the memory for.
>>>>>                         >>>>
>>>>>                         >>>>
>>>>>                         >>>>
>>>>>                         >>>> I am running this job on Blue Waters.
>>>>>                         >>>> I am using a 7-point FD stencil in 3D.
>>>>>                         >>>>
>>>>>                         >>>> I thought so on both counts.
>>>>>                         >>>>
>>>>>                         >>>> I apologize that I made a stupid
>>>>>                         mistake in computing the memory per core.
>>>>>                         My settings meant each core could access
>>>>>                         only 2G memory on average, instead of the
>>>>>                         8G I mentioned in a previous email. I
>>>>>                         re-ran the job with 8G memory per core on
>>>>>                         average and there is no "Out Of Memory"
>>>>>                         error. I will do more tests to see if
>>>>>                         there is still some memory issue.
>>>>>                         >>>>
>>>>>                         >>>> Ok. I'd still like to know where the
>>>>>                         memory was being used since my estimates
>>>>>                         were off.
>>>>>                         >>>>
>>>>>                         >>>>
>>>>>                         >>>> Thanks,
>>>>>                         >>>>   Dave
>>>>>                         >>>>
>>>>>                         >>>> Regards,
>>>>>                         >>>> Frank
>>>>>                         >>>>
>>>>>                         >>>>
>>>>>                         >>>>
>>>>>                         >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>>>                         >>>>> Hi Frank,
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>> On 11 July 2016 at 19:14, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>>> Hi Dave,
>>>>>                         >>>>>
>>>>>                         >>>>> I re-ran the test using bjacobi as
>>>>>                         the preconditioner on the coarse mesh of
>>>>>                         telescope. The grid is 3072*256*768 and the
>>>>>                         process mesh is 96*8*24. The PETSc options
>>>>>                         file is attached.
>>>>>                         >>>>> I still got the "Out Of Memory"
>>>>>                         error. The error occurred before the
>>>>>                         linear solver finished one step. So I
>>>>>                         don't have the full info from ksp_view.
>>>>>                         The info from ksp_view_pre is attached.
>>>>>                         >>>>>
>>>>>                         >>>>> Okay - that is essentially useless
>>>>>                         (sorry)
>>>>>                         >>>>>
>>>>>                         >>>>> It seems to me that the error
>>>>>                         occurred when the decomposition was going
>>>>>                         to be changed.
>>>>>                         >>>>>
>>>>>                         >>>>> Based on what information?
>>>>>                         >>>>> Running with -info would give us
>>>>>                         more clues, but will create a ton of output.
>>>>>                         >>>>> Please try running the case which
>>>>>                         failed with -info
>>>>>                         >>>>>  I had another test with a grid of
>>>>>                         1536*128*384 and the same process mesh as
>>>>>                         above. There was no error. The ksp_view
>>>>>                         info is attached for comparison.
>>>>>                         >>>>> Thank you.
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>> [3] Here is my crude estimate of
>>>>>                         your memory usage.
>>>>>                         >>>>> I'll target the biggest memory hogs
>>>>>                         only to get an order-of-magnitude estimate.
>>>>>                         >>>>>
>>>>>                         >>>>> * The Fine grid operator contains
>>>>>                         4223139840 non-zeros --> 1.8 GB per MPI
>>>>>                         rank assuming double precision.
>>>>>                         >>>>> The indices for the AIJ could amount
>>>>>                         to another 0.3 GB (assuming 32 bit integers)
>>>>>                         >>>>>
>>>>>                         >>>>> * You use 5 levels of coarsening, so
>>>>>                         the other operators should represent
>>>>>                         (collectively)
>>>>>                         >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 +
>>>>>                         2.1/8^4  ~ 300 MB per MPI rank on the
>>>>>                         communicator with 18432 ranks.
>>>>>                         >>>>> The coarse grid should consume ~ 0.5
>>>>>                         MB per MPI rank on the communicator with
>>>>>                         18432 ranks.
>>>>>                         >>>>>
>>>>>                         >>>>> * You use a reduction factor of 64,
>>>>>                         making the new communicator with 288 MPI
>>>>>                         ranks.
>>>>>                         >>>>> PCTelescope will first gather a
>>>>>                         temporary matrix associated with your
>>>>>                         coarse level operator assuming a comm size
>>>>>                         of 288 living on the comm with size 18432.
>>>>>                         >>>>> This matrix will require
>>>>>                         approximately 0.5 * 64 = 32 MB per core on
>>>>>                         the 288 ranks.
>>>>>                         >>>>> This matrix is then used to form a
>>>>>                         new MPIAIJ matrix on the subcomm, thus
>>>>>                         requiring another 32 MB per rank.
>>>>>                         >>>>> The temporary matrix is now destroyed.
>>>>>                         >>>>>
>>>>>                         >>>>> * Because a DMDA is detected, a
>>>>>                         permutation matrix is assembled.
>>>>>                         >>>>> This requires 2 doubles per point in
>>>>>                         the DMDA.
>>>>>                         >>>>> Your coarse DMDA contains 92 x 16 x
>>>>>                         48 points.
>>>>>                         >>>>> Thus the permutation matrix will
>>>>>                         require < 1 MB per MPI rank on the sub-comm.
>>>>>                         >>>>>
>>>>>                         >>>>> * Lastly, the matrix is permuted.
>>>>>                         This uses MatPtAP(), but the resulting
>>>>>                         operator will have the same memory
>>>>>                         footprint as the unpermuted matrix (32
>>>>>                         MB). At any stage in PCTelescope, only 2
>>>>>                         operators of size 32 MB are held in memory
>>>>>                         when the DMDA is provided.
>>>>>                         >>>>>
>>>>>                         >>>>> From my rough estimates, the worst
>>>>>                         case memory footprint for any given core,
>>>>>                         given your options, is approximately
>>>>>                         >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>>>>                         >>>>> This is way below 8 GB.
>>>>>                         >>>>>
>>>>>                         >>>>> Note this estimate completely ignores:
>>>>>                         >>>>> (1) the memory required for the
>>>>>                         restriction operator,
>>>>>                         >>>>> (2) the potential growth in the
>>>>>                         number of non-zeros per row due to
>>>>>                         Galerkin coarsening (I wish
>>>>>                         -ksp_view_pre reported the output from
>>>>>                         MatView so we could see the number of
>>>>>                         non-zeros required by the coarse level
>>>>>                         operators)
>>>>>                         >>>>> (3) all temporary vectors required
>>>>>                         by the CG solver, and those required by
>>>>>                         the smoothers.
>>>>>                         >>>>> (4) internal memory allocated by MatPtAP
>>>>>                         >>>>> (5) memory associated with IS's used
>>>>>                         within PCTelescope
>>>>>                         >>>>>
>>>>>                         >>>>> So either I am completely off in my
>>>>>                         estimates, or you have not carefully
>>>>>                         estimated the memory usage of your
>>>>>                         application code. Hopefully others might
>>>>>                         examine/correct my rough estimates.
>>>>>                         >>>>>
>>>>>                         >>>>> Since I don't have your code, I
>>>>>                         cannot assess the latter.
>>>>>                         >>>>> Since I don't have access to the
>>>>>                         same machine you are running on, I think
>>>>>                         we need to take a step back.
>>>>>                         >>>>>
>>>>>                         >>>>> [1] What machine are you running on?
>>>>>                         Send me a URL if it's available.
>>>>>                         >>>>>
>>>>>                         >>>>> [2] What discretization are you
>>>>>                         using? (I am guessing a scalar 7 point FD
>>>>>                         stencil)
>>>>>                         >>>>> If it's a 7-point FD stencil, we
>>>>>                         should be able to examine the memory usage
>>>>>                         of your solver configuration using a
>>>>>                         standard, lightweight existing PETSc
>>>>>                         example, run on your machine at the same
>>>>>                         scale.
>>>>>                         >>>>> This would hopefully enable us to
>>>>>                         correctly evaluate the actual memory usage
>>>>>                         required by the solver configuration you
>>>>>                         are using.
>>>>>                         >>>>>
>>>>>                         >>>>> Thanks,
>>>>>                         >>>>>   Dave
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>> Frank
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>>
>>>>>                         >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>>>                         >>>>>>
>>>>>                         >>>>>> On Saturday, 9 July 2016, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>>>> Hi Barry and Dave,
>>>>>                         >>>>>>
>>>>>                         >>>>>> Thank both of you for the advice.
>>>>>                         >>>>>>
>>>>>                         >>>>>> @Barry
>>>>>                         >>>>>> I made a mistake in the file names
>>>>>                         in last email. I attached the correct
>>>>>                         files this time.
>>>>>                         >>>>>> For all three tests, 'Telescope'
>>>>>                         is used as the coarse preconditioner.
>>>>>                         >>>>>>
>>>>>                         >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>>>>                         >>>>>> Part of the memory usage:
>>>>>                         >>>>>>   Vector   125   124   3971904   0.
>>>>>                         >>>>>>   Matrix   101   101   9462372   0.
>>>>>                         >>>>>>
>>>>>                         >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>>>>                         >>>>>> Part of the memory usage:
>>>>>                         >>>>>>   Vector   125   124   681672    0.
>>>>>                         >>>>>>   Matrix   101   101   1462180   0.
>>>>>                         >>>>>>
>>>>>                         >>>>>> In theory, the memory usage in
>>>>>                         Test1 should be 8 times that of Test2. In
>>>>>                         my case, it is about 6 times.
>>>>>                         >>>>>>
>>>>>                         >>>>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24
>>>>>                         >>>>>> Sub-domain per process: 32*32*32
>>>>>                         >>>>>> Here I get the out of memory error.
>>>>>                         >>>>>>
>>>>>                         >>>>>> I tried to use -mg_coarse jacobi.
>>>>>                         In this way, I don't need to set
>>>>>                         -mg_coarse_ksp_type and -mg_coarse_pc_type
>>>>>                         explicitly, right?
>>>>>                         >>>>>> The linear solver didn't work in
>>>>>                         this case. PETSc output some errors.
>>>>>                         >>>>>>
>>>>>                         >>>>>> @Dave
>>>>>                         >>>>>> In test3, I use only one instance
>>>>>                         of 'Telescope'. On the coarse mesh of
>>>>>                         'Telescope', I used LU as the
>>>>>                         preconditioner instead of SVD.
>>>>>                         >>>>>> If I set the levels correctly,
>>>>>                         then on the last coarse mesh of MG, where
>>>>>                         it calls 'Telescope', the sub-domain per
>>>>>                         process is 2*2*2.
>>>>>                         >>>>>> On the last coarse mesh of
>>>>>                         'Telescope', there is only one grid point
>>>>>                         per process.
>>>>>                         >>>>>> I still got the OOM error. The
>>>>>                         detailed PETSc options file is attached.
>>>>>                         >>>>>>
>>>>>                         >>>>>> Do you understand the expected
>>>>>                         memory usage for the particular parallel
>>>>>                         LU implementation you are using? I don't
>>>>>                         (seriously). Replace LU with bjacobi and
>>>>>                         re-run this test. My point about solver
>>>>>                         debugging is still valid.
>>>>>                         >>>>>>
>>>>>                         >>>>>> And please send the result of
>>>>>                         KSPView so we can see what is actually
>>>>>                         used in the computations
>>>>>                         >>>>>>
>>>>>                         >>>>>> Thanks
>>>>>                         >>>>>>   Dave
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>> Thank you so much.
>>>>>                         >>>>>>
>>>>>                         >>>>>> Frank
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>> On 07/06/2016 02:51 PM, Barry Smith
>>>>>                         wrote:
>>>>>                         >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>>>>
>>>>>                         >>>>>> Hi Barry,
>>>>>                         >>>>>>
>>>>>                         >>>>>> Thank you for your advice.
>>>>>                         >>>>>> I tried three tests. In the 1st
>>>>>                         test, the grid is 3072*256*768 and the
>>>>>                         process mesh is 96*8*24.
>>>>>                         >>>>>> The linear solver is 'cg', the
>>>>>                         preconditioner is 'mg', and 'telescope' is
>>>>>                         used as the preconditioner at the coarse
>>>>>                         mesh.
>>>>>                         >>>>>> The system gives me the "Out of
>>>>>                         Memory" error before the linear system is
>>>>>                         completely solved.
>>>>>                         >>>>>> The info from '-ksp_view_pre' is
>>>>>                         attached. It seems to me that the error
>>>>>                         occurs when it reaches the coarse mesh.
>>>>>                         >>>>>>
>>>>>                         >>>>>> The 2nd test uses a grid of
>>>>>                         1536*128*384 and a process mesh of 96*8*24.
>>>>>                         The 3rd test uses the same grid but a
>>>>>                         different process mesh, 48*4*12.
>>>>>                         >>>>>>     Are you sure this is right? The
>>>>>                         total matrix and vector memory usage goes
>>>>>                         from the 2nd test
>>>>>                         >>>>>>                Vector   384   383   8,193,712    0.
>>>>>                         >>>>>>                Matrix   103   103   11,508,688   0.
>>>>>                         >>>>>> to the 3rd test
>>>>>                         >>>>>>                Vector   384   383   1,590,520    0.
>>>>>                         >>>>>>                Matrix   103   103   3,508,664    0.
>>>>>                         >>>>>> that is, the memory usage got
>>>>>                         smaller, but if you have only 1/8th the
>>>>>                         processes and the same grid it should have
>>>>>                         gotten about 8 times bigger. Did you maybe
>>>>>                         cut the grid by a factor of 8 also? If so,
>>>>>                         that still doesn't explain it, because the
>>>>>                         memory usage changed by a factor of 5
>>>>>                         something for the vectors and 3 something
>>>>>                         for the matrices.
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>> The linear solver and PETSc options
>>>>>                         in the 2nd and 3rd tests are the same as in
>>>>>                         the 1st test. The linear solver works fine
>>>>>                         in both tests.
>>>>>                         >>>>>> I attached the memory usage of the
>>>>>                         2nd and 3rd tests. The memory info is from
>>>>>                         the option '-log_summary'. I tried to use
>>>>>                         '-memory_info' as you suggested, but in my
>>>>>                         case PETSc treated it as an unused option.
>>>>>                         It output nothing about the memory. Do I
>>>>>                         need to add something to my code so I can
>>>>>                         use '-memory_info'?
>>>>>                         >>>>>>     Sorry, my mistake: the option
>>>>>                         is -memory_view.
>>>>>                         >>>>>>
>>>>>                         >>>>>>    Can you run the one case with
>>>>>                         -memory_view and -mg_coarse jacobi
>>>>>                         -ksp_max_it 1 (just so it doesn't iterate
>>>>>                         forever) to see how much memory is used
>>>>>                         without the telescope? Also run case 2 the
>>>>>                         same way.
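>>>>>                         >>>>>>
>>>>>                         >>>>>> As a concrete sketch (the executable
>>>>>                         name and core count are placeholders, and
>>>>>                         -mg_coarse jacobi is spelled out as
>>>>>                         -mg_coarse_pc_type jacobi):
>>>>>                         >>>>>>
>>>>>                         >>>>>>   mpiexec -n 18432 ./your_app -memory_view -mg_coarse_pc_type jacobi -ksp_max_it 1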
>>>>>                         >>>>>>
>>>>>                         >>>>>>    Barry
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>>
>>>>>                         >>>>>> In both tests the memory usage is
>>>>>                         not large.
>>>>>                         >>>>>>
>>>>>                         >>>>>> It seems to me that it might be the
>>>>>                         'telescope' preconditioner that allocated
>>>>>                         a lot of memory and caused the error in
>>>>>                         the 1st test.
>>>>>                         >>>>>> Is there a way to show how much
>>>>>                         memory it allocated?
>>>>>                         >>>>>>
>>>>>                         >>>>>> Frank
>>>>>                         >>>>>>
>>>>>                         >>>>>> On 07/05/2016 03:37 PM, Barry Smith
>>>>>                         wrote:
>>>>>                         >>>>>>    Frank,
>>>>>                         >>>>>>
>>>>>                         >>>>>>      You can run with -ksp_view_pre
>>>>>                         to have it "view" the KSP before the solve
>>>>>                         so hopefully it gets that far.
>>>>>                         >>>>>>
>>>>>                         >>>>>>       Please run the problem that
>>>>>                         does fit with -memory_info; when the
>>>>>                         problem completes it will show the "high
>>>>>                         water mark" for PETSc allocated memory and
>>>>>                         total memory used. We first want to look
>>>>>                         at these numbers to see if it is using
>>>>>                         more memory than you expect. You could
>>>>>                         also run with, say, half the grid spacing
>>>>>                         to see how the memory usage scales with
>>>>>                         the increase in grid points. Make the runs
>>>>>                         also with -log_view and send all the
>>>>>                         output from these options.
>>>>>                         >>>>>>
>>>>>                         >>>>>>     Barry
>>>>>                         >>>>>>
>>>>>                         >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>>>>                         <hengjiew at uci.edu> wrote:
>>>>>                         >>>>>>
>>>>>                         >>>>>> Hi,
>>>>>                         >>>>>>
>>>>>                         >>>>>> I am using the CG ksp solver and
>>>>>                         Multigrid preconditioner to solve a linear
>>>>>                         system in parallel.
>>>>>                         >>>>>> I chose to use 'Telescope' as the
>>>>>                         preconditioner on the coarse mesh for its
>>>>>                         good performance.
>>>>>                         >>>>>> The petsc options file is attached.
>>>>>                         >>>>>>
>>>>>                         >>>>>> The domain is a 3d box.
>>>>>                         >>>>>> It works well when the grid is
>>>>>                         1536*128*384 and the process mesh is
>>>>>                         96*8*24. When I double the grid size in
>>>>>                         each direction and keep the same process
>>>>>                         mesh and PETSc options, I get an "out of
>>>>>                         memory" error from the super-cluster I am
>>>>>                         using.
>>>>>                         >>>>>> Each process has access to at least
>>>>>                         8G memory, which should be more than
>>>>>                         enough for my application. I am sure that
>>>>>                         all the other parts of my code (except the
>>>>>                         linear solver) do not use much memory. So
>>>>>                         I suspect that something is wrong with
>>>>>                         the linear solver.
>>>>>                         >>>>>> The error occurs before the linear
>>>>>                         system is completely solved so I don't
>>>>>                         have the info from ksp_view. I am not able
>>>>>                         to reproduce the error with a smaller
>>>>>                         problem either.
>>>>>                         >>>>>> In addition, I tried to use block
>>>>>                         Jacobi as the preconditioner with
>>>>>                         the same grid and same decomposition. The
>>>>>                         linear solver runs extremely slow but
>>>>>                         there is no memory error.
>>>>>                         >>>>>>
>>>>>                         >>>>>> How can I diagnose what exactly
>>>>>                         causes the error?
>>>>>                         >>>>>> Thank you so much.
>>>>>                         >>>>>>
>>>>>                         >>>>>> Frank
>>>>>                         >>>>>> <petsc_options.txt>
>>>>>                         >>>>>>
>>>>>                         <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>>                         >>>>>>
>>>>>                         >>>>>
>>>>>                         >>>>
>>>>>                         >>>
>>>>>                         <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>>>>                         >
>>>>>
>>>>
>>>
>>>
>>
>
>
>
>
> -- 
> What most experimenters take for granted before they begin their 
> experiments is infinitely more interesting than any results to which 
> their experiments lead.
> -- Norbert Wiener


