[petsc-users] Performance of the Telescope Multigrid Preconditioner

frank hengjiew at uci.edu
Fri Oct 7 16:49:45 CDT 2016


Dear all,

Thank you so much for the advice.

> All setup is done in the first solve.
>
>     ** The time for 1st solve does not scale.
>     In practice, I am solving a variable-coefficient Poisson
>     equation. I need to build the matrix every time step. Therefore,
>     each step is similar to the 1st solve, which does not scale. Is
>     there a way I can improve the performance?
>
>
>     You could use rediscretization instead of Galerkin to produce the
>     coarse operators.
>
>
> Yes, I can think of one option for improved performance, but I cannot
> tell whether it will be beneficial because the logging isn't
> sufficiently fine-grained (and there is no easy way to get the info
> out of petsc).
>
> I use PtAP to repartition the matrix; this could be consuming most of
> the setup time in Telescope with your run. Such a repartitioning could
> be avoided if you provided a method to create the operator on the
> coarse levels (what Matt is suggesting). However, this requires you to
> be able to define your coefficients on the coarse grid. This will most
> likely reduce setup time, but your coarse grid operators (now
> re-discretized) are likely to be less effective than those generated
> via Galerkin coarsening.

Please correct me if I understand this incorrectly: I can define my
own restriction function and pass it to PETSc instead of using PtAP.
If so, what's the interface to do that?
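To make my question concrete, here is a minimal sketch (in C) of what I
imagine the rediscretization path looks like: attach the DMDA to the KSP
and register a callback that PCMG can invoke on every level, instead of
Galerkin PtAP. The callback and context names are placeholders of mine,
and I may well have the interface wrong:

    #include <petscksp.h>

    /* Hypothetical callback: discretize the variable-coefficient
       Poisson operator on whatever (fine or coarse) DMDA it is given. */
    extern PetscErrorCode ComputeMatrix(KSP,Mat,Mat,void*);

    PetscErrorCode SetupSolver(DM da,void *ctx,KSP *ksp)
    {
      PetscErrorCode ierr;
      ierr = KSPCreate(PETSC_COMM_WORLD,ksp);CHKERRQ(ierr);
      ierr = KSPSetDM(*ksp,da);CHKERRQ(ierr);
      /* With a DM attached, PCMG can call this on every level, so the
         coefficients must be evaluable on each coarse grid. */
      ierr = KSPSetComputeOperators(*ksp,ComputeMatrix,ctx);CHKERRQ(ierr);
      ierr = KSPSetFromOptions(*ksp);CHKERRQ(ierr);
      return 0;
    }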

>     Also, you use CG/MG when FMG by itself would probably be faster.
>     Your smoother is likely not strong enough, and you
>     should use something like V(2,2). There is a lot of tuning that is
>     possible, but difficult to automate.
>
>
> Matt's completely correct.
> If we could automate this in a meaningful manner, we would have done so.

I am not as familiar with multigrid as you guys are. It would be very
kind if you could be more specific.
What does V(2,2) stand for? Is there a strong smoother built into
PETSc that I can try?
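My guess, for you to confirm: V(2,2) denotes a V-cycle with 2
pre-smoothing and 2 post-smoothing sweeps on each level. If that is
right, would options roughly like the following request it? (Untested
guesses on my side.)

    -pc_mg_type full               # FMG, as Matt suggested
    -pc_mg_cycle_type v
    -mg_levels_ksp_type chebyshev  # or richardson
    -mg_levels_ksp_max_it 2        # 2 smoothing sweeps per level
    -mg_levels_pc_type sor         # a stronger smoother than jacobi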


Another thing: vector assembly and scatter take more time as I
increase the number of cores (times in seconds):

  Event              Count    4096       8192       16384      32768      65536
  VecAssemblyBegin   298      2.91E+00   2.87E+00   8.59E+00   2.75E+01   2.21E+03
  VecAssemblyEnd     298      3.37E-03   1.78E-03   1.78E-03   5.13E-03   1.99E-03
  VecScatterBegin    76303    3.82E+00   3.01E+00   2.54E+00   4.40E+00   1.32E+00
  VecScatterEnd      76303    3.09E+01   1.47E+01   2.23E+01   2.96E+01   2.10E+01

The above data was produced by solving a constant-coefficient Poisson
equation with a different rhs for 100 steps.
As you can see, the time for VecAssemblyBegin increases dramatically
from 32K cores to 65K.
With 65K cores, it takes more time to assemble the rhs than to solve
the equation. Is there a way to improve this?
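For reference, my understanding (possibly wrong) is that when every rhs
entry being set is locally owned, the values can be written through the
DMDA array interface and VecAssemblyBegin/End can be skipped entirely,
since no off-process entries are involved. A sketch in C of what I mean
(my code uses the Fortran analogue, DMDAVecGetArrayF90):

    #include <petscdmda.h>

    /* Fill a DMDA-managed rhs vector touching only locally owned
       entries, so no assembly communication is required. */
    PetscErrorCode FillRHS(DM da,Vec b)
    {
      PetscErrorCode ierr;
      PetscInt       i,j,k,xs,ys,zs,xm,ym,zm;
      PetscScalar    ***barr;

      ierr = DMDAGetCorners(da,&xs,&ys,&zs,&xm,&ym,&zm);CHKERRQ(ierr);
      ierr = DMDAVecGetArray(da,b,&barr);CHKERRQ(ierr);
      for (k=zs; k<zs+zm; k++)
        for (j=ys; j<ys+ym; j++)
          for (i=xs; i<xs+xm; i++)
            barr[k][j][i] = 1.0; /* placeholder for the actual rhs value */
      ierr = DMDAVecRestoreArray(da,b,&barr);CHKERRQ(ierr);
      return 0;
    }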


Thank you.

Regards,
Frank

>
>         On 10/04/2016 12:56 PM, Dave May wrote:
>>
>>
>>         On Tuesday, 4 October 2016, frank <hengjiew at uci.edu
>>         <mailto:hengjiew at uci.edu>> wrote:
>>
>>             Hi,
>>
>>             This question is a follow-up to the thread "Question about
>>             memory usage in Multigrid preconditioner".
>>             I used to have an "Out of Memory" (OOM) problem when
>>             using the CG+Telescope MG solver with 32768 cores. Adding
>>             the "-matrap 0 -matptap_scalable" options did solve that
>>             problem.
>>
>>             Then I tested the scalability by solving a 3D Poisson
>>             equation for one step. I used one sub-communicator in all
>>             the tests. The only differences between the petsc options
>>             in those tests are: (1) the pc_telescope_reduction_factor;
>>             (2) the number of multigrid levels in the up/down solver.
>>             The function KSPSolve is timed. It is quite slow and
>>             doesn't scale at all.
>>
>>             Test1: 512^3 grid points
>>             Core#   telescope_reduction_factor   MG levels (up/down)   KSPSolve time (s)
>>             512     8                            4 / 3                 6.2466
>>             4096    64                           5 / 3                 0.9361
>>             32768   64                           4 / 3                 4.8914
>>
>>             Test2: 1024^3 grid points
>>             Core#   telescope_reduction_factor   MG levels (up/down)   KSPSolve time (s)
>>             4096    64                           5 / 4                 3.4139
>>             8192    128                          5 / 4                 2.4196
>>             16384   32                           5 / 3                 5.4150
>>             32768   64                           5 / 3                 5.6067
>>             65536   128                          5 / 3                 6.5219
>>
>>
>>         You have to be very careful how you interpret these numbers.
>>         Your solver contains nested calls to KSPSolve, and
>>         unfortunately as a result the numbers you report include
>>         setup time. This will remain true even if you call KSPSetUp
>>         on the outermost KSP.
>>
>>         Your email concerns the scalability of the solver
>>         application, so let's focus on that issue.
>>
>>         The only way to clearly separate setup from solve time is
>>         to perform two identical solves. The second solve will not
>>         require any setup. You should monitor the second solve via a
>>         new PetscStage.
>>
>>         This was what I did in the telescope paper. It was the only
>>         way to understand the setup cost (and scaling) cf the solve
>>         time (and scaling).
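>>
>>         In code the pattern is roughly this (a sketch in C; the
>>         stage name is arbitrary):
>>
>>             PetscLogStage  stage;
>>             PetscErrorCode ierr;
>>             /* 1st solve: includes all of the setup */
>>             ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>>             /* identical 2nd solve: setup-free; profile it in its own stage */
>>             ierr = PetscLogStageRegister("Solve2",&stage);CHKERRQ(ierr);
>>             ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>>             ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>>             ierr = PetscLogStagePop();CHKERRQ(ierr);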
>>
>>         Thanks
>>           Dave
>>
>>             I guess I didn't set the MG levels properly. What would
>>             be an efficient way to arrange the MG levels?
>>             Also, which preconditioner should I use at the coarse
>>             mesh of the 2nd communicator to improve the performance?
>>
>>             I attached the test code and the petsc options file for
>>             the 1024^3 cube with 32768 cores.
>>
>>             Thank you.
>>
>>             Regards,
>>             Frank
>>
>>
>>
>>
>>
>>
>>             On 09/15/2016 03:35 AM, Dave May wrote:
>>>             Hi all,
>>>
>>>             The only unexpected memory usage I can see is
>>>             associated with the call to MatPtAP().
>>>             Here is something you can try immediately.
>>>             Run your code with the additional options
>>>               -matrap 0 -matptap_scalable
>>>
>>>             I didn't realize this before, but the default behaviour
>>>             of MatPtAP in parallel is actually to explicitly form
>>>             the transpose of P (i.e. assemble R = P^T) and then
>>>             compute R.A.P.
>>>             You don't want to do this. The option -matrap 0 resolves
>>>             this issue.
>>>
>>>             The implementation of P^T.A.P has two variants.
>>>             The scalable implementation (with respect to memory
>>>             usage) is selected via the second option -matptap_scalable.
>>>
>>>             Try it out - I see a significant memory reduction using
>>>             these options for particular mesh sizes / partitions.
>>>
>>>             I've attached a cleaned up version of the code you sent me.
>>>             There were a number of memory leaks and other issues.
>>>             The main points being
>>>               * You should call DMDAVecGetArrayF90() before
>>>             VecAssembly{Begin,End}
>>>               * You should call PetscFinalize(), otherwise the
>>>             option -log_summary (-log_view) will not display
>>>             anything once the program has completed.
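>>>
>>>             For the second point, a minimal C skeleton (the Fortran
>>>             calls go in the same places):
>>>
>>>               #include <petscsys.h>
>>>               int main(int argc,char **argv)
>>>               {
>>>                 PetscErrorCode ierr;
>>>                 ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
>>>                 /* ... create the DMDA, assemble, solve ... */
>>>                 ierr = PetscFinalize(); /* without this, -log_summary/-log_view print nothing */
>>>                 return ierr;
>>>               }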
>>>
>>>
>>>             Thanks,
>>>               Dave
>>>
>>>
>>>             On 15 September 2016 at 08:03, Hengjie Wang
>>>             <hengjiew at uci.edu> wrote:
>>>
>>>                 Hi Dave,
>>>
>>>                 Sorry, I should have added more comments to explain
>>>                 the code.
>>>                 The number of processes in each dimension is the
>>>                 same: Px = Py = Pz = P. So is the domain size.
>>>                 So if you want to run the code for 512^3 grid
>>>                 points on 16^3 cores, you need to set "-N 512 -P 16"
>>>                 on the command line.
>>>                 I added more comments and also fixed an error in the
>>>                 attached code. (The error only affects the accuracy
>>>                 of the solution, not the memory usage.)
>>>
>>>                 Thank you.
>>>                 Frank
>>>
>>>
>>>                 On 9/14/2016 9:05 PM, Dave May wrote:
>>>>
>>>>
>>>>                 On Thursday, 15 September 2016, Dave May
>>>>                 <dave.mayhem23 at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>                     On Thursday, 15 September 2016, frank
>>>>                     <hengjiew at uci.edu> wrote:
>>>>
>>>>                         Hi,
>>>>
>>>>                         I wrote a simple code to reproduce the
>>>>                         error. I hope this can help to diagnose the
>>>>                         problem.
>>>>                         The code just solves a 3D Poisson equation.
>>>>
>>>>
>>>>                     Why is the stencil width a runtime parameter??
>>>>                     And why is the default value 2? For 7-point FD
>>>>                     Laplace, you only need a stencil width of 1.
>>>>
>>>>                     Was this choice made to mimic something in the
>>>>                     real application code?
>>>>
>>>>
>>>>                 Please ignore - I misunderstood your usage of the
>>>>                 param set by -P
>>>>
>>>>
>>>>                         I ran the code on a 1024^3 mesh. The
>>>>                         process partition is 32 * 32 * 32. That's
>>>>                         when I reproduced the OOM error. Each core
>>>>                         has about 2G memory.
>>>>                         I also ran the code on a 512^3 mesh with 16
>>>>                         * 16 * 16 processes. The ksp solver works
>>>>                         fine.
>>>>                         I attached the code, ksp_view_pre's output
>>>>                         and my petsc option file.
>>>>
>>>>                         Thank you.
>>>>                         Frank
>>>>
>>>>                         On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>>>                         Hi Barry,
>>>>>
>>>>>                         I checked. On the supercomputer, I had the
>>>>>                         option "-ksp_view_pre" but it is not in the
>>>>>                         file I sent you. I am sorry for the confusion.
>>>>>
>>>>>                         Regards,
>>>>>                         Frank
>>>>>
>>>>>                         On Friday, September 9, 2016, Barry Smith
>>>>>                         <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>
>>>>>                             > On Sep 9, 2016, at 3:11 PM, frank
>>>>>                             <hengjiew at uci.edu> wrote:
>>>>>                             >
>>>>>                             > Hi Barry,
>>>>>                             >
>>>>>                             > I think the first KSP view output is
>>>>>                             from -ksp_view_pre. Before I submitted
>>>>>                             the test, I was not sure whether there
>>>>>                             would be OOM error or not. So I added
>>>>>                             both -ksp_view_pre and -ksp_view.
>>>>>
>>>>>                               But the options file you sent
>>>>>                             specifically does NOT list the
>>>>>                             -ksp_view_pre so how could it be from
>>>>>                             that?
>>>>>
>>>>>                                Sorry to be pedantic but I've spent
>>>>>                             too much time in the past trying to
>>>>>                             debug from incorrect information and
>>>>>                             want to make sure that the information
>>>>>                             I have is correct before thinking.
>>>>>                             Please recheck exactly what happened.
>>>>>                             Rerun with the exact input file you
>>>>>                             emailed if that is needed.
>>>>>
>>>>>                                Barry
>>>>>
>>>>>                             >
>>>>>                             > Frank
>>>>>                             >
>>>>>                             >
>>>>>                             > On 09/09/2016 12:38 PM, Barry Smith
>>>>>                             wrote:
>>>>>                             >>   Why does ksp_view2.txt have two
>>>>>                             KSP views in it while ksp_view1.txt
>>>>>                             has only one KSP view in it? Did you
>>>>>                             run two different solves in the 2nd
>>>>>                             case but not in the 1st?
>>>>>                             >>
>>>>>                             >>  Barry
>>>>>                             >>
>>>>>                             >>
>>>>>                             >>
>>>>>                             >>> On Sep 9, 2016, at 10:56 AM, frank
>>>>>                             <hengjiew at uci.edu> wrote:
>>>>>                             >>>
>>>>>                             >>> Hi,
>>>>>                             >>>
>>>>>                             >>> I want to continue digging into
>>>>>                             the memory problem here.
>>>>>                             >>> I did find a workaround in the
>>>>>                             past, which is to use fewer cores per
>>>>>                             node so that each core has 8G memory.
>>>>>                             However, this is inefficient and
>>>>>                             expensive. I hope to locate the place
>>>>>                             that uses the most memory.
>>>>>                             >>>
>>>>>                             >>> Here is a brief summary of the
>>>>>                             tests I did in the past:
>>>>>                             >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>>>                             >>> Maximum (over computational time) process memory:        total 7.0727e+08
>>>>>                             >>> Current process memory:                                  total 7.0727e+08
>>>>>                             >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>>>                             >>> Current space PetscMalloc()ed:                           total 1.8275e+09
>>>>>                             >>>
>>>>>                             >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>>>                             >>> Maximum (over computational time) process memory:        total 5.9431e+09
>>>>>                             >>> Current process memory:                                  total 5.9431e+09
>>>>>                             >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>>>                             >>> Current space PetscMalloc()ed:                           total 5.4844e+09
>>>>>                             >>>
>>>>>                             >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>>>                             >>> The OOM (Out Of Memory) killer of
>>>>>                             the supercomputer terminated the job
>>>>>                             during KSPSolve.
>>>>>                             >>>
>>>>>                             >>> I attached the output of ksp_view
>>>>>                             (the third test's output is from
>>>>>                             ksp_view_pre), memory_view and also
>>>>>                             the petsc options.
>>>>>                             >>>
>>>>>                             >>> In all the tests, each core can
>>>>>                             access about 2G memory. In test3,
>>>>>                             there are 4223139840 non-zeros in the
>>>>>                             matrix. These will consume about 1.74 MB
>>>>>                             per core, using double precision.
>>>>>                             Considering some extra memory used to
>>>>>                             store integer indices, 2G memory should
>>>>>                             still be more than enough.
>>>>>                             >>>
>>>>>                             >>> Is there a way to find out which
>>>>>                             part of KSPSolve uses the most memory?
>>>>>                             >>> Thank you so much.
>>>>>                             >>>
>>>>>                             >>> BTW, there are 4 options that
>>>>>                             remain unused and I don't understand
>>>>>                             why they are omitted:
>>>>>                             >>>
>>>>>                             -mg_coarse_telescope_mg_coarse_ksp_type
>>>>>                             value: preonly
>>>>>                             >>>
>>>>>                             -mg_coarse_telescope_mg_coarse_pc_type
>>>>>                             value: bjacobi
>>>>>                             >>>
>>>>>                             -mg_coarse_telescope_mg_levels_ksp_max_it
>>>>>                             value: 1
>>>>>                             >>>
>>>>>                             -mg_coarse_telescope_mg_levels_ksp_type
>>>>>                             value: richardson
>>>>>                             >>>
>>>>>                             >>>
>>>>>                             >>> Regards,
>>>>>                             >>> Frank
>>>>>                             >>>
>>>>>                             >>> On 07/13/2016 05:47 PM, Dave May
>>>>>                             wrote:
>>>>>                             >>>>
>>>>>                             >>>> On 14 July 2016 at 01:07, frank
>>>>>                             <hengjiew at uci.edu> wrote:
>>>>>                             >>>> Hi Dave,
>>>>>                             >>>>
>>>>>                             >>>> Sorry for the late reply.
>>>>>                             >>>> Thank you so much for your
>>>>>                             detailed reply.
>>>>>                             >>>>
>>>>>                             >>>> I have a question about the
>>>>>                             estimation of the memory usage. There
>>>>>                             are 4223139840 allocated non-zeros and
>>>>>                             18432 MPI processes. Double precision
>>>>>                             is used. So the memory per process is:
>>>>>                             >>>>   4223139840 * 8 bytes / 18432 /
>>>>>                             1024 / 1024 = 1.74 MB?
>>>>>                             >>>> Did I do something wrong here?
>>>>>                             Because this seems too small.
>>>>>                             >>>>
>>>>>                             >>>> No - I totally f***ed it up. You
>>>>>                             are correct. That'll teach me for
>>>>>                             fumbling around with my iphone
>>>>>                             calculator and not using my brain.
>>>>>                             (Note that to convert to MB just
>>>>>                             divide by 1e6, not 1024^2 - although I
>>>>>                             apparently cannot convert between
>>>>>                             units correctly....)
>>>>>                             >>>>
>>>>>                             >>>> From the PETSc objects associated
>>>>>                             with the solver, it looks like it
>>>>>                             _should_ run with 2GB per MPI rank.
>>>>>                             Sorry for my mistake. Possibilities
>>>>>                             are: somewhere in your usage of PETSc
>>>>>                             you've introduced a memory leak; PETSc
>>>>>                             is doing a huge over allocation (e.g.
>>>>>                             as per our discussion of MatPtAP); or
>>>>>                             in your application code there are
>>>>>                             other objects you have forgotten to
>>>>>                             log the memory for.
>>>>>                             >>>>
>>>>>                             >>>>
>>>>>                             >>>>
>>>>>                             >>>> I am running this job on Blue
>>>>>                             Waters.
>>>>>                             >>>> I am using the 7-point FD
>>>>>                             stencil in 3D.
>>>>>                             >>>>
>>>>>                             >>>> I thought so on both counts.
>>>>>                             >>>>
>>>>>                             >>>> I apologize that I made a stupid
>>>>>                             mistake in computing the memory per
>>>>>                             core. My settings let each core
>>>>>                             access only 2G memory on average,
>>>>>                             instead of the 8G I mentioned in the
>>>>>                             previous email. I re-ran the job with
>>>>>                             8G memory per core on average and
>>>>>                             there is no "Out Of Memory" error. I
>>>>>                             will do more tests to see if there is
>>>>>                             still some memory issue.
>>>>>                             >>>>
>>>>>                             >>>> Ok. I'd still like to know where
>>>>>                             the memory was being used since my
>>>>>                             estimates were off.
>>>>>                             >>>>
>>>>>                             >>>>
>>>>>                             >>>> Thanks,
>>>>>                             >>>>   Dave
>>>>>                             >>>>
>>>>>                             >>>> Regards,
>>>>>                             >>>> Frank
>>>>>                             >>>>
>>>>>                             >>>>
>>>>>                             >>>>
>>>>>                             >>>> On 07/11/2016 01:18 PM, Dave May
>>>>>                             wrote:
>>>>>                             >>>>> Hi Frank,
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>> On 11 July 2016 at 19:14, frank
>>>>>                             <hengjiew at uci.edu> wrote:
>>>>>                             >>>>> Hi Dave,
>>>>>                             >>>>>
>>>>>                             >>>>> I re-ran the test using bjacobi
>>>>>                             as the preconditioner on the coarse
>>>>>                             mesh of telescope. The grid is
>>>>>                             3072*256*768 and the process mesh is
>>>>>                             96*8*24. The petsc option file is
>>>>>                             attached.
>>>>>                             >>>>> I still got the "Out Of Memory"
>>>>>                             error. The error occurred before the
>>>>>                             linear solver finished one step, so I
>>>>>                             don't have the full info from
>>>>>                             ksp_view. The info from ksp_view_pre
>>>>>                             is attached.
>>>>>                             >>>>>
>>>>>                             >>>>> Okay - that is essentially
>>>>>                             useless (sorry)
>>>>>                             >>>>>
>>>>>                             >>>>> It seems to me that the error
>>>>>                             occurred when the decomposition was
>>>>>                             going to be changed.
>>>>>                             >>>>>
>>>>>                             >>>>> Based on what information?
>>>>>                             >>>>> Running with -info would give us
>>>>>                             more clues, but will create a ton of
>>>>>                             output.
>>>>>                             >>>>> Please try running the case
>>>>>                             which failed with -info
>>>>>                             >>>>>  I had another test with a grid
>>>>>                             of 1536*128*384 and the same process
>>>>>                             mesh as above. There was no error. The
>>>>>                             ksp_view info is attached for comparison.
>>>>>                             >>>>> Thank you.
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>> [3] Here is my crude estimate of
>>>>>                             your memory usage.
>>>>>                             >>>>> I'll target the biggest memory
>>>>>                             hogs only to get an order of magnitude
>>>>>                             estimate
>>>>>                             >>>>>
>>>>>                             >>>>> * The Fine grid operator
>>>>>                             contains 4223139840 non-zeros --> 1.8
>>>>>                             GB per MPI rank assuming double precision.
>>>>>                             >>>>> The indices for the AIJ could
>>>>>                             amount to another 0.3 GB (assuming 32
>>>>>                             bit integers)
>>>>>                             >>>>>
>>>>>                             >>>>> * You use 5 levels of
>>>>>                             coarsening, so the other operators
>>>>>                             should represent (collectively)
>>>>>                             >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 +
>>>>>                             2.1/8^4  ~ 300 MB per MPI rank on the
>>>>>                             communicator with 18432 ranks.
>>>>>                             >>>>> The coarse grid should consume ~
>>>>>                             0.5 MB per MPI rank on the
>>>>>                             communicator with 18432 ranks.
>>>>>                             >>>>>
>>>>>                             >>>>> * You use a reduction factor of
>>>>>                             64, making the new communicator with
>>>>>                             288 MPI ranks.
>>>>>                             >>>>> PCTelescope will first gather a
>>>>>                             temporary matrix associated with your
>>>>>                             coarse level operator assuming a comm
>>>>>                             size of 288 living on the comm with
>>>>>                             size 18432.
>>>>>                             >>>>> This matrix will require
>>>>>                             approximately 0.5 * 64 = 32 MB per
>>>>>                             core on the 288 ranks.
>>>>>                             >>>>> This matrix is then used to form
>>>>>                             a new MPIAIJ matrix on the subcomm,
>>>>>                             thus require another 32 MB per rank.
>>>>>                             >>>>> The temporary matrix is now
>>>>>                             destroyed.
>>>>>                             >>>>>
>>>>>                             >>>>> * Because a DMDA is detected, a
>>>>>                             permutation matrix is assembled.
>>>>>                             >>>>> This requires 2 doubles per
>>>>>                             point in the DMDA.
>>>>>                             >>>>> Your coarse DMDA contains 92 x
>>>>>                             16 x 48 points.
>>>>>                             >>>>> Thus the permutation matrix will
>>>>>                             require < 1 MB per MPI rank on the
>>>>>                             sub-comm.
>>>>>                             >>>>>
>>>>>                             >>>>> * Lastly, the matrix is
>>>>>                             permuted. This uses MatPtAP(), but the
>>>>>                             resulting operator will have the same
>>>>>                             memory footprint as the unpermuted
>>>>>                             matrix (32 MB). At any stage in
>>>>>                             PCTelescope, only 2 operators of size
>>>>>                             32 MB are held in memory when the DMDA
>>>>>                             is provided.
>>>>>                             >>>>>
>>>>>                             >>>>> From my rough estimates, the
>>>>>                             worst-case memory footprint for any
>>>>>                             given core, given your options, is
>>>>>                             approximately
>>>>>                             >>>>> 2100 MB + 300 MB + 32 MB + 32 MB
>>>>>                             + 1 MB  = 2465 MB
>>>>>                             >>>>> This is way below 8 GB.
>>>>>                             >>>>>
>>>>>                             >>>>> Note this estimate completely
>>>>>                             ignores:
>>>>>                             >>>>> (1) the memory required for the
>>>>>                             restriction operator,
>>>>>                             >>>>> (2) the potential growth in the
>>>>>                             number of non-zeros per row due to
>>>>>                             Galerkin coarsening (I wished
>>>>>                             -ksp_view_pre reported the output from
>>>>>                             MatView so we could see the number of
>>>>>                             non-zeros required by the coarse level
>>>>>                             operators)
>>>>>                             >>>>> (3) all temporary vectors
>>>>>                             required by the CG solver, and those
>>>>>                             required by the smoothers.
>>>>>                             >>>>> (4) internal memory allocated by
>>>>>                             MatPtAP
>>>>>                             >>>>> (5) memory associated with IS's
>>>>>                             used within PCTelescope
>>>>>                             >>>>>
>>>>>                             >>>>> So either I am completely off in
>>>>>                             my estimates, or you have not
>>>>>                             carefully estimated the memory usage
>>>>>                             of your application code. Hopefully
>>>>>                             others might examine/correct my rough
>>>>>                             estimates
>>>>>                             >>>>>
>>>>>                             >>>>> Since I don't have your code I
>>>>>                             cannot assess the latter.
>>>>>                             >>>>> Since I don't have access to the
>>>>>                             same machine you are running on, I
>>>>>                             think we need to take a step back.
>>>>>                             >>>>>
>>>>>                             >>>>> [1] What machine are you running
>>>>>                             on? Send me a URL if its available
>>>>>                             >>>>>
>>>>>                             >>>>> [2] What discretization are you
>>>>>                             using? (I am guessing a scalar 7 point
>>>>>                             FD stencil)
>>>>>                             >>>>> If it's a 7 point FD stencil, we
>>>>>                             should be able to examine the memory
>>>>>                             usage of your solver configuration
>>>>>                             using a standard, light weight
>>>>>                             existing PETSc example, run on your
>>>>>                             machine at the same scale.
>>>>>                             >>>>> This would hopefully enable us
>>>>>                             to correctly evaluate the actual
>>>>>                             memory usage required by the solver
>>>>>                             configuration you are using.
>>>>>                             >>>>>
>>>>>                             >>>>> Thanks,
>>>>>                             >>>>>   Dave
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>> Frank
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>>
>>>>>                             >>>>> On 07/08/2016 10:38 PM, Dave May
>>>>>                             wrote:
>>>>>                             >>>>>>
>>>>>                             >>>>>> On Saturday, 9 July 2016, frank
>>>>>                             <hengjiew at uci.edu> wrote:
>>>>>                             >>>>>> Hi Barry and Dave,
>>>>>                             >>>>>>
>>>>>                             >>>>>> Thank both of you for the advice.
>>>>>                             >>>>>>
>>>>>                             >>>>>> @Barry
>>>>>                             >>>>>> I made a mistake in the file
>>>>>                             names in the last email. I attached
>>>>>                             the correct files this time.
>>>>>                             >>>>>> For all the three tests,
>>>>>                             'Telescope' is used as the coarse
>>>>>                             preconditioner.
>>>>>                             >>>>>>
>>>>>                             >>>>>> == Test1: Grid: 1536*128*384,  Process Mesh: 48*4*12
>>>>>                             >>>>>> Part of the memory usage:
>>>>>                             >>>>>>   Vector   125   124   3971904   0.
>>>>>                             >>>>>>   Matrix   101   101   9462372   0.
>>>>>                             >>>>>>
>>>>>                             >>>>>> == Test2: Grid: 1536*128*384,  Process Mesh: 96*8*24
>>>>>                             >>>>>> Part of the memory usage:
>>>>>                             >>>>>>   Vector   125   124   681672    0.
>>>>>                             >>>>>>   Matrix   101   101   1462180   0.
>>>>>                             >>>>>>
>>>>>                             >>>>>> In theory, the memory usage in
>>>>>                             Test1 should be 8 times that of Test2.
>>>>>                             In my case, it is about 6 times.
>>>>>                             >>>>>>
>>>>>                             >>>>>> == Test3: Grid: 3072*256*768,  Process Mesh: 96*8*24. Sub-domain per process: 32*32*32
>>>>>                             >>>>>> Here I get the out-of-memory error.
>>>>>                             >>>>>>
>>>>>                             >>>>>> I tried to use -mg_coarse
>>>>>                             jacobi. In this way, I don't need to
>>>>>                             set -mg_coarse_ksp_type and
>>>>>                             -mg_coarse_pc_type explicitly, right?
>>>>>                             >>>>>> The linear solver didn't work
>>>>>                             in this case. PETSc output some errors.
>>>>>                             >>>>>>
>>>>>                             >>>>>> @Dave
>>>>>                             >>>>>> In test3, I use only one
>>>>>                             instance of 'Telescope'. On the coarse
>>>>>                             mesh of 'Telescope', I used LU as the
>>>>>                             preconditioner instead of SVD.
>>>>>                             >>>>>> If I set the levels correctly,
>>>>>                             then on the last coarse mesh of MG
>>>>>                             where it calls 'Telescope', the
>>>>>                             sub-domain per process is 2*2*2.
>>>>>                             >>>>>> On the last coarse mesh of
>>>>>                             'Telescope', there is only one grid
>>>>>                             point per process.
>>>>>                             >>>>>> I still got the OOM error. The
>>>>>                             detailed petsc option file is attached.
>>>>>                             >>>>>>
>>>>>                             >>>>>> Do you understand the expected
>>>>>                             memory usage for the particular
>>>>>                             parallel LU implementation you are
>>>>>                             using? I don't (seriously). Replace LU
>>>>>                             with bjacobi and re-run this test. My
>>>>>                             point about solver debugging is still
>>>>>                             valid.
>>>>>                             >>>>>>
>>>>>                             >>>>>> And please send the result of
>>>>>                             KSPView so we can see what is actually
>>>>>                             used in the computations
>>>>>                             >>>>>>
>>>>>                             >>>>>> Thanks
>>>>>                             >>>>>>   Dave
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>> Thank you so much.
>>>>>                             >>>>>>
>>>>>                             >>>>>> Frank
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>> On 07/06/2016 02:51 PM, Barry
>>>>>                             Smith wrote:
>>>>>                             >>>>>> On Jul 6, 2016, at 4:19 PM,
>>>>>                             frank <hengjiew at uci.edu> wrote:
>>>>>                             >>>>>>
>>>>>                             >>>>>> Hi Barry,
>>>>>                             >>>>>>
>>>>>                             >>>>>> Thank you for your advice.
>>>>>                             >>>>>> I tried three tests. In the 1st
>>>>>                             test, the grid is 3072*256*768 and the
>>>>>                             process mesh is 96*8*24.
>>>>>                             >>>>>> The linear solver is 'cg', the
>>>>>                             preconditioner is 'mg', and 'telescope'
>>>>>                             is used as the preconditioner at the
>>>>>                             coarse mesh.
>>>>>                             >>>>>> The system gives me the "Out of
>>>>>                             Memory" error before the linear system
>>>>>                             is completely solved.
>>>>>                             >>>>>> The info from '-ksp_view_pre'
>>>>>                             is attached. It seems to me that the
>>>>>                             error occurs when it reaches the
>>>>>                             coarse mesh.
>>>>>                             >>>>>>
>>>>>                             >>>>>> The 2nd test uses a grid of
>>>>>                             1536*128*384 and process mesh
>>>>>                             96*8*24. The 3rd test uses the same
>>>>>                             grid but a different process mesh,
>>>>>                             48*4*12.
>>>>>                             >>>>>>     Are you sure this is right?
>>>>>                             The total matrix and vector memory
>>>>>                             usage goes from the 2nd test
>>>>>                             >>>>>>                Vector   384   383   8,193,712   0.
>>>>>                             >>>>>>                Matrix   103   103   11,508,688  0.
>>>>>                             >>>>>> to the 3rd test
>>>>>                             >>>>>>                Vector   384   383   1,590,520   0.
>>>>>                             >>>>>>                Matrix   103   103   3,508,664   0.
>>>>>                             >>>>>> that is, the memory usage got
>>>>>                             smaller, but if you have only 1/8th the
>>>>>                             processes and the same grid it should
>>>>>                             have gotten about 8 times bigger. Did
>>>>>                             you maybe cut the grid by a factor of
>>>>>                             8 also? If so, that still doesn't
>>>>>                             explain it, because the memory usage
>>>>>                             changed by a factor of 5 something for
>>>>>                             the vectors and 3 something for the
>>>>>                             matrices.
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>> The linear solver and petsc
>>>>>                             options in the 2nd and 3rd tests are
>>>>>                             the same as in the 1st test. The linear
>>>>>                             solver works fine in both tests.
>>>>>                             >>>>>> I attached the memory usage of
>>>>>                             the 2nd and 3rd tests. The memory info
>>>>>                             is from the option '-log_summary'. I
>>>>>                             tried to use '-memory_info' as you
>>>>>                             suggested, but in my case petsc
>>>>>                             treated it as an unused option. It
>>>>>                             output nothing about the memory. Do I
>>>>>                             need to add something to my code so I
>>>>>                             can use '-memory_info'?
>>>>>                             >>>>>>     Sorry, my mistake: the
>>>>>                             option is -memory_view.
>>>>>                             >>>>>>
>>>>>                             >>>>>>    Can you run the one case
>>>>>                             with -memory_view and -mg_coarse
>>>>>                             jacobi -ksp_max_it 1 (just so it
>>>>>                             doesn't iterate forever) to see how
>>>>>                             much memory is used without the
>>>>>                             telescope? Also run case 2 the same way.
>>>>>                             >>>>>>
>>>>>                             >>>>>>    Barry
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>>
>>>>>                             >>>>>> In both tests the memory usage
>>>>>                             is not large.
>>>>>                             >>>>>>
>>>>>                             >>>>>> It seems to me that it might be
>>>>>                             the 'telescope' preconditioner that
>>>>>                             allocated a lot of memory and caused
>>>>>                             the error in the 1st test.
>>>>>                             >>>>>> Is there a way to show how
>>>>>                             much memory it allocated?
>>>>>                             >>>>>>
>>>>>                             >>>>>> Frank
>>>>>                             >>>>>>
>>>>>                             >>>>>> On 07/05/2016 03:37 PM, Barry
>>>>>                             Smith wrote:
>>>>>                             >>>>>>    Frank,
>>>>>                             >>>>>>
>>>>>                             >>>>>>      You can run with
>>>>>                             -ksp_view_pre to have it "view" the
>>>>>                             KSP before the solve so hopefully it
>>>>>                             gets that far.
>>>>>                             >>>>>>
>>>>>                             >>>>>>       Please run the problem
>>>>>                             that does fit with -memory_info; when
>>>>>                             the problem completes it will show the
>>>>>                             "high water mark" for PETSc allocated
>>>>>                             memory and total memory used. We first
>>>>>                             want to look at these numbers to see
>>>>>                             if it is using more memory than you
>>>>>                             expect. You could also run with, say,
>>>>>                             half the grid spacing to see how the
>>>>>                             memory usage scales with the increase
>>>>>                             in grid points. Make the runs also
>>>>>                             with -log_view and send all the output
>>>>>                             from these options.
>>>>>                             >>>>>>
>>>>>                             >>>>>>     Barry
>>>>>                             >>>>>>
>>>>>                             >>>>>> On Jul 5, 2016, at 5:23 PM,
>>>>>                             frank <hengjiew at uci.edu> wrote:
>>>>>                             >>>>>>
>>>>>                             >>>>>> Hi,
>>>>>                             >>>>>>
>>>>>                             >>>>>> I am using the CG ksp solver
>>>>>                             and Multigrid preconditioner to solve
>>>>>                             a linear system in parallel.
>>>>>                             >>>>>> I chose to use the 'Telescope'
>>>>>                             as the preconditioner on the coarse
>>>>>                             mesh for its good performance.
>>>>>                             >>>>>> The petsc options file is attached.
>>>>>                             >>>>>>
>>>>>                             >>>>>> The domain is a 3d box.
>>>>>                             >>>>>> It works well when the grid is
>>>>>                             1536*128*384 and the process mesh is
>>>>>                             96*8*24. When I double the size of the
>>>>>                             grid and keep the same process mesh
>>>>>                             and petsc options, I get an "out of
>>>>>                             memory" error from the super-cluster I
>>>>>                             am using.
>>>>>                             >>>>>> Each process has access to at
>>>>>                             least 8G memory, which should be more
>>>>>                             than enough for my application. I am
>>>>>                             sure that all the other parts of my
>>>>>                             code (except the linear solver) do
>>>>>                             not use much memory. So I suspect
>>>>>                             there is something wrong with the
>>>>>                             linear solver.
>>>>>                             >>>>>> The error occurs before the
>>>>>                             linear system is completely solved so
>>>>>                             I don't have the info from ksp_view. I
>>>>>                             am not able to reproduce the error
>>>>>                             with a smaller problem either.
>>>>>                             >>>>>> In addition,  I tried to use
>>>>>                             the block jacobi as the preconditioner
>>>>>                             with the same grid and same
>>>>>                             decomposition. The linear solver runs
>>>>>                             extremely slow but there is no memory
>>>>>                             error.
>>>>>                             >>>>>>
>>>>>                             >>>>>> How can I diagnose what exactly
>>>>>                             causes the error?
>>>>>                             >>>>>> Thank you so much.
>>>>>                             >>>>>>
>>>>>                             >>>>>> Frank
>>>>>                             >>>>>> <petsc_options.txt>
>>>>>                             >>>>>>
>>>>>                             <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>>                             >>>>>>
>>>>>                             >>>>>
>>>>>                             >>>>
>>>>>                             >>>
>>>>>                             <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>>>>                             >
>>>>>
>>>>
>>>
>>>
>>
>
>
>
>
>     -- 
>     What most experimenters take for granted before they begin their
>     experiments is infinitely more interesting than any results to
>     which their experiments lead.
>     -- Norbert Wiener
>
>
