[petsc-users] Performance of the Telescope Multigrid Preconditioner

Matthew Knepley knepley at gmail.com
Tue Oct 4 15:31:06 CDT 2016


On Tue, Oct 4, 2016 at 3:26 PM, frank <hengjiew at uci.edu> wrote:

>
> On 10/04/2016 01:20 PM, Matthew Knepley wrote:
>
> On Tue, Oct 4, 2016 at 3:09 PM, frank <hengjiew at uci.edu> wrote:
>
>> Hi Dave,
>>
>> Thank you for the reply.
>> What do you mean by the "nested calls to KSPSolve"?
>>
>
> KSPSolve is called again after redistributing the computation.
>
>
> I am still confused. There is only one KSPSolve in my code.
>

That's right. You call it once, but it is called internally again.


> Do you mean KSPSolve is called again on the sub-communicator? If that's
> the case, even if I put two identical KSPSolve calls in the code, the
> sub-communicator is still going to call KSPSolve, right?
>

Yes.

   Matt

>
>
>> I tried to call KSPSolve twice, but the second solve converged in 0
>> iterations. KSPSolve seems to remember the solution. How can I force both
>> solves to start from the same initial guess?
>>
>
> Did you zero the solution vector between solves? VecSet(x, 0.0);
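>
> A minimal sketch of the two-solve pattern (assuming a KSP named ksp and
> vectors b and x; the VecSet matters when the KSP is configured to use a
> nonzero initial guess):
>
>   KSPSolve(ksp, b, x);   /* first solve: pays all the setup costs */
>   VecSet(x, 0.0);        /* zero x so both solves start from the same guess */
>   KSPSolve(ksp, b, x);   /* second solve: identical system, no setup */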
>
>   Matt
>
>
>> Thank you.
>>
>> Frank
>>
>>
>>
>> On 10/04/2016 12:56 PM, Dave May wrote:
>>
>>
>>
>> On Tuesday, 4 October 2016, frank <hengjiew at uci.edu> wrote:
>>
>>> Hi,
>>> This question is a follow-up to the thread "Question about memory usage in
>>> Multigrid preconditioner".
>>> I used to have the "Out of Memory (OOM)" problem when using the
>>> CG+Telescope MG solver with 32768 cores. Adding the options "-matrap 0
>>> -matptap_scalable" did solve that problem.
>>>
>>> Then I tested the scalability by solving a 3D Poisson equation for 1 step. I
>>> used one sub-communicator in all the tests. The only differences between the
>>> PETSc options in those tests are: (1) the pc_telescope_reduction_factor;
>>> (2) the number of multigrid levels in the up/down solver. The function
>>> KSPSolve is timed. It is kind of slow and doesn't scale at all.
>>>
>>> Test1: 512^3 grid points
>>> Cores    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
>>> 512      8                             4 / 3                  6.2466
>>> 4096     64                            5 / 3                  0.9361
>>> 32768    64                            4 / 3                  4.8914
>>>
>>> Test2: 1024^3 grid points
>>> Cores    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
>>> 4096     64                            5 / 4                  3.4139
>>> 8192     128                           5 / 4                  2.4196
>>> 16384    32                            5 / 3                  5.4150
>>> 32768    64                            5 / 3                  5.6067
>>> 65536    128                           5 / 3                  6.5219
>>>
>>
>> You have to be very careful how you interpret these numbers. Your solver
>> contains nested calls to KSPSolve, and unfortunately as a result the
>> numbers you report include setup time. This will remain true even if you
>> call KSPSetUp on the outermost KSP.
>>
>> Your email concerns scalability of the solver application, so let's focus
>> on that issue.
>>
>> The only way to clearly separate setup from solve time is to perform two
>> identical solves. The second solve will not require any setup. You should
>> monitor the second solve via a new PetscLogStage.
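>>
>> A minimal sketch of such staging (assuming a KSP named ksp and vectors b
>> and x; the stage name is arbitrary):
>>
>>   PetscLogStage stage;
>>   PetscLogStageRegister("Solve2", &stage);
>>   KSPSolve(ksp, b, x);            /* first solve: includes all setup */
>>   VecSet(x, 0.0);                 /* reset the initial guess */
>>   PetscLogStagePush(stage);
>>   KSPSolve(ksp, b, x);            /* second solve: setup-free */
>>   PetscLogStagePop();
>>
>> The -log_view (-log_summary) output then reports the second solve's time
>> under its own stage, separate from the setup-laden first solve.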
>>
>> This is what I did in the telescope paper. It was the only way to
>> understand the setup cost (and scaling) versus the solve time (and scaling).
>>
>> Thanks
>>   Dave
>>
>>
>>
>>> I guess I didn't set the MG levels properly. What would be an efficient
>>> way to arrange the MG levels?
>>> Also, which preconditioner should I use at the coarse mesh of the 2nd
>>> communicator to improve the performance?
>>>
>>> I attached the test code and the PETSc options file for the 1024^3 cube
>>> with 32768 cores.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Frank
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 09/15/2016 03:35 AM, Dave May wrote:
>>>
>>> Hi all,
>>>
>>> The only unexpected memory usage I can see is associated with the call
>>> to MatPtAP().
>>> Here is something you can try immediately.
>>> Run your code with the additional options
>>>   -matrap 0 -matptap_scalable
>>>
>>> I didn't realize this before, but the default behaviour of MatPtAP in
>>> parallel is actually to explicitly form the transpose of P (i.e.
>>> assemble R = P^T) and then compute R.A.P.
>>> You don't want to do this. The option -matrap 0 resolves this issue.
>>>
>>> The implementation of P^T.A.P has two variants.
>>> The scalable implementation (with respect to memory usage) is selected
>>> via the second option -matptap_scalable.
>>>
>>> Try it out - I see a significant memory reduction using these options
>>> for particular mesh sizes / partitions.
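>>>
>>> For example, appended to a hypothetical launch line (the executable name
>>> is illustrative; the -N/-P arguments mirror the test code discussed below):
>>>
>>>   mpiexec -n 32768 ./test_ksp -N 1024 -P 32 -matrap 0 -matptap_scalable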
>>>
>>> I've attached a cleaned-up version of the code you sent me.
>>> There were a number of memory leaks and other issues.
>>> The main points being:
>>>   * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}
>>> (see the sketch below)
>>>   * You should call PetscFinalize(), otherwise the option -log_summary
>>> (-log_view) will not display anything once the program has completed.
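>>>
>>> A C-API sketch of that ordering (illustrative only, not Frank's actual
>>> Fortran code; da and x are assumed names):
>>>
>>>   PetscScalar ***a;
>>>   DMDAVecGetArray(da, x, &a);
>>>   /* ... fill a[k][j][i] with local values ... */
>>>   DMDAVecRestoreArray(da, x, &a);
>>>   VecAssemblyBegin(x);
>>>   VecAssemblyEnd(x);
>>>   /* ... solve ... */
>>>   PetscFinalize();   /* without this, -log_summary/-log_view print nothing */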
>>>
>>>
>>> Thanks,
>>>   Dave
>>>
>>>
>>> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew at uci.edu> wrote:
>>>
>>>> Hi Dave,
>>>>
>>>> Sorry, I should have put more comments in to explain the code.
>>>> The number of processes in each dimension is the same: Px = Py = Pz = P.
>>>> So is the domain size.
>>>> So if you want to run the code on a 512^3 grid with 16^3 cores, you
>>>> need to set "-N 512 -P 16" on the command line.
>>>> I added more comments and also fixed an error in the attached code. (The
>>>> error only affects the accuracy of the solution, not the memory usage.)
>>>>
>>>> Thank you.
>>>> Frank
>>>>
>>>>
>>>> On 9/14/2016 9:05 PM, Dave May wrote:
>>>>
>>>>
>>>>
>>>> On Thursday, 15 September 2016, Dave May <dave.mayhem23 at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thursday, 15 September 2016, frank <hengjiew at uci.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I wrote a simple code to reproduce the error. I hope this can help
>>>>>> to diagnose the problem.
>>>>>> The code just solves a 3D Poisson equation.
>>>>>>
>>>>>
>>>>> Why is the stencil width a runtime parameter? And why is the default
>>>>> value 2? For a 7-point FD Laplacian, you only need a stencil width of 1.
>>>>>
>>>>> Was this choice made to mimic something in the real application code?
>>>>>
>>>>
>>>> Please ignore - I misunderstood your usage of the parameter set by -P
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> I ran the code on a 1024^3 mesh. The process partition is 32 * 32 *
>>>>>> 32. That's when I reproduce the OOM error. Each core has about 2 GB of
>>>>>> memory.
>>>>>> I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. The
>>>>>> KSP solver works fine.
>>>>>> I attached the code, ksp_view_pre's output and my PETSc options file.
>>>>>>
>>>>>> Thank you.
>>>>>> Frank
>>>>>>
>>>>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>>>>
>>>>>> Hi Barry,
>>>>>>
>>>>>> I checked. On the supercomputer, I had the option "-ksp_view_pre" but
>>>>>> it is not in the file I sent you. I am sorry for the confusion.
>>>>>>
>>>>>> Regards,
>>>>>> Frank
>>>>>>
>>>>>> On Friday, September 9, 2016, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>
>>>>>>>
>>>>>>> > On Sep 9, 2016, at 3:11 PM, frank <hengjiew at uci.edu> wrote:
>>>>>>> >
>>>>>>> > Hi Barry,
>>>>>>> >
>>>>>>> > I think the first KSP view output is from -ksp_view_pre. Before I
>>>>>>> submitted the test, I was not sure whether there would be an OOM error or
>>>>>>> not, so I added both -ksp_view_pre and -ksp_view.
>>>>>>>
>>>>>>>   But the options file you sent specifically does NOT list the
>>>>>>> -ksp_view_pre so how could it be from that?
>>>>>>>
>>>>>>>    Sorry to be pedantic but I've spent too much time in the past
>>>>>>> trying to debug from incorrect information and want to make sure that the
>>>>>>> information I have is correct before thinking. Please recheck exactly what
>>>>>>> happened. Rerun with the exact input file you emailed if that is needed.
>>>>>>>
>>>>>>>    Barry
>>>>>>>
>>>>>>> >
>>>>>>> > Frank
>>>>>>> >
>>>>>>> >
>>>>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>>>>> >>   Why does ksp_view2.txt have two KSP views in it while
>>>>>>> ksp_view1.txt has only one KSP view in it? Did you run two different solves
>>>>>>> in the 2nd case but not the 1st?
>>>>>>> >>
>>>>>>> >>   Barry
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>> On Sep 9, 2016, at 10:56 AM, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi,
>>>>>>> >>>
>>>>>>> >>> I want to continue digging into the memory problem here.
>>>>>>> >>> I did find a workaround in the past, which is to use fewer cores
>>>>>>> per node so that each core has 8 GB of memory. However, this is inefficient
>>>>>>> and expensive. I hope to locate the place that uses the most memory.
>>>>>>> >>>
>>>>>>> >>> Here is a brief summary of the tests I did in the past:
>>>>>>> >>>
>>>>>>> >>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>>>>> >>>   Maximum (over computational time) process memory:         total 7.0727e+08
>>>>>>> >>>   Current process memory:                                   total 7.0727e+08
>>>>>>> >>>   Maximum (over computational time) space PetscMalloc()ed:  total 6.3908e+11
>>>>>>> >>>   Current space PetscMalloc()ed:                            total 1.8275e+09
>>>>>>> >>>
>>>>>>> >>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>>>>> >>>   Maximum (over computational time) process memory:         total 5.9431e+09
>>>>>>> >>>   Current process memory:                                   total 5.9431e+09
>>>>>>> >>>   Maximum (over computational time) space PetscMalloc()ed:  total 5.3202e+12
>>>>>>> >>>   Current space PetscMalloc()ed:                            total 5.4844e+09
>>>>>>> >>>
>>>>>>> >>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>>>>> >>>   The OOM (Out Of Memory) killer of the supercomputer terminated
>>>>>>> the job during KSPSolve.
>>>>>>> >>>
>>>>>>> >>> I attached the output of ksp_view (the third test's output is
>>>>>>> from ksp_view_pre), memory_view, and also the PETSc options.
>>>>>>> >>>
>>>>>>> >>> In all the tests, each core can access about 2 GB of memory. In
>>>>>>> Test3, there are 4223139840 non-zeros in the matrix. This will consume
>>>>>>> about 1.74 MB per process, using double precision. Considering some extra
>>>>>>> memory used to store the integer indices, 2 GB should still be more than
>>>>>>> enough.
>>>>>>> >>>
>>>>>>> >>> Is there a way to find out which part of KSPSolve uses the most
>>>>>>> memory?
>>>>>>> >>> Thank you so much.
>>>>>>> >>>
>>>>>>> >>> BTW, there are 4 options that remain unused and I don't understand
>>>>>>> why they are omitted:
>>>>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> Regards,
>>>>>>> >>> Frank
>>>>>>> >>>
>>>>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>>>> >>>>
>>>>>>> >>>> On 14 July 2016 at 01:07, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>> Hi Dave,
>>>>>>> >>>>
>>>>>>> >>>> Sorry for the late reply.
>>>>>>> >>>> Thank you so much for your detailed reply.
>>>>>>> >>>>
>>>>>>> >>>> I have a question about the estimation of the memory usage.
>>>>>>> There are 4223139840 allocated non-zeros and 18432 MPI processes. Double
>>>>>>> precision is used. So the memory per process is:
>>>>>>> >>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
>>>>>>> >>>> Did I do something wrong here? Because this seems too small.
>>>>>>> >>>>
>>>>>>> >>>> No - I totally f***ed it up. You are correct. That'll teach me
>>>>>>> for fumbling around with my iphone calculator and not using my brain. (Note
>>>>>>> that to convert to MB just divide by 1e6, not 1024^2 - although I
>>>>>>> apparently cannot convert between units correctly....)
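>>>>>>> >>>> (Worked out both ways: 4223139840 * 8 bytes = 3.38e10 bytes;
>>>>>>> divided over 18432 ranks that is ~1.83e6 bytes per rank, i.e. ~1.75 MiB
>>>>>>> dividing by 1024^2, or ~1.83 MB dividing by 1e6.)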
>>>>>>> >>>>
>>>>>>> >>>> From the PETSc objects associated with the solver, it looks
>>>>>>> like it _should_ run with 2 GB per MPI rank. Sorry for my mistake.
>>>>>>> Possibilities are: somewhere in your usage of PETSc you've introduced a
>>>>>>> memory leak; PETSc is doing a huge over-allocation (e.g. as per our
>>>>>>> discussion of MatPtAP); or in your application code there are other objects
>>>>>>> you have forgotten to log the memory for.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I am running this job on Blue Waters.
>>>>>>> >>>> I am using a 7-point FD stencil in 3D.
>>>>>>> >>>>
>>>>>>> >>>> I thought so on both counts.
>>>>>>> >>>>
>>>>>>> >>>> I apologize that I made a stupid mistake in computing the
>>>>>>> memory per core. My settings meant each core could access only 2 GB of
>>>>>>> memory on average, instead of the 8 GB I mentioned in the previous email. I
>>>>>>> re-ran the job with 8 GB of memory per core on average and there was no
>>>>>>> "Out Of Memory" error. I will do more tests to see if there is still some
>>>>>>> memory issue.
>>>>>>> >>>>
>>>>>>> >>>> Ok. I'd still like to know where the memory was being used
>>>>>>> since my estimates were off.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> Thanks,
>>>>>>> >>>>   Dave
>>>>>>> >>>>
>>>>>>> >>>> Regards,
>>>>>>> >>>> Frank
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>>>>> >>>>> Hi Frank,
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>>> Hi Dave,
>>>>>>> >>>>>
>>>>>>> >>>>> I re-ran the test using bjacobi as the preconditioner on the
>>>>>>> coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is
>>>>>>> 96*8*24. The PETSc options file is attached.
>>>>>>> >>>>> I still got the "Out Of Memory" error. The error occurred
>>>>>>> before the linear solver finished one step, so I don't have the full info
>>>>>>> from ksp_view. The info from ksp_view_pre is attached.
>>>>>>> >>>>>
>>>>>>> >>>>> Okay - that is essentially useless (sorry)
>>>>>>> >>>>>
>>>>>>> >>>>> It seems to me that the error occurred when the decomposition
>>>>>>> was going to be changed.
>>>>>>> >>>>>
>>>>>>> >>>>> Based on what information?
>>>>>>> >>>>> Running with -info would give us more clues, but will create a
>>>>>>> ton of output.
>>>>>>> >>>>> Please try running the case which failed with -info.
>>>>>>> >>>>>
>>>>>>> >>>>> I had another test with a grid of 1536*128*384 and the same
>>>>>>> process mesh as above. There was no error. The ksp_view info is attached
>>>>>>> for comparison.
>>>>>>> >>>>> Thank you.
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> [3] Here is my crude estimate of your memory usage.
>>>>>>> >>>>> I'll target the biggest memory hogs only to get an order of
>>>>>>> magnitude estimate
>>>>>>> >>>>>
>>>>>>> >>>>> * The Fine grid operator contains 4223139840 non-zeros --> 1.8
>>>>>>> GB per MPI rank assuming double precision.
>>>>>>> >>>>> The indices for the AIJ could amount to another 0.3 GB
>>>>>>> (assuming 32 bit integers)
>>>>>>> >>>>>
>>>>>>> >>>>> * You use 5 levels of coarsening, so the other operators
>>>>>>> should represent (collectively)
>>>>>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per MPI rank
>>>>>>> on the communicator with 18432 ranks.
>>>>>>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the
>>>>>>> communicator with 18432 ranks.
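>>>>>>> >>>>> (Worked out: 2.1 * (1/8 + 1/64 + 1/512 + 1/4096) ≈ 2.1 * 0.143
>>>>>>> ≈ 0.30 GB, i.e. the ~300 MB quoted above.)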
>>>>>>> >>>>>
>>>>>>> >>>>> * You use a reduction factor of 64, making a new
>>>>>>> communicator with 288 MPI ranks.
>>>>>>> >>>>> PCTelescope will first gather a temporary matrix associated
>>>>>>> with your coarse level operator, assuming a comm size of 288, living on the
>>>>>>> comm with size 18432.
>>>>>>> >>>>> This matrix will require approximately 0.5 * 64 = 32 MB per
>>>>>>> core on the 288 ranks.
>>>>>>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the
>>>>>>> subcomm, thus requiring another 32 MB per rank.
>>>>>>> >>>>> The temporary matrix is now destroyed.
>>>>>>> >>>>> The temporary matrix is now destroyed.
>>>>>>> >>>>>
>>>>>>> >>>>> * Because a DMDA is detected, a permutation matrix is
>>>>>>> assembled.
>>>>>>> >>>>> This requires 2 doubles per point in the DMDA.
>>>>>>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>>>>>> >>>>> Thus the permutation matrix will require < 1 MB per MPI rank
>>>>>>> on the sub-comm.
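>>>>>>> >>>>> (Worked out: 92*16*48 = 70656 points * 2 doubles * 8 bytes ≈
>>>>>>> 1.1 MB in total, so comfortably under 1 MB per rank across the 288 ranks.)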
>>>>>>> >>>>>
>>>>>>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the
>>>>>>> resulting operator will have the same memory footprint as the unpermuted
>>>>>>> matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB
>>>>>>> are held in memory when the DMDA is provided.
>>>>>>> >>>>>
>>>>>>> >>>>> From my rough estimates, the worst-case memory footprint for
>>>>>>> any given core, given your options, is approximately
>>>>>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
>>>>>>> >>>>> This is way below 8 GB.
>>>>>>> >>>>>
>>>>>>> >>>>> Note this estimate completely ignores:
>>>>>>> >>>>> (1) the memory required for the restriction operator,
>>>>>>> >>>>> (2) the potential growth in the number of non-zeros per row
>>>>>>> due to Galerkin coarsening (I wish -ksp_view_pre reported the output from
>>>>>>> MatView so we could see the number of non-zeros required by the coarse
>>>>>>> level operators)
>>>>>>> >>>>> (3) all temporary vectors required by the CG solver, and those
>>>>>>> required by the smoothers.
>>>>>>> >>>>> (4) internal memory allocated by MatPtAP
>>>>>>> >>>>> (5) memory associated with IS's used within PCTelescope
>>>>>>> >>>>>
>>>>>>> >>>>> So either I am completely off in my estimates, or you have not
>>>>>>> carefully estimated the memory usage of your application code. Hopefully
>>>>>>> others might examine/correct my rough estimates.
>>>>>>> >>>>>
>>>>>>> >>>>> Since I don't have your code I cannot assess the latter.
>>>>>>> >>>>> Since I don't have access to the same machine you are running
>>>>>>> on, I think we need to take a step back.
>>>>>>> >>>>>
>>>>>>> >>>>> [1] What machine are you running on? Send me a URL if it's
>>>>>>> available.
>>>>>>> >>>>>
>>>>>>> >>>>> [2] What discretization are you using? (I am guessing a scalar
>>>>>>> 7-point FD stencil.)
>>>>>>> >>>>> If it's a 7-point FD stencil, we should be able to examine the
>>>>>>> memory usage of your solver configuration using a standard, lightweight,
>>>>>>> existing PETSc example, run on your machine at the same scale.
>>>>>>> >>>>> This would hopefully enable us to correctly evaluate the
>>>>>>> actual memory usage required by the solver configuration you are using.
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks,
>>>>>>> >>>>>   Dave
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> Frank
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>>>> Hi Barry and Dave,
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thank both of you for the advice.
>>>>>>> >>>>>>
>>>>>>> >>>>>> @Barry
>>>>>>> >>>>>> I made a mistake in the file names in the last email. I attached
>>>>>>> the correct files this time.
>>>>>>> >>>>>> For all the three tests, 'Telescope' is used as the coarse
>>>>>>> preconditioner.
>>>>>>> >>>>>>
>>>>>>> >>>>>> == Test1: Grid: 1536*128*384,   Process Mesh: 48*4*12
>>>>>>> >>>>>> Part of the memory usage:  Vector   125   124   3971904   0.
>>>>>>> >>>>>>                            Matrix   101   101   9462372   0.
>>>>>>> >>>>>>
>>>>>>> >>>>>> == Test2: Grid: 1536*128*384,   Process Mesh: 96*8*24
>>>>>>> >>>>>> Part of the memory usage:  Vector   125   124   681672    0.
>>>>>>> >>>>>>                            Matrix   101   101   1462180   0.
>>>>>>> >>>>>>
>>>>>>> >>>>>> In theory, the memory usage in Test1 should be 8 times that of
>>>>>>> Test2. In my case, it is about 6 times.
>>>>>>> >>>>>>
>>>>>>> >>>>>> == Test3: Grid: 3072*256*768,   Process Mesh: 96*8*24.
>>>>>>> Sub-domain per process: 32*32*32
>>>>>>> >>>>>> Here I get the out-of-memory error.
>>>>>>> >>>>>>
>>>>>>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need
>>>>>>> to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>>>>>> >>>>>> The linear solver didn't work in this case. PETSc output some
>>>>>>> errors.
>>>>>>> >>>>>>
>>>>>>> >>>>>> @Dave
>>>>>>> >>>>>> In Test3, I use only one instance of 'Telescope'. On the
>>>>>>> coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.
>>>>>>> >>>>>> If I set the levels correctly, then on the last coarse mesh
>>>>>>> of MG, where it calls 'Telescope', the sub-domain per process is 2*2*2.
>>>>>>> >>>>>> On the last coarse mesh of 'Telescope', there is only one
>>>>>>> grid point per process.
>>>>>>> >>>>>> I still got the OOM error. The detailed PETSc options file is
>>>>>>> attached.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Do you understand the expected memory usage for the
>>>>>>> particular parallel LU implementation you are using? I don't (seriously).
>>>>>>> Replace LU with bjacobi and re-run this test. My point about solver
>>>>>>> debugging is still valid.
>>>>>>> >>>>>>
>>>>>>> >>>>>> And please send the result of KSPView so we can see what is
>>>>>>> actually used in the computations
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks
>>>>>>> >>>>>>   Dave
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thank you so much.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Frank
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> Hi Barry,
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thank you for your advice.
>>>>>>> >>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768
>>>>>>> and the process mesh is 96*8*24.
>>>>>>> >>>>>> The linear solver is 'cg', the preconditioner is 'mg', and
>>>>>>> 'telescope' is used as the preconditioner on the coarse mesh.
>>>>>>> >>>>>> The system gives me the "Out of Memory" error before the
>>>>>>> linear system is completely solved.
>>>>>>> >>>>>> The info from '-ksp_view_pre' is attached. It seems to me that
>>>>>>> the error occurs when it reaches the coarse mesh.
>>>>>>> >>>>>>
>>>>>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and a process mesh of
>>>>>>> 96*8*24. The 3rd test uses the same grid but a different process mesh,
>>>>>>> 48*4*12.
>>>>>>> >>>>>>     Are you sure this is right? The total matrix and vector
>>>>>>> memory usage goes from the 2nd test
>>>>>>> >>>>>>                Vector   384   383    8,193,712   0.
>>>>>>> >>>>>>                Matrix   103   103   11,508,688   0.
>>>>>>> >>>>>> to the 3rd test
>>>>>>> >>>>>>                Vector   384   383    1,590,520   0.
>>>>>>> >>>>>>                Matrix   103   103    3,508,664   0.
>>>>>>> >>>>>> that is, the memory usage got smaller, but if you have only
>>>>>>> 1/8th the processes and the same grid it should have gotten about 8 times
>>>>>>> bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still
>>>>>>> doesn't explain it, because the memory usage changed by a factor of 5
>>>>>>> something for the vectors and 3 something for the matrices.
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> The linear solver and PETSc options in the 2nd and 3rd tests
>>>>>>> are the same as in the 1st test. The linear solver works fine in both
>>>>>>> tests.
>>>>>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The
>>>>>>> memory info is from the option '-log_summary'. I tried to use
>>>>>>> '-memory_info' as you suggested, but in my case PETSc treated it as an
>>>>>>> unused option. It output nothing about the memory. Do I need to add
>>>>>>> something to my code so I can use '-memory_info'?
>>>>>>> >>>>>>     Sorry, my mistake: the option is -memory_view.
>>>>>>> >>>>>>
>>>>>>> >>>>>>    Can you run the one case with -memory_view and -mg_coarse
>>>>>>> jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much
>>>>>>> memory is used without the telescope? Also run case 2 the same way.
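>>>>>>> >>>>>> A sketch of such a run (the executable name is hypothetical;
>>>>>>> -mg_coarse_pc_type jacobi is the fully spelled-out form of the option):
>>>>>>> >>>>>>
>>>>>>> >>>>>>   mpiexec -n 18432 ./solver -memory_view -log_view \
>>>>>>> >>>>>>       -mg_coarse_pc_type jacobi -ksp_max_it 1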
>>>>>>> >>>>>>
>>>>>>> >>>>>>    Barry
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> In both tests the memory usage is not large.
>>>>>>> >>>>>>
>>>>>>> >>>>>> It seems to me that it might be the 'telescope'
>>>>>>> preconditioner that allocated a lot of memory and caused the error in the
>>>>>>> 1st test.
>>>>>>> >>>>>> Is there a way to show how much memory it allocated?
>>>>>>> >>>>>>
>>>>>>> >>>>>> Frank
>>>>>>> >>>>>>
>>>>>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>>>>> >>>>>>    Frank,
>>>>>>> >>>>>>
>>>>>>> >>>>>>      You can run with -ksp_view_pre to have it "view" the KSP
>>>>>>> before the solve so hopefully it gets that far.
>>>>>>> >>>>>>
>>>>>>> >>>>>>       Please run the problem that does fit with -memory_info;
>>>>>>> when the problem completes it will show the "high water mark" for PETSc
>>>>>>> allocated memory and total memory used. We first want to look at these
>>>>>>> numbers to see if it is using more memory than you expect. You could also
>>>>>>> run with, say, half the grid spacing to see how the memory usage scales
>>>>>>> with the increase in grid points. Make the runs also with -log_view and
>>>>>>> send all the output from these options.
>>>>>>> >>>>>>
>>>>>>> >>>>>>     Barry
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengjiew at uci.edu> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> Hi,
>>>>>>> >>>>>>
>>>>>>> >>>>>> I am using the CG KSP solver and a multigrid preconditioner to
>>>>>>> solve a linear system in parallel.
>>>>>>> >>>>>> I chose 'Telescope' as the preconditioner on the coarse mesh
>>>>>>> for its good performance.
>>>>>>> >>>>>> The petsc options file is attached.
>>>>>>> >>>>>>
>>>>>>> >>>>>> The domain is a 3D box.
>>>>>>> >>>>>> It works well when the grid is 1536*128*384 and the process
>>>>>>> mesh is 96*8*24. When I double the size of the grid and keep the same
>>>>>>> process mesh and PETSc options, I get an "out of memory" error from the
>>>>>>> super-cluster I am using.
>>>>>>> >>>>>> Each process has access to at least 8 GB of memory, which should
>>>>>>> be more than enough for my application. I am sure that all the other parts
>>>>>>> of my code (except the linear solver) do not use much memory, so I suspect
>>>>>>> something is wrong with the linear solver.
>>>>>>> >>>>>> The error occurs before the linear system is completely
>>>>>>> solved, so I don't have the info from ksp_view. I am not able to reproduce
>>>>>>> the error with a smaller problem either.
>>>>>>> >>>>>> In addition, I tried to use block Jacobi as the
>>>>>>> preconditioner with the same grid and the same decomposition. The linear
>>>>>>> solver runs extremely slowly, but there is no memory error.
>>>>>>> >>>>>>
>>>>>>> >>>>>> How can I diagnose what exactly causes the error?
>>>>>>> >>>>>> Thank you so much.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Frank
>>>>>>> >>>>>> <petsc_options.txt>
>>>>>>> >>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>>>> >>>>>>
>>>>>>> >>>>>
>>>>>>> >>>>
>>>>>>> >>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

