<br><br>On Thursday, 15 September 2016, Dave May <<a href="mailto:dave.mayhem23@gmail.com">dave.mayhem23@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br><br>On Thursday, 15 September 2016, frank <<a href="javascript:_e(%7B%7D,'cvml','hengjiew@uci.edu');" target="_blank">hengjiew@uci.edu</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    Hi, <br>

    <br>

    I wrote a simple code to reproduce the error. I hope it helps

    to diagnose the problem.<br>

    The code just solves a 3D Poisson equation. </div></blockquote><div><br></div><div>Why is the stencil width a runtime parameter?? And why is the default value 2? For a 7-point FD Laplacian, you only need a stencil width of 1. </div><div><br></div><div>Was this choice made to mimic something in the real application code?</div></blockquote><div><br></div>Please ignore - I misunderstood your usage of the param set by -P<div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><br>

    I ran the code on a 1024^3 mesh with a process partition of 32 * 32 *

    32. That is where I reproduce the OOM error. Each core has about 2 GB of

    memory.<br>

    I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. There, the

    KSP solver works fine. <br>

    I attached the code, the output of -ksp_view_pre and my PETSc options file.<br>

    <br>

    Thank you.<br>

    Frank<br>

    <div><br>

      On 09/09/2016 06:38 PM, Hengjie Wang wrote:<br>

    </div>

    <blockquote type="cite">Hi Barry, 

      <div><br>

      </div>

      <div>I checked. On the supercomputer, I had the option

        "-ksp_view_pre" but it is not in the file I sent you. I am sorry for

        the confusion.</div>

      <div><br>

      </div>

      <div>Regards,</div>

      <div>Frank<span></span><br>

        <br>

        On Friday, September 9, 2016, Barry Smith <<a>bsmith@mcs.anl.gov</a>>

        wrote:<br>

        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

          > On Sep 9, 2016, at 3:11 PM, frank <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          ><br>

          > Hi Barry,<br>

          ><br>

          > I think the first KSP view output is from -ksp_view_pre.

          Before I submitted the test, I was not sure whether there

          would be OOM error or not. So I added both -ksp_view_pre and

          -ksp_view.<br>

          <br>

            But the options file you sent specifically does NOT list the

          -ksp_view_pre so how could it be from that?<br>

          <br>

             Sorry to be pedantic but I've spent too much time in the

          past trying to debug from incorrect information and want to

          make sure that the information I have is correct before

          thinking. Please recheck exactly what happened. Rerun with the

          exact input file you emailed if that is needed.<br>

          <br>

             Barry<br>

          <br>

          ><br>

          > Frank<br>

          ><br>

          ><br>

          > On 09/09/2016 12:38 PM, Barry Smith wrote:<br>

          >>   Why does ksp_view2.txt have two KSP views in it

          while ksp_view1.txt has only one KSPView in it? Did you run

          two different solves in the 2 case but not the one?<br>

          >><br>

          >>   Barry<br>

          >><br>

          >><br>

          >><br>

          >>> On Sep 9, 2016, at 10:56 AM, frank <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>><br>

          >>> Hi,<br>

          >>><br>

          >>> I want to continue digging into the memory

          problem here.<br>

          >>> I did find a workaround in the past, which is to

          use fewer cores per node so that each core has 8G memory.

          However, this is inefficient and expensive. I hope to locate the

          place that uses the most memory.<br>

          >>><br>

          >>> Here is a brief summary of the tests I did in

          past:<br>

          >>>> Test1:   Mesh 1536*128*384  |  Process Mesh

          48*4*12<br>

          >>> Maximum (over computational time) process

          memory:           total 7.0727e+08<br>

          >>> Current process memory:                         

                                         total 7.0727e+08<br>

          >>> Maximum (over computational time) space

          PetscMalloc()ed:  total 6.3908e+11<br>

          >>> Current space PetscMalloc()ed:                   

                                      total 1.8275e+09<br>

          >>><br>

          >>>> Test2:    Mesh 1536*128*384  |  Process Mesh

          96*8*24<br>

          >>> Maximum (over computational time) process

          memory:           total 5.9431e+09<br>

          >>> Current process memory:                         

                                         total 5.9431e+09<br>

          >>> Maximum (over computational time) space

          PetscMalloc()ed:  total 5.3202e+12<br>

          >>> Current space PetscMalloc()ed:                   

                                       total 5.4844e+09<br>

          >>><br>

          >>>> Test3:    Mesh 3072*256*768  |  Process Mesh

          96*8*24<br>

          >>>     OOM( Out Of Memory ) killer of the

          supercomputer terminated the job during "KSPSolve".<br>

          >>><br>

          >>> I attached the output of ksp_view( the third

          test's output is from ksp_view_pre ), memory_view and also the

          petsc options.<br>

          >>><br>

          >>> In all the tests, each core can access about 2G

          memory. In test3, there are 4223139840 non-zeros in the

          matrix. This will consume about 1.74M, using double precision.

          Considering some extra memory used to store integer index, 2G

          memory should still be way enough.<br>

          >>><br>

          >>> Is there a way to find out which part of KSPSolve

          uses the most memory?<br>

          >>> Thank you so much.<br>

          >>><br>

          >>> BTW, 4 options remain unused and I

          don't understand why they are ignored:<br>

          >>> -mg_coarse_telescope_mg_coarse<wbr>_ksp_type

          value: preonly<br>

          >>> -mg_coarse_telescope_mg_coarse<wbr>_pc_type

          value: bjacobi<br>

          >>> -mg_coarse_telescope_mg_levels<wbr>_ksp_max_it

          value: 1<br>

          >>> -mg_coarse_telescope_mg_levels<wbr>_ksp_type

          value: richardson<br>

          >>><br>

          >>><br>

          >>> Regards,<br>

          >>> Frank<br>

          >>><br>

          >>> On 07/13/2016 05:47 PM, Dave May wrote:<br>

          >>>><br>

          >>>> On 14 July 2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>>> Hi Dave,<br>

          >>>><br>

          >>>> Sorry for the late reply.<br>

          >>>> Thank you so much for your detailed reply.<br>

          >>>><br>

          >>>> I have a question about the estimation of the

          memory usage. There are 4223139840 allocated non-zeros and

          18432 MPI processes. Double precision is used. So the memory

          per process is:<br>

          >>>>   4223139840 * 8bytes / 18432 / 1024 / 1024 =

          1.74M ?<br>

          >>>> Did I do sth wrong here? Because this seems

          too small.<br>

          >>>><br>

          >>>> No - I totally f***ed it up. You are correct.

          That'll teach me for fumbling around with my iphone calculator

          and not using my brain. (Note that to convert to MB just

          divide by 1e6, not 1024^2 - although I apparently cannot

          convert between units correctly....)<br>
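          A minimal sketch of the per-rank estimate discussed above, using only the figures quoted in this thread (4223139840 non-zeros, 18432 ranks, double precision). The 4-byte column index is an assumption (32-bit integers), and row offsets and any over-allocation are ignored.<br>

```python
# Back-of-the-envelope memory per MPI rank for the fine-grid matrix.
# Figures from the thread: 4223139840 allocated non-zeros, 18432 ranks,
# double precision (8 bytes per value).
NONZEROS = 4223139840
RANKS = 18432
BYTES_PER_VALUE = 8   # double precision
BYTES_PER_INDEX = 4   # assumption: 32-bit column indices


def per_rank_bytes(nonzeros, ranks, bytes_per_entry):
    """Average bytes per rank if the entries are spread evenly over the ranks."""
    return nonzeros * bytes_per_entry / ranks


values_bytes = per_rank_bytes(NONZEROS, RANKS, BYTES_PER_VALUE)
with_indices_bytes = per_rank_bytes(
    NONZEROS, RANKS, BYTES_PER_VALUE + BYTES_PER_INDEX
)

# Dividing by 1024**2 gives MiB (the 1.74 figure); dividing by 1e6 gives MB.
print("values only: %.2f MiB = %.2f MB" % (values_bytes / 1024**2, values_bytes / 1e6))
print("values + column indices: %.2f MB" % (with_indices_bytes / 1e6))
```

          Either way the fine-grid operator is on the order of a few MB per rank, which is why 2 GB per rank looks like it should be enough.<br>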

          >>>><br>

          >>>> From the PETSc objects associated with the

          solver, It looks like it _should_ run with 2GB per MPI rank.

          Sorry for my mistake. Possibilities are: somewhere in your

          usage of PETSc you've introduced a memory leak; PETSc is doing

          a huge over allocation (e.g. as per our discussion of

          MatPtAP); or in your application code there are other objects

          you have forgotten to log the memory for.<br>

          >>>><br>

          >>>><br>

          >>>><br>

          >>>> I am running this job on Blue Waters.<br>

          >>>> I am using the 7-point FD stencil in 3D.<br>

          >>>><br>

          >>>> I thought so on both counts.<br>

          >>>><br>

          >>>> I apologize; I made a stupid mistake in

          computing the memory per core. With my settings, each core

          can access only 2G memory on average, not the 8G that I

          mentioned in my previous email. I re-ran the job with 8G memory

          per core on average and there was no "Out Of Memory" error. I

          will do more tests to see if there is still a memory issue.<br>

          >>>><br>

          >>>> Ok. I'd still like to know where the memory

          was being used since my estimates were off.<br>

          >>>><br>

          >>>><br>

          >>>> Thanks,<br>

          >>>>   Dave<br>

          >>>><br>

          >>>> Regards,<br>

          >>>> Frank<br>

          >>>><br>

          >>>><br>

          >>>><br>

          >>>> On 07/11/2016 01:18 PM, Dave May wrote:<br>

          >>>>> Hi Frank,<br>

          >>>>><br>

          >>>>><br>

          >>>>> On 11 July 2016 at 19:14, frank <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>>>> Hi Dave,<br>

          >>>>><br>

          >>>>> I re-run the test using bjacobi as the

          preconditioner on the coarse mesh of telescope. The Grid is

          3072*256*768 and process mesh is 96*8*24. The petsc option

          file is attached.<br>

          >>>>> I still got the "Out Of Memory" error.

          The error occurred before the linear solver finished one step.

          So I don't have the full info from ksp_view. The info from

          ksp_view_pre is attached.<br>

          >>>>><br>

          >>>>> Okay - that is essentially useless

          (sorry)<br>

          >>>>><br>

          >>>>> It seems to me that the error occurred

          when the decomposition was going to be changed.<br>

          >>>>><br>

          >>>>> Based on what information?<br>

          >>>>> Running with -info would give us more

          clues, but will create a ton of output.<br>

          >>>>> Please try running the case which failed

          with -info<br>

          >>>>>  I had another test with a grid of

          1536*128*384 and the same process mesh as above. There was no

          error. The ksp_view info is attached for comparison.<br>

          >>>>> Thank you.<br>

          >>>>><br>

          >>>>><br>

          >>>>> [3] Here is my crude estimate of your

          memory usage.<br>

          >>>>> I'll target the biggest memory hogs only

          to get an order of magnitude estimate<br>

          >>>>><br>

          >>>>> * The Fine grid operator contains

          4223139840 non-zeros --> 1.8 GB per MPI rank assuming

          double precision.<br>

          >>>>> The indices for the AIJ could amount to

          another 0.3 GB (assuming 32 bit integers)<br>

          >>>>><br>

          >>>>> * You use 5 levels of coarsening, so the

          other operators should represent (collectively)<br>

          >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~

          300 MB per MPI rank on the communicator with 18432 ranks.<br>

          >>>>> The coarse grid should consume ~ 0.5 MB

          per MPI rank on the communicator with 18432 ranks.<br>

          >>>>><br>

          >>>>> * You use a reduction factor of 64,

          making the new communicator with 288 MPI ranks.<br>

          >>>>> PCTelescope will first gather a temporary

          matrix associated with your coarse level operator assuming a

          comm size of 288 living on the comm with size 18432.<br>

          >>>>> This matrix will require approximately

          0.5 * 64 = 32 MB per core on the 288 ranks.<br>

          >>>>> This matrix is then used to form a new

          MPIAIJ matrix on the subcomm, thus require another 32 MB per

          rank.<br>

          >>>>> The temporary matrix is now destroyed.<br>

          >>>>><br>

          >>>>> * Because a DMDA is detected, a

          permutation matrix is assembled.<br>

          >>>>> This requires 2 doubles per point in the

          DMDA.<br>

          >>>>> Your coarse DMDA contains 192 x 16 x 48

          points.<br>

          >>>>> Thus the permutation matrix will require

          < 1 MB per MPI rank on the sub-comm.<br>

          >>>>><br>

          >>>>> * Lastly, the matrix is permuted. This

          uses MatPtAP(), but the resulting operator will have the same

          memory footprint as the unpermuted matrix (32 MB). At any

          stage in PCTelescope, only 2 operators of size 32 MB are held

          in memory when the DMDA is provided.<br>

          >>>>><br>

          >>>>> From my rough estimates, the worst case

          memory foot print for any given core, given your options is

          approximately<br>

          >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB 

          = 2465 MB<br>

          >>>>> This is way below 8 GB.<br>
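          The arithmetic in this estimate can be re-checked with a short sketch. It takes the figures exactly as given in this message (2100 MB for the fine level, a factor of 8 between levels, 32 MB per telescope operator, 1 MB for the permutation matrix); the variable names are only illustrative.<br>

```python
# Re-check of the rough per-rank memory sum, using the numbers as stated
# in the message (all values in MB).
FINE_MB = 2100.0            # fine-grid values (1.8 GB) + indices (0.3 GB)
COARSENING_FACTOR = 8       # each 3D coarsening divides the unknowns by 8
N_COARSER_LEVELS = 4        # 5 levels in total, so 4 levels below the finest
TELESCOPE_OPERATOR_MB = 32.0
PERMUTATION_MB = 1.0

# Geometric series: 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 (here in MB).
coarser_mb = sum(
    FINE_MB / COARSENING_FACTOR**k for k in range(1, N_COARSER_LEVELS + 1)
)

# The message rounds the coarser levels to 300 MB before adding up.
total_mb = FINE_MB + round(coarser_mb) + 2 * TELESCOPE_OPERATOR_MB + PERMUTATION_MB

print("coarser levels: %.1f MB" % coarser_mb)
print("worst-case total: %.0f MB" % total_mb)
```

          The sum is dominated entirely by the fine-level figure, so any error in that one number carries straight through to the total.<br>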

          >>>>><br>

          >>>>> Note this estimate completely ignores:<br>

          >>>>> (1) the memory required for the

          restriction operator,<br>

          >>>>> (2) the potential growth in the number of

          non-zeros per row due to Galerkin coarsening (I wished

          -ksp_view_pre reported the output from MatView so we could see

          the number of non-zeros required by the coarse level

          operators)<br>

          >>>>> (3) all temporary vectors required by the

          CG solver, and those required by the smoothers.<br>

          >>>>> (4) internal memory allocated by MatPtAP<br>

          >>>>> (5) memory associated with IS's used

          within PCTelescope<br>

          >>>>><br>

          >>>>> So either I am completely off in my

          estimates, or you have not carefully estimated the memory

          usage of your application code. Hopefully others might

          examine/correct my rough estimates<br>

          >>>>><br>

          >>>>> Since I don't have your code I cannot

          assess the latter.<br>

          >>>>> Since I don't have access to the same

          machine you are running on, I think we need to take a step

          back.<br>

          >>>>><br>

          >>>>> [1] What machine are you running on? Send

          me a URL if it's available<br>

          >>>>><br>

          >>>>> [2] What discretization are you using? (I

          am guessing a scalar 7 point FD stencil)<br>

          >>>>> If it's a 7 point FD stencil, we should

          be able to examine the memory usage of your solver

          configuration using a standard, light weight existing PETSc

          example, run on your machine at the same scale.<br>

          >>>>> This would hopefully enable us to

          correctly evaluate the actual memory usage required by the

          solver configuration you are using.<br>

          >>>>><br>

          >>>>> Thanks,<br>

          >>>>>   Dave<br>

          >>>>><br>

          >>>>><br>

          >>>>> Frank<br>

          >>>>><br>

          >>>>><br>

          >>>>><br>

          >>>>><br>

          >>>>> On 07/08/2016 10:38 PM, Dave May wrote:<br>

          >>>>>><br>

          >>>>>> On Saturday, 9 July 2016, frank <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>>>>> Hi Barry and Dave,<br>

          >>>>>><br>

          >>>>>> Thank both of you for the advice.<br>

          >>>>>><br>

          >>>>>> @Barry<br>

          >>>>>> I made a mistake in the file names in

          last email. I attached the correct files this time.<br>

          >>>>>> For all the three tests, 'Telescope'

          is used as the coarse preconditioner.<br>

          >>>>>><br>

          >>>>>> == Test1:   Grid: 1536*128*384, 

           Process Mesh: 48*4*12<br>

          >>>>>> Part of the memory usage:  Vector 

           125            124 3971904     0.<br>

          >>>>>>                                     

                  Matrix   101 101      9462372     0<br>

          >>>>>><br>

          >>>>>> == Test2: Grid: 1536*128*384, 

           Process Mesh: 96*8*24<br>

          >>>>>> Part of the memory usage:  Vector 

           125            124 681672     0.<br>

          >>>>>>                                     

                  Matrix   101 101      1462180     0.<br>

          >>>>>><br>

          >>>>>> In theory, the memory usage in Test1

          should be 8 times of Test2. In my case, it is about 6 times.<br>
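          As a quick check of the "about 6 times" figure, the ratios can be computed directly from the Vector and Matrix memory numbers quoted above for Test1 and Test2. This is only a sketch; the dictionary layout is illustrative.<br>

```python
# Memory (bytes) reported for Test1 (48*4*12 ranks) and Test2 (96*8*24 ranks)
# on the same 1536*128*384 grid, as quoted above.
test1 = {"Vector": 3971904, "Matrix": 9462372}
test2 = {"Vector": 681672, "Matrix": 1462180}

# Test2 uses 8 times as many ranks, so ideally each rank holds 1/8 of the data.
expected_ratio = (96 * 8 * 24) / (48 * 4 * 12)

ratios = {name: test1[name] / test2[name] for name in test1}

print("expected ratio: %.1f" % expected_ratio)
for name, ratio in ratios.items():
    print("%s: Test1 / Test2 = %.2f" % (name, ratio))
```

          Both ratios fall short of 8, which is expected to some degree because per-rank overhead such as ghost points does not shrink in proportion to the local sub-domain.<br>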

          >>>>>><br>

          >>>>>> == Test3: Grid: 3072*256*768, 

           Process Mesh: 96*8*24. Sub-domain per process: 32*32*32<br>

          >>>>>> Here I get the out of memory error.<br>

          >>>>>><br>

          >>>>>> I tried to use -mg_coarse jacobi. In

          this way, I don't need to set -mg_coarse_ksp_type and

          -mg_coarse_pc_type explicitly, right?<br>

          >>>>>> The linear solver didn't work in this

          case. Petsc output some errors.<br>

          >>>>>><br>

          >>>>>> @Dave<br>

          >>>>>> In test3, I use only one instance of

          'Telescope'. On the coarse mesh of 'Telescope', I used LU as

          the preconditioner instead of SVD.<br>

          >>>>>> If I set the levels correctly, then

          on the last coarse mesh of MG where it calls 'Telescope', the

          sub-domain per process is 2*2*2.<br>

          >>>>>> On the last coarse mesh of

          'Telescope', there is only one grid point per process.<br>

          >>>>>> I still got the OOM error. The

          detailed petsc option file is attached.<br>
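          The sub-domain sizes mentioned here can be sketched as below. It assumes each multigrid coarsening halves the grid in every direction and that 5 levels means 4 coarsenings; the function names are illustrative and not part of PETSc.<br>

```python
# Local sub-domain size per process on a structured grid, before and
# after multigrid coarsening (assumption: factor-2 coarsening per direction).
GRID = (3072, 256, 768)
PROCS = (96, 8, 24)
COARSENINGS = 4  # assumption: 5 MG levels, so 4 coarsenings


def coarse_grid(grid, coarsenings):
    """Global grid size after the given number of factor-2 coarsenings."""
    return tuple(g // 2**coarsenings for g in grid)


def local_subdomain(grid, procs):
    """Points per process in each direction for an even decomposition."""
    return tuple(g // p for g, p in zip(grid, procs))


fine_local = local_subdomain(GRID, PROCS)
coarsest = coarse_grid(GRID, COARSENINGS)
coarse_local = local_subdomain(coarsest, PROCS)

print("fine sub-domain per process:", fine_local)
print("coarsest MG grid:", coarsest)
print("coarse sub-domain per process:", coarse_local)
```

          Under these assumptions the 32*32*32 fine sub-domain shrinks to 2*2*2 on the level where 'Telescope' takes over.<br>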

          >>>>>><br>

          >>>>>> Do you understand the expected memory

          usage for the particular parallel LU implementation you are

          using? I don't (seriously). Replace LU with bjacobi and re-run

          this test. My point about solver debugging is still valid.<br>

          >>>>>><br>

          >>>>>> And please send the result of KSPView

          so we can see what is actually used in the computations<br>

          >>>>>><br>

          >>>>>> Thanks<br>

          >>>>>>   Dave<br>

          >>>>>><br>

          >>>>>><br>

          >>>>>> Thank you so much.<br>

          >>>>>><br>

          >>>>>> Frank<br>

          >>>>>><br>

          >>>>>><br>

          >>>>>><br>

          >>>>>> On 07/06/2016 02:51 PM, Barry Smith

          wrote:<br>

          >>>>>> On Jul 6, 2016, at 4:19 PM, frank

          <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>>>>><br>

          >>>>>> Hi Barry,<br>

          >>>>>><br>

          >>>>>> Thank you for your advice.<br>

          >>>>>> I tried three tests. In the 1st test,

          the grid is 3072*256*768 and the process mesh is 96*8*24.<br>

          >>>>>> The linear solver is 'cg', the

          preconditioner is 'mg', and 'telescope' is used as the

          preconditioner on the coarse mesh.<br>

          >>>>>> The system gives me the "Out of

          Memory" error before the linear system is completely solved.<br>

          >>>>>> The info from '-ksp_view_pre' is

          attached. It seems to me that the error occurs when it reaches

          the coarse mesh.<br>

          >>>>>><br>

          >>>>>> The 2nd test uses a grid of

          1536*128*384 and process mesh is 96*8*24. The 3rd             

                                         test uses the same grid but a

          different process mesh 48*4*12.<br>

          >>>>>>     Are you sure this is right? The

          total matrix and vector memory usage goes from 2nd test<br>

          >>>>>>                Vector   384         

            383      8,193,712     0.<br>

          >>>>>>                Matrix   103         

            103     11,508,688     0.<br>

          >>>>>> to 3rd test<br>

          >>>>>>               Vector   384           

          383      1,590,520     0.<br>

          >>>>>>                Matrix   103         

            103      3,508,664     0.<br>

          >>>>>> that is the memory usage got smaller

          but if you have only 1/8th the processes and the same grid it

          should have gotten about 8 times bigger. Did you maybe cut the

          grid by a factor of 8 also? If so that still doesn't explain

          it because the memory usage changed by a factor of 5 something

          for the vectors and 3 something for the matrices.<br>

          >>>>>><br>

          >>>>>><br>

          >>>>>> The linear solver and petsc options

          in the 2nd and 3rd tests are the same as in the 1st test. The linear

          solver works fine in both tests.<br>

          >>>>>> I attached the memory usage of the

          2nd and 3rd tests. The memory info is from the option

          '-log_summary'. I tried to use '-memory_info' as you

          suggested, but in my case petsc treated it as an unused

          option. It output nothing about the memory. Do I need to add

          something to my code so I can use '-memory_info'?<br>

          >>>>>>     Sorry, my mistake the option is

          -memory_view<br>

          >>>>>><br>

          >>>>>>    Can you run the one case with

          -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it

          doesn't iterate forever) to see how much memory is used

          without the telescope? Also run case 2 the same way.<br>

          >>>>>><br>

          >>>>>>    Barry<br>

          >>>>>><br>

          >>>>>><br>

          >>>>>><br>

          >>>>>> In both tests the memory usage is not

          large.<br>

          >>>>>><br>

          >>>>>> It seems to me that it might be the

          'telescope'  preconditioner that allocated a lot of memory and

          caused the error in the 1st test.<br>

          >>>>>> Is there a way to show how much

          memory it allocated?<br>

          >>>>>><br>

          >>>>>> Frank<br>

          >>>>>><br>

          >>>>>> On 07/05/2016 03:37 PM, Barry Smith

          wrote:<br>

          >>>>>>    Frank,<br>

          >>>>>><br>

          >>>>>>      You can run with -ksp_view_pre

          to have it "view" the KSP before the solve so hopefully it

          gets that far.<br>

          >>>>>><br>

          >>>>>>       Please run the problem that

          does fit with -memory_info when the problem completes it will

          show the "high water mark" for PETSc allocated memory and

          total memory used. We first want to look at these numbers to

          see if it is using more memory than you expect. You could also

          run with say half the grid spacing to see how the memory usage

          scaled with the increase in grid points. Make the runs also

          with -log_view and send all the output from these options.<br>

          >>>>>><br>

          >>>>>>     Barry<br>

          >>>>>><br>

          >>>>>> On Jul 5, 2016, at 5:23 PM, frank

          <<a>hengjiew@uci.edu</a>>

          wrote:<br>

          >>>>>><br>

          >>>>>> Hi,<br>

          >>>>>><br>

          >>>>>> I am using the CG ksp solver and

          Multigrid preconditioner  to solve a linear system in

          parallel.<br>

          >>>>>> I chose to use the 'Telescope' as the

          preconditioner on the coarse mesh for its good performance.<br>

          >>>>>> The petsc options file is attached.<br>

          >>>>>><br>

          >>>>>> The domain is a 3d box.<br>

          >>>>>> It works well when the grid is 

          1536*128*384 and the process mesh is 96*8*24. When I double

          the size of grid and                                         

                 keep the same process mesh and petsc options, I get an

          "out of memory" error from the super-cluster I am using.<br>

          >>>>>> Each process has access to at least

          8G memory, which should be more than enough for my

          application. I am sure that all the other parts of my code

          (except the linear solver) do not use much memory. So I suspect

          there is something wrong with the linear solver.<br>

          >>>>>> The error occurs before the linear

          system is completely solved so I don't have the info from ksp

          view. I am not able to re-produce the error with a smaller

          problem either.<br>

          >>>>>> In addition,  I tried to use the

          block jacobi as the preconditioner with the same grid and same

          decomposition. The linear solver runs extremely slow but there

          is no memory error.<br>

          >>>>>><br>

          >>>>>> How can I diagnose what exactly causes

          the error?<br>

          >>>>>> Thank you so much.<br>

          >>>>>><br>

          >>>>>> Frank<br>

          >>>>>> <petsc_options.txt><br>

          >>>>>> <ksp_view_pre.txt><memory_test<wbr>2.txt><memory_test3.txt><petsc<wbr>_options.txt><br>

          >>>>>><br>

          >>>>><br>

          >>>><br>

          >>> <ksp_view1.txt><ksp_view2.txt><wbr><ksp_view3.txt><memory1.txt><m<wbr>emory2.txt><petsc_options1.txt<wbr>><petsc_options2.txt><petsc_op<wbr>tions3.txt><br>

          ><br>

          <br>

        </blockquote>

      </div>

    </blockquote>

    <br>

  </div></blockquote><div> </div>

</blockquote></div>