<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Hi,<br>
    <br>
    I want to continue digging into the memory problem here.  <br>
    I did find a workaround in the past, which is to use fewer cores per
    node so that each core has 8G of memory. However, this is inefficient and
    expensive, so I hope to locate the place that uses the most memory.<br>
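    (To be concrete about the workaround: it just means launching with fewer
    MPI ranks per node, e.g. something like "aprun -N 8 ..." instead of using
    every core on a node, so that each rank sees roughly 8G; the flag and the
    count here are only meant to illustrate what I do.)<br>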
    <br>
    Here is a brief summary of the tests I did in the past:<br>
    > Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12 <br>
    Maximum (over computational time) process memory:          total 7.0727e+08 <br>
    Current process memory:                                     total 7.0727e+08 <br>
    Maximum (over computational time) space PetscMalloc()ed:   total 6.3908e+11 <br>
    Current space PetscMalloc()ed:                              total 1.8275e+09 <br>
    <br>
    > Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24 <br>
    Maximum (over computational time) process memory:          total 5.9431e+09 <br>
    Current process memory:                                     total 5.9431e+09 <br>
    Maximum (over computational time) space PetscMalloc()ed:   total 5.3202e+12 <br>
    Current space PetscMalloc()ed:                              total 5.4844e+09 <br>
    <br>
    > Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24<br>
        The OOM (Out Of Memory) killer of the supercomputer terminated the
    job during "KSPSolve". <br>
    <br>
    I attached the output of ksp_view (the third test's output is from
    ksp_view_pre), the output of memory_view, and also the petsc options.<br>
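    (If I read the "total" column correctly as a sum over all processes, the
    maximum PetscMalloc()ed space works out to roughly 6.3908e+11 / 2304 ≈ 277 MB
    per process in Test1 and 5.3202e+12 / 18432 ≈ 289 MB per process in Test2;
    please correct me if "total" means something else here.)<br>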
    <br>
    In all the tests, each core can access about 2G memory. In test3,
    there are 4223139840 non-zeros in the matrix. The matrix values alone
    should consume only about 1.74M per process, using double precision.
    Even considering some extra memory used to store the integer indices,
    2G of memory per core should still be far more than enough.<br>
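    Spelling the arithmetic out so it is easy to check (the index estimate
    below is my own rough guess, assuming 32-bit integers):<br>
    4223139840 non-zeros * 8 bytes / 18432 processes ≈ 1.83e6 bytes ≈ 1.74 MiB
    per process for the matrix values;<br>
    4223139840 * 4 bytes / 18432 processes ≈ 9.2e5 bytes ≈ 0.87 MiB per process
    for the column indices, plus the comparatively tiny row offsets.<br>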
    <br>
    Is there a way to find out which part of KSPSolve uses the most
    memory? <br>
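    In case it helps to show what I am after, this is roughly what I was
    planning to add around the solve myself (an untested sketch based on the
    man pages of PetscMemoryGetCurrentUsage() and PetscMallocGetCurrentUsage();
    ksp, b, x and the other variable names are just placeholders, and the
    numbers are per process, so I would still have to reduce them over the
    communicator):<br>
    <br>
    PetscLogDouble mem0, mem1, mal0, mal1;   /* per-process values, in bytes */<br>
    PetscErrorCode ierr;<br>
    ierr = PetscMemoryGetCurrentUsage(&mem0);CHKERRQ(ierr);   /* resident set size */<br>
    ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);   /* space obtained via PetscMalloc */<br>
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);<br>
    ierr = PetscMemoryGetCurrentUsage(&mem1);CHKERRQ(ierr);<br>
    ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);<br>
    ierr = PetscPrintf(PETSC_COMM_WORLD,"KSPSolve delta: process %g bytes, PetscMalloc %g bytes\n",mem1-mem0,mal1-mal0);CHKERRQ(ierr);<br>
    <br>
    But this would only give me the change over the whole KSPSolve, not which
    component (the MG levels, Telescope, MatPtAP, ...) is responsible, so a
    finer-grained way would be very welcome.<br>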
    Thank you so much.<br>
    <br>
    BTW, there are 4 options that remain unused and I don't understand why
    they are ignored:<br>
    -mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
    -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
    -mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
    -mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
    <br>
    <br>
    Regards,<br>
    Frank<br>
    <br>
    <div class="moz-cite-prefix">On 07/13/2016 05:47 PM, Dave May wrote:<br>
    </div>
    <blockquote
cite="mid:CAJ98EDrRQfspLSv8kOuzVsXzH5bL2dfzdwu0VnhOJM2VbaxkWA@mail.gmail.com"
      type="cite">
      <div dir="ltr"><br>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On 14 July 2016 at 01:07, frank <span
              dir="ltr"><<a moz-do-not-send="true"
                href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
                <br>
                Sorry for the late reply.<br>
                Thank you so much for your detailed reply.<br>
                <br>
                I have a question about the estimation of the memory
                usage. There are 4223139840 allocated non-zeros and
                18432 MPI processes. Double precision is used. So the
                memory per process is:<br>
                  4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ? <br>
                Did I do something wrong here? Because this seems too small.<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>No - I totally f***ed it up. You are correct. That'll
              teach me for fumbling around with my iphone calculator and
              not using my brain. (Note that to convert to MB just
              divide by 1e6, not 1024^2 - although I apparently cannot
              convert between units correctly....)</div>
            <div><br>
            </div>
            <div>From the PETSc objects associated with the solver, it
              looks like it _should_ run with 2GB per MPI rank. Sorry
              for my mistake. Possibilities are: somewhere in your usage
              of PETSc you've introduced a memory leak; PETSc is doing a
              huge over-allocation (e.g. as per our discussion of
              MatPtAP); or in your application code there are other
              objects you have forgotten to log the memory for.</div>
            <div><br>
            </div>
            <div><br>
            </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> <br>
                I am running this job on <a moz-do-not-send="true"
                  href="https://bluewaters.ncsa.illinois.edu/user-guide"
                  target="_blank">Blue Waters</a> </div>
            </blockquote>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> I am using the
                7-point FD stencil in 3D. <br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>I thought so on both counts.</div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> <br>
                I apologize that I made a stupid mistake in computing
                the memory per core. With my settings each core can
                access only 2G memory on average, instead of the 8G I
                mentioned in the previous email. I re-ran the job with 8G
                memory per core on average and there is no "Out Of
                Memory" error. I will do more tests to see if there is
                still a memory issue.<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Ok. I'd still like to know where the memory was being
              used since my estimates were off.</div>
            <div><br>
            </div>
            <div><br>
            </div>
            <div>Thanks,</div>
            <div>  Dave</div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> <br>
                Regards,<br>
                Frank
                <div>
                  <div class="h5"><br>
                    <br>
                    <br>
                    <div>On 07/11/2016 01:18 PM, Dave May wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">Hi Frank,<br>
                        <br>
                        <div class="gmail_extra"><br>
                          <div class="gmail_quote">On 11 July 2016 at
                            19:14, frank <span dir="ltr"><<a
                                moz-do-not-send="true"
                                href="mailto:hengjiew@uci.edu"
                                target="_blank">hengjiew@uci.edu</a>></span>
                            wrote:<br>
                            <blockquote class="gmail_quote">
                              <div> Hi Dave,<br>
                                <br>
                                I re-ran the test using bjacobi as the
                                preconditioner on the coarse mesh of
                                telescope. The grid is 3072*256*768 and
                                the process mesh is 96*8*24. The petsc
                                option file is attached.<br>
                                I still got the "Out Of Memory" error.
                                The error occurred before the linear
                                solver finished one step, so I don't
                                have the full info from ksp_view. The
                                info from ksp_view_pre is attached.</div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div>Okay - that is essentially useless
                              (sorry)<br>
                            </div>
                            <div> </div>
                            <blockquote class="gmail_quote">
                              <div> <br>
                                It seems to me that the error occurred
                                when the decomposition was going to be
                                changed.<br>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div>Based on what information?<br>
                            </div>
                            <div>Running with -info would give us more
                              clues, but will create a ton of output.<br>
                            </div>
                            <div>Please try running the case which
                              failed with -info<br>
                            </div>
                            <div> </div>
                            <blockquote class="gmail_quote">
                              <div> I ran another test with a grid of
                                1536*128*384 and the same process mesh
                                as above. There was no error. The
                                ksp_view info is attached for
                                comparison.<br>
                                Thank you.</div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div><br>
                              [3] Here is my crude estimate of your
                              memory usage. <br>
                              I'll target the biggest memory hogs only
                              to get an order of magnitude estimate<br>
                              <br>
                              <div>* The Fine grid operator contains
                                4223139840 non-zeros --> 1.8 GB per
                                MPI rank assuming double precision.<br>
                              </div>
                              <div>The indices for the AIJ could amount
                                to another 0.3 GB (assuming 32 bit
                                integers)<br>
                              </div>
                              <div><br>
                                * You use 5 levels of coarsening, so the
                                other operators should represent
                                (collectively)  <br>
                                2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~
                                300 MB per MPI rank on the communicator
                                with 18432 ranks.<br>
                              </div>
                              <div>The coarse grid should consume ~ 0.5
                                MB per MPI rank on the communicator with
                                18432 ranks.</div>
                              <div><br>
                                * You use a reduction factor of 64,
                                making a new communicator with 288 MPI
                                ranks. <br>
                                PCTelescope will first gather a
                                temporary matrix associated with your
                                coarse level operator assuming a comm
                                size of 288 living on the comm with size
                                18432. <br>
                                This matrix will require approximately
                                0.5 * 64 = 32 MB per core on the 288
                                ranks. <br>
                                This matrix is then used to form a new
                                MPIAIJ matrix on the subcomm, thus
                                requiring another 32 MB per rank. <br>
                                The temporary matrix is now destroyed.<br>
                              </div>
                              <div><br>
                                * Because a DMDA is detected, a
                                permutation matrix is assembled. <br>
                                This requires 2 doubles per point in the
                                DMDA. <br>
                                Your coarse DMDA contains 92 x 16 x 48
                                points. <br>
                                Thus the permutation matrix will require
                                < 1 MB per MPI rank on the sub-comm.<br>
                                <br>
                              </div>
                              <div>* Lastly, the matrix is permuted.
                                This uses MatPtAP(), but the resulting
                                operator will have the same memory
                                footprint as the unpermuted matrix (32
                                MB). At any stage in PCTelescope, only 2
                                operators of size 32 MB are held in
                                memory when the DMDA is provided.<br>
                              </div>
                              <div><br>
                              </div>
                              <div>From my rough estimates, the worst
                                case memory footprint for any given
                                core, given your options, is
                                approximately <br>
                              </div>
                              <div>2100 MB + 300 MB + 32 MB + 32 MB + 1
                                MB  = 2465 MB<br>
                              </div>
                              <div>This is way below 8 GB.<br>
                                <br>
                                Note this estimate completely ignores:<br>
                                (1) the memory required for the
                                restriction operator, <br>
                                (2) the potential growth in the number
                                of non-zeros per row due to Galerkin
                                coarsening (I wish -ksp_view_pre
                                reported the output from MatView so we
                                could see the number of non-zeros
                                required by the coarse level operators)<br>
                              </div>
                              <div>(3) all temporary vectors required by
                                the CG solver, and those required by the
                                smoothers.<br>
                              </div>
                              <div>(4) internal memory allocated by
                                MatPtAP<br>
                              </div>
                              <div>(5) memory associated with IS's used
                                within PCTelescope<br>
                              </div>
                              <div><br>
                              </div>
                              So either I am completely off in my
                              estimates, or you have not carefully
                              estimated the memory usage of your
                              application code. Hopefully others might
                              examine/correct my rough estimates.<br>
                            </div>
                            <div>
                              <div><br>
                                Since I don't have your code I cannot
                                assess the latter.<br>
                                Since I don't have access to the same
                                machine you are running on, I think we
                                need to take a step back.<br>
                              </div>
                              <br>
                              [1] What machine are you running on? Send
                              me a URL if it's available.<br>
                            </div>
                            <div><br>
                              [2] What discretization are you using? (I
                              am guessing a scalar 7-point FD stencil)<br>
                            </div>
                            <div>If it's a 7 point FD stencil, we should
                              be able to examine the memory usage of
                              your solver configuration using a
                              standard, lightweight existing PETSc
                              example, run on your machine at the same
                              scale. <br>
                            </div>
                            <div>This would hopefully enable us to
                              correctly evaluate the actual memory usage
                              required by the solver configuration you
                              are using.<br>
                            </div>
                            <div><br>
                            </div>
                            <div>Thanks,<br>
                            </div>
                            <div>  Dave<br>
                            </div>
                            <div> </div>
                            <blockquote class="gmail_quote">
                              <div><span><br>
                                  <br>
                                  Frank</span>
                                <div>
                                  <div><br>
                                    <br>
                                    <br>
                                    <br>
                                    <div>On 07/08/2016 10:38 PM, Dave
                                      May wrote:<br>
                                    </div>
                                    <blockquote type="cite"><br>
                                      <br>
                                      On Saturday, 9 July 2016, frank
                                      <<a moz-do-not-send="true"
                                        href="mailto:hengjiew@uci.edu"
                                        target="_blank">hengjiew@uci.edu</a>>
                                      wrote:<br>
                                      <blockquote class="gmail_quote">Hi
                                        Barry and Dave,<br>
                                        <br>
                                        Thank both of you for the
                                        advice.<br>
                                        <br>
                                        @Barry<br>
                                         I made a mistake in the file
                                         names in the last email. I attached
                                         the correct files this time.<br>
                                         For all three tests,
                                         'Telescope' is used as the
                                         coarse preconditioner.<br>
                                        <br>
                                         == Test1:  Grid: 1536*128*384,  Process Mesh: 48*4*12<br>
                                         Part of the memory usage:  Vector   125   124   3971904   0.<br>
                                                                    Matrix   101   101   9462372   0<br>
                                         <br>
                                         == Test2:  Grid: 1536*128*384,  Process Mesh: 96*8*24<br>
                                         Part of the memory usage:  Vector   125   124   681672   0.<br>
                                                                    Matrix   101   101   1462180   0.<br>
                                        <br>
                                         In theory, the memory usage in
                                         Test1 should be 8 times that of
                                         Test2. In my case, it is about 6
                                         times.<br>
                                        <br>
                                        == Test3: Grid: 3072*256*768, 
                                         Process Mesh: 96*8*24.
                                        Sub-domain per process: 32*32*32<br>
                                        Here I get the out of memory
                                        error.<br>
                                        <br>
                                        I tried to use -mg_coarse
                                        jacobi. In this way, I don't
                                        need to set -mg_coarse_ksp_type
                                        and -mg_coarse_pc_type
                                        explicitly, right?<br>
                                         The linear solver didn't work in
                                         this case; PETSc output some
                                         errors.<br>
                                        <br>
                                        @Dave<br>
                                        In test3, I use only one
                                        instance of 'Telescope'. On the
                                        coarse mesh of 'Telescope', I
                                        used LU as the preconditioner
                                        instead of SVD.<br>
                                         If I set the levels correctly,
                                         then on the last coarse mesh of
                                         MG where it calls 'Telescope',
                                         the sub-domain per process is
                                         2*2*2.<br>
                                        On the last coarse mesh of
                                        'Telescope', there is only one
                                        grid point per process.<br>
                                        I still got the OOM error. The
                                        detailed petsc option file is
                                        attached.</blockquote>
                                      <div><br>
                                      </div>
                                      <div>Do you understand the
                                        expected memory usage for the
                                        particular parallel
                                        LU implementation you are using?
                                        I don't (seriously). Replace LU
                                        with bjacobi and re-run this
                                        test. My point about solver
                                        debugging is still valid. </div>
                                      <div><br>
                                      </div>
                                      <div>And please send the result of
                                        KSPView so we can see what is
                                        actually used in the
                                        computations</div>
                                      <div><br>
                                      </div>
                                      <div>Thanks</div>
                                      <div>  Dave</div>
                                      <div> </div>
                                      <blockquote class="gmail_quote"> <br>
                                        <br>
                                        Thank you so much.<br>
                                        <br>
                                        Frank<br>
                                        <br>
                                        <br>
                                        <br>
                                        On 07/06/2016 02:51 PM, Barry
                                        Smith wrote:<br>
                                        <blockquote class="gmail_quote">
                                          <blockquote
                                            class="gmail_quote"> On Jul
                                            6, 2016, at 4:19 PM, frank
                                            <<a
                                              moz-do-not-send="true"
                                              href="mailto:hengjiew@uci.edu"
                                              target="_blank">hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            <br>
                                            Hi Barry,<br>
                                            <br>
                                             Thank you for your advice.<br>
                                             I tried three tests. In the
                                             1st test, the grid is
                                             3072*256*768 and the process
                                             mesh is 96*8*24.<br>
                                             The linear solver is 'cg',
                                             the preconditioner is 'mg',
                                             and 'telescope' is used as
                                             the preconditioner at the
                                             coarse mesh.<br>
                                            The system gives me the "Out
                                            of Memory" error before the
                                            linear system is completely
                                            solved.<br>
                                             The info from
                                             '-ksp_view_pre' is attached.
                                             It seems to me that the error
                                             occurs when it reaches the
                                             coarse mesh.<br>
                                            <br>
                                             The 2nd test uses a grid of
                                             1536*128*384 and the process
                                             mesh is 96*8*24. The 3rd
                                             test uses the same grid but
                                             a different process mesh,
                                             48*4*12.<br>
                                          </blockquote>
                                              Are you sure this is
                                          right? The total matrix and
                                          vector memory usage goes from
                                          2nd test<br>
                                                      Vector   384   383    8,193,712   0.<br>
                                                      Matrix   103   103   11,508,688   0.<br>
                                           to 3rd test<br>
                                                      Vector   384   383    1,590,520   0.<br>
                                                      Matrix   103   103    3,508,664   0.<br>
                                           that is, the memory usage got
                                           smaller, but if you have only
                                           1/8th the processes and the
                                           same grid it should have
                                           gotten about 8 times bigger.
                                           Did you maybe cut the grid by
                                           a factor of 8 also? If so, that
                                           still doesn't explain it,
                                           because the memory usage
                                           changed by a factor of 5-something
                                           for the vectors and
                                           3-something for the matrices.<br>
                                          <br>
                                          <br>
                                          <blockquote
                                             class="gmail_quote"> The
                                             linear solver and petsc
                                             options in the 2nd and 3rd tests
                                             are the same as in the 1st test.
                                             The linear solver works fine
                                             in both tests.<br>
                                             I attached the memory usage
                                             of the 2nd and 3rd tests.
                                             The memory info is from the
                                             option '-log_summary'. I
                                             tried to use '-memory_info'
                                             as you suggested, but in my
                                             case petsc treated it as an
                                             unused option and output
                                             nothing about the memory. Do
                                             I need to add something to my code
                                             so I can use '-memory_info'?<br>
                                          </blockquote>
                                               Sorry, my mistake: the
                                           option is -memory_view.<br>
                                          <br>
                                             Can you run the one case
                                          with -memory_view and
                                          -mg_coarse jacobi -ksp_max_it
                                          1 (just so it doesn't iterate
                                          forever) to see how much
                                          memory is used without the
                                          telescope? Also run case 2 the
                                          same way.<br>
                                          <br>
                                             Barry<br>
                                          <br>
                                          <br>
                                          <br>
                                          <blockquote
                                            class="gmail_quote"> In both
                                            tests the memory usage is
                                            not large.<br>
                                            <br>
                                            It seems to me that it might
                                            be the 'telescope' 
                                            preconditioner that
                                            allocated a lot of memory
                                            and caused the error in the
                                            1st test.<br>
                                             Is there a way to show
                                             how much memory it
                                             allocated?<br>
                                            <br>
                                            Frank<br>
                                            <br>
                                            On 07/05/2016 03:37 PM,
                                            Barry Smith wrote:<br>
                                            <blockquote
                                              class="gmail_quote">  
                                               Frank,<br>
                                              <br>
                                                   You can run with
                                              -ksp_view_pre to have it
                                              "view" the KSP before the
                                              solve so hopefully it gets
                                              that far.<br>
                                              <br>
                                                     Please run the
                                               problem that does fit with
                                               -memory_info; when the
                                               problem completes it will
                                               show the "high water mark"
                                               for PETSc allocated memory
                                               and total memory used. We
                                              first want to look at
                                              these numbers to see if it
                                              is using more memory than
                                               you expect. You could also
                                               run with, say, half the grid
                                               spacing to see how the
                                               memory usage scales with
                                               the increase in grid
                                               points. Make the runs also
                                              with -log_view and send
                                              all the output from these
                                              options.<br>
                                              <br>
                                                  Barry<br>
                                              <br>
                                              <blockquote
                                                class="gmail_quote"> On
                                                Jul 5, 2016, at 5:23 PM,
                                                frank <<a
                                                  moz-do-not-send="true"
href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>>
                                                wrote:<br>
                                                <br>
                                                Hi,<br>
                                                <br>
                                                 I am using the CG ksp
                                                 solver and multigrid
                                                 preconditioner to solve
                                                 a linear system in
                                                 parallel.<br>
                                                 I chose to use
                                                 'Telescope' as the
                                                 preconditioner on the
                                                 coarse mesh for its good
                                                 performance.<br>
                                                The petsc options file
                                                is attached.<br>
                                                <br>
                                                The domain is a 3d box.<br>
                                                It works well when the
                                                grid is  1536*128*384
                                                and the process mesh is
                                                 96*8*24. When I double
                                                 the size of the grid in
                                                 each direction and keep
                                                 the same process mesh
                                                 and petsc options, I get
                                                 an "out of memory" error
                                                 from the super-cluster
                                                 I am using.<br>
                                                Each process has access
                                                to at least 8G memory,
                                                which should be more
                                                than enough for my
                                                 application. I am sure
                                                 that all the other parts
                                                 of my code (except the
                                                 linear solver) do not
                                                 use much memory, so I
                                                 suspect there is
                                                 something wrong with the
                                                 linear solver.<br>
                                                 The error occurs before
                                                 the linear system is
                                                 completely solved, so I
                                                 don't have the info from
                                                 ksp_view. I am not able
                                                 to reproduce the error
                                                 with a smaller problem
                                                 either.<br>
                                                 In addition, I tried to
                                                 use block jacobi as
                                                 the preconditioner with
                                                 the same grid and the same
                                                 decomposition. The
                                                 linear solver runs
                                                 extremely slowly but there
                                                 is no memory error.<br>
                                                <br>
                                                 How can I diagnose what
                                                 exactly causes the error?<br>
                                                Thank you so much.<br>
                                                <br>
                                                Frank<br>
<petsc_options.txt><br>
                                              </blockquote>
                                            </blockquote>
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
                                          </blockquote>
                                        </blockquote>
                                        <br>
                                      </blockquote>
                                    </blockquote>
                                    <br>
                                  </div>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                          <br>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                  </div>
                </div>
              </div>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>