<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi,<br>
<br>
I want to continue digging into the memory problem here.<br>
I did find a workaround in the past, which is to use fewer cores per
node so that each core has 8G of memory. However, this is inefficient
and expensive, so I hope to locate the place that uses the most memory.<br>
<br>
Here is a brief summary of the tests I did in the past:<br>
> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12<br>
Maximum (over computational time) process memory: total 7.0727e+08<br>
Current process memory: total 7.0727e+08<br>
Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11<br>
Current space PetscMalloc()ed: total 1.8275e+09<br>
<br>
> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24<br>
Maximum (over computational time) process memory: total 5.9431e+09<br>
Current process memory: total 5.9431e+09<br>
Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12<br>
Current space PetscMalloc()ed: total 5.4844e+09<br>
<br>
> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24<br>
The OOM (Out Of Memory) killer of the supercomputer terminated the
job during "KSPSolve".<br>
<br>
I attached the output of ksp_view (the third test's output is from
ksp_view_pre), the memory_view output, and the petsc options.<br>
<br>
In all the tests, each core can access about 2G of memory. In Test3,
there are 4223139840 non-zeros in the matrix, which amounts to about
1.74M per process in double precision. Even allowing some extra memory
for the integer indices, 2G per core should still be more than enough.<br>
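As a sanity check on that arithmetic (a quick sketch; the 18432-rank count comes from the 96*8*24 process mesh, and 32-bit column indices for the AIJ format are an assumption, as in Dave's estimate below):<br>

```python
# Per-rank memory estimate for the fine-grid operator in Test3.
nnz = 4223139840          # allocated non-zeros reported for Test3
ranks = 18432             # 96 * 8 * 24 MPI processes
bytes_per_value = 8       # double precision
bytes_per_index = 4       # assuming 32-bit column indices (AIJ)

values_mb = nnz * bytes_per_value / ranks / 1024**2
indices_mb = nnz * bytes_per_index / ranks / 1024**2

print(f"matrix values : {values_mb:.2f} MB per rank")
print(f"column indices: {indices_mb:.2f} MB per rank")
```

This reproduces the ~1.74M figure above, so the matrix itself is nowhere near the 2G limit.<br>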
<br>
Is there a way to find out which part of KSPSolve uses the most
memory? <br>
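For reference, collecting the diagnostics discussed in this thread amounts to adding a few options to the petsc options file (the descriptions are my paraphrase of Barry's and Dave's suggestions below):<br>

```
# Memory-diagnostic options that come up in this thread:
-memory_view     # high-water mark of PETSc-allocated and total process memory
-log_view        # per-class object counts and memory (the Vector/Matrix lines quoted below)
-info            # verbose tracing of what PETSc is doing; produces a ton of output
-ksp_view_pre    # view the solver before KSPSolve, in case the solve never completes
```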
Thank you so much.<br>
<br>
BTW, there are 4 options that remain unused and I don't understand why
they are omitted:<br>
-mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
-mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
-mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
-mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
<br>
<br>
Regards,<br>
Frank<br>
<br>
<div class="moz-cite-prefix">On 07/13/2016 05:47 PM, Dave May wrote:<br>
</div>
<blockquote
cite="mid:CAJ98EDrRQfspLSv8kOuzVsXzH5bL2dfzdwu0VnhOJM2VbaxkWA@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 14 July 2016 at 01:07, frank <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
<br>
Sorry for the late reply.<br>
Thank you so much for your detailed reply.<br>
<br>
I have a question about the estimation of the memory
usage. There are 4223139840 allocated non-zeros and
18432 MPI processes. Double precision is used. So the
memory per process is:<br>
4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ? <br>
Did I do something wrong here? This seems too small.<br>
</div>
</blockquote>
<div><br>
</div>
<div>No - I totally f***ed it up. You are correct. That'll
teach me for fumbling around with my iphone calculator and
not using my brain. (Note that to convert to MB just
divide by 1e6, not 1024^2 - although I apparently cannot
convert between units correctly....)</div>
<div><br>
</div>
<div>From the PETSc objects associated with the solver, it
looks like it _should_ run with 2GB per MPI rank. Sorry
for my mistake. Possibilities are: somewhere in your usage
of PETSc you've introduced a memory leak; PETSc is doing a
huge over allocation (e.g. as per our discussion of
MatPtAP); or in your application code there are other
objects you have forgotten to log the memory for.</div>
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I am running this job on <a moz-do-not-send="true"
href="https://bluewaters.ncsa.illinois.edu/user-guide"
target="_blank">Blue Waters</a> </div>
</blockquote>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I am using the 7
points FD stencil in 3D. <br>
</div>
</blockquote>
<div><br>
</div>
<div>I thought so on both counts.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I apologize for the stupid mistake I made in computing
the memory per core. My settings mean each core can
access only 2G of memory on average, not the 8G I
mentioned in my previous email. I re-ran the job with 8G
of memory per core on average and there is no "Out Of
Memory" error. I will do more tests to see if there is
still a memory issue.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Ok. I'd still like to know where the memory was being
used since my estimates were off.</div>
<div><br>
</div>
<div><br>
</div>
<div>Thanks,</div>
<div> Dave</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Regards,<br>
Frank
<div>
<div class="h5"><br>
<br>
<br>
<div>On 07/11/2016 01:18 PM, Dave May wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Frank,<br>
<br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 11 July 2016 at
19:14, frank <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote">
<div> Hi Dave,<br>
<br>
I re-ran the test using bjacobi as the
preconditioner on the coarse mesh of
telescope. The grid is 3072*256*768 and the
process mesh is 96*8*24. The petsc
option file is attached.<br>
I still got the "Out Of Memory" error.
The error occurred before the linear
solver finished one step, so I don't
have the full info from ksp_view. The
info from ksp_view_pre is attached.</div>
</blockquote>
<div><br>
</div>
<div>Okay - that is essentially useless
(sorry)<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div> <br>
It seems to me that the error occurred
when the decomposition was going to be
changed.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Based on what information?<br>
</div>
<div>Running with -info would give us more
clues, but will create a ton of output.<br>
</div>
<div>Please try running the case which
failed with -info<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div> I had another test with a grid of
1536*128*384 and the same process mesh
as above. There was no error. The
ksp_view info is attached for
comparison.<br>
Thank you.</div>
</blockquote>
<div><br>
</div>
<div><br>
[3] Here is my crude estimate of your
memory usage. <br>
I'll target the biggest memory hogs only
to get an order of magnitude estimate<br>
<br>
<div>* The Fine grid operator contains
4223139840 non-zeros --> 1.8 GB per
MPI rank assuming double precision.<br>
</div>
<div>The indices for the AIJ could amount
to another 0.3 GB (assuming 32 bit
integers)<br>
</div>
<div><br>
* You use 5 levels of coarsening, so the
other operators should represent
(collectively) <br>
2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~
300 MB per MPI rank on the communicator
with 18432 ranks.<br>
</div>
<div>The coarse grid should consume ~ 0.5
MB per MPI rank on the communicator with
18432 ranks.</div>
<div><br>
* You use a reduction factor of 64,
making the new communicator with 288 MPI
ranks. <br>
PCTelescope will first gather a
temporary matrix associated with your
coarse level operator assuming a comm
size of 288 living on the comm with size
18432. <br>
This matrix will require approximately
0.5 * 64 = 32 MB per core on the 288
ranks. <br>
This matrix is then used to form a new
MPIAIJ matrix on the subcomm, thus
require another 32 MB per rank. <br>
The temporary matrix is now destroyed.<br>
</div>
<div><br>
* Because a DMDA is detected, a
permutation matrix is assembled. <br>
This requires 2 doubles per point in the
DMDA. <br>
Your coarse DMDA contains 92 x 16 x 48
points. <br>
Thus the permutation matrix will require
< 1 MB per MPI rank on the sub-comm.<br>
<br>
</div>
<div>* Lastly, the matrix is permuted.
This uses MatPtAP(), but the resulting
operator will have the same memory
footprint as the unpermuted matrix (32
MB). At any stage in PCTelescope, only 2
operators of size 32 MB are held in
memory when the DMDA is provided.<br>
</div>
<div><br>
</div>
<div>From my rough estimates, the worst
case memory foot print for any given
core, given your options is
approximately <br>
</div>
<div>2100 MB + 300 MB + 32 MB + 32 MB + 1
MB = 2465 MB<br>
</div>
<div>This is way below 8 GB.<br>
<br>
Note this estimate completely ignores:<br>
(1) the memory required for the
restriction operator, <br>
(2) the potential growth in the number
of non-zeros per row due to Galerkin
coarsening (I wished -ksp_view_pre
reported the output from MatView so we
could see the number of non-zeros
required by the coarse level operators)<br>
</div>
<div>(3) all temporary vectors required by
the CG solver, and those required by the
smoothers.<br>
</div>
<div>(4) internal memory allocated by
MatPtAP<br>
</div>
<div>(5) memory associated with IS's used
within PCTelescope<br>
</div>
<div><br>
</div>
So either I am completely off in my
estimates, or you have not carefully
estimated the memory usage of your
application code. Hopefully others might
examine/correct my rough estimates<br>
</div>
<div>
<div><br>
Since I don't have your code I cannot
access the latter.<br>
Since I don't have access to the same
machine you are running on, I think we
need to take a step back.<br>
</div>
<br>
[1] What machine are you running on? Send
me a URL if its available<br>
</div>
<div><br>
[2] What discretization are you using? (I
am guessing a scalar 7 point FD stencil)<br>
</div>
<div>If it's a 7 point FD stencil, we should
be able to examine the memory usage of
your solver configuration using a
standard, light weight existing PETSc
example, run on your machine at the same
scale. <br>
</div>
<div>This would hopefully enable us to
correctly evaluate the actual memory usage
required by the solver configuration you
are using.<br>
</div>
<div><br>
</div>
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<div> </div>
<blockquote class="gmail_quote">
<div><span><br>
<br>
Frank</span>
<div>
<div><br>
<br>
<br>
<br>
<div>On 07/08/2016 10:38 PM, Dave
May wrote:<br>
</div>
<blockquote type="cite"><br>
<br>
On Saturday, 9 July 2016, frank
<<a moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<blockquote class="gmail_quote">Hi
Barry and Dave,<br>
<br>
Thank both of you for the
advice.<br>
<br>
@Barry<br>
I made a mistake in the file
names in last email. I attached
the correct files this time.<br>
For all three tests,
'Telescope' is used as the
coarse preconditioner.<br>
<br>
== Test1: Grid: 1536*128*384,
Process Mesh: 48*4*12<br>
Part of the memory usage:
Vector 125 124
3971904 0.<br>
Matrix 101 101
9462372 0<br>
<br>
== Test2: Grid: 1536*128*384,
Process Mesh: 96*8*24<br>
Part of the memory usage:
Vector 125 124
681672 0.<br>
Matrix 101 101
1462180 0.<br>
<br>
In theory, the memory usage in
Test1 should be 8 times that of
Test2. In my case, it is about 6
times.<br>
<br>
== Test3: Grid: 3072*256*768,
Process Mesh: 96*8*24.
Sub-domain per process: 32*32*32<br>
Here I get the out of memory
error.<br>
<br>
I tried to use -mg_coarse
jacobi. In this way, I don't
need to set -mg_coarse_ksp_type
and -mg_coarse_pc_type
explicitly, right?<br>
The linear solver didn't work in
this case. Petsc output some
errors.<br>
<br>
@Dave<br>
In test3, I use only one
instance of 'Telescope'. On the
coarse mesh of 'Telescope', I
used LU as the preconditioner
instead of SVD.<br>
If I set the levels correctly,
then on the last coarse mesh of
MG, where it calls 'Telescope',
the sub-domain per process is
2*2*2.<br>
On the last coarse mesh of
'Telescope', there is only one
grid point per process.<br>
I still got the OOM error. The
detailed petsc option file is
attached.</blockquote>
<div><br>
</div>
<div>Do you understand the
expected memory usage for the
particular parallel
LU implementation you are using?
I don't (seriously). Replace LU
with bjacobi and re-run this
test. My point about solver
debugging is still valid. </div>
<div><br>
</div>
<div>And please send the result of
KSPView so we can see what is
actually used in the
computations</div>
<div><br>
</div>
<div>Thanks</div>
<div> Dave</div>
<div> </div>
<blockquote class="gmail_quote"> <br>
<br>
Thank you so much.<br>
<br>
Frank<br>
<br>
<br>
<br>
On 07/06/2016 02:51 PM, Barry
Smith wrote:<br>
<blockquote class="gmail_quote">
<blockquote
class="gmail_quote"> On Jul
6, 2016, at 4:19 PM, frank
<<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu"
target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<br>
Hi Barry,<br>
<br>
Thank you for you advice.<br>
I tried three test. In the
1st test, the grid is
3072*256*768 and the process
mesh is 96*8*24.<br>
The linear solver is 'cg'
the preconditioner is 'mg'
and 'telescope' is used as
the preconditioner at the
coarse mesh.<br>
The system gives me the "Out
of Memory" error before the
linear system is completely
solved.<br>
The info from
'-ksp_view_pre' is attached.
It seems to me that the error
occurs when it reaches the
coarse mesh.<br>
<br>
The 2nd test uses a grid of
1536*128*384 and process
mesh is 96*8*24. The 3rd
test uses the same grid but
a different process mesh
48*4*12.<br>
</blockquote>
Are you sure this is
right? The total matrix and
vector memory usage goes from
2nd test<br>
Vector 384
383 8,193,712
0.<br>
Matrix 103
103 11,508,688
0.<br>
to 3rd test<br>
Vector 384
383 1,590,520
0.<br>
Matrix 103
103 3,508,664
0.<br>
that is the memory usage got
smaller but if you have only
1/8th the processes and the
same grid it should have
gotten about 8 times bigger.
Did you maybe cut the grid by
a factor of 8 also? If so that
still doesn't explain it
because the memory usage
changed by a factor of 5
something for the vectors and
3 something for the matrices.<br>
<br>
<br>
<blockquote
class="gmail_quote"> The
linear solver and petsc
options in the 2nd and 3rd
tests are the same as in the
1st test. The linear solver
works fine in both tests.<br>
I attached the memory usage
of the 2nd and 3rd tests.
The memory info is from the
option '-log_summary'. I
tried to use '-momery_info'
as you suggested, but in my
case petsc treated it as an
unused option. It output
nothing about the memory. Do
I need to add something to my
code so I can use '-memory_info'?<br>
</blockquote>
Sorry, my mistake the
option is -memory_view<br>
<br>
Can you run the one case
with -memory_view and
-mg_coarse jacobi -ksp_max_it
1 (just so it doesn't iterate
forever) to see how much
memory is used without the
telescope? Also run case 2 the
same way.<br>
<br>
Barry<br>
<br>
<br>
<br>
<blockquote
class="gmail_quote"> In both
tests the memory usage is
not large.<br>
<br>
It seems to me that it might
be the 'telescope'
preconditioner that
allocated a lot of memory
and caused the error in the
1st test.<br>
Is there is a way to show
how much memory it
allocated?<br>
<br>
Frank<br>
<br>
On 07/05/2016 03:37 PM,
Barry Smith wrote:<br>
<blockquote
class="gmail_quote">
Frank,<br>
<br>
You can run with
-ksp_view_pre to have it
"view" the KSP before the
solve so hopefully it gets
that far.<br>
<br>
Please run the
problem that does fit with
-memory_info when the
problem completes it will
show the "high water mark"
for PETSc allocated memory
and total memory used. We
first want to look at
these numbers to see if it
is using more memory than
you expect. You could also
run with say half the grid
spacing to see how the
memory usage scaled with
the increase in grid
points. Make the runs also
with -log_view and send
all the output from these
options.<br>
<br>
Barry<br>
<br>
<blockquote
class="gmail_quote"> On
Jul 5, 2016, at 5:23 PM,
frank <<a
moz-do-not-send="true"
href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>>
wrote:<br>
<br>
Hi,<br>
<br>
I am using the CG ksp
solver and Multigrid
preconditioner to solve
a linear system in
parallel.<br>
I chose to use the
'Telescope' as the
preconditioner on the
coarse mesh for its good
performance.<br>
The petsc options file
is attached.<br>
<br>
The domain is a 3d box.<br>
It works well when the
grid is 1536*128*384
and the process mesh is
96*8*24. When I double
the size of the grid and
keep the same process
mesh and petsc options,
I get an "out of memory"
error from the
super-cluster I am
using.<br>
Each process has access
to at least 8G of memory,
which should be more
than enough for my
application. I am sure
that all the other parts
of my code (except the
linear solver) do not
use much memory, so I
suspect something is
wrong with the
linear solver.<br>
The error occurs before
the linear system is
completely solved, so I
don't have the info from
ksp_view. I am not able
to reproduce the error
with a smaller problem
either.<br>
In addition, I tried to
use the block jacobi as
the preconditioner with
the same grid and same
decomposition. The
linear solver runs
extremely slow but there
is no memory error.<br>
<br>
How can I diagnose what
exactly cause the error?<br>
Thank you so much.<br>
<br>
Frank<br>
<petsc_options.txt><br>
</blockquote>
</blockquote>
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
</blockquote>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</body>
</html>