Hi,

On 10/04/2016 11:24 AM, Matthew Knepley wrote:
> On Tue, Oct 4, 2016 at 1:13 PM, frank <hengjiew@uci.edu> wrote:
>
>> Hi,
>>
>> This question is a follow-up to the thread "Question about memory usage in Multigrid preconditioner".
>> I used to get an "Out of Memory" (OOM) error when using the CG + Telescope MG solver with 32768 cores. Adding the options "-matrap 0 -matptap_scalable" solved that problem.
>>
>> Then I tested the scalability by solving a 3D Poisson equation for one step. I used one sub-communicator in all the tests. The only differences between the PETSc options across the tests are (1) the pc_telescope_reduction_factor and (2) the number of multigrid levels in the up/down solver. The function KSPSolve is timed. It is rather slow and does not scale at all.
>
> 1) The number of levels cannot be different in the up/down smoothers. Why are you using a / ?

I didn't mean the up/down smoothers. I meant "-pc_mg_levels" and "-mg_coarse_telescope_pc_mg_levels".

> 2) We need to see what solver you actually constructed, so give us the output of -ksp_view
>
> 3) For any performance questions, we need the output of -log_view

I attached the log_view output for all eight runs.
Each file is named by the number of cores and the grid size. For example, log_512_4096.txt is the log_view output of the case using 512^3 grid points and 4096 cores.

I attached only two of the ksp_view outputs, in case too many files become messy. The ksp_view output for the other tests is quite similar; the only difference is the number of MG levels.
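
For reference, each case was launched roughly as below (the launcher, executable and options-file names are placeholders for what I actually use on the machine; -N and -P are my test code's grid size and number of processes per dimension):

  mpiexec -n 4096 ./test_poisson -N 512 -P 16 -options_file petsc_options.txt -ksp_view -log_view > log_512_4096.txt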

> 4) It looks like you are fixing the number of levels as you scale up. This makes the coarse problem much bigger, and is not a scalable way to proceed.
>    Have you looked at the ratio of coarse grid time to level time?

How can I find the ratio?
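Is the way to get it something like adding

  -pc_mg_log

to the options (if I understand the manual correctly, that makes the logging report the multigrid work per level in -log_view), and then comparing the coarse-level numbers against the finer levels? Or can I already read it off the log files I attached?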

> 5) Did you look at the options in this paper: https://arxiv.org/abs/1604.07163

I am going to look at it now.

Thank you.
Frank

>   Thanks,
>
>      Matt
>
>> Test1: 512^3 grid points
>>
>>   Core#    telescope_reduction_factor    MG levels (up/down solver)    Time for KSPSolve (s)
>>   512      8                             4 / 3                         6.2466
>>   4096     64                            5 / 3                         0.9361
>>   32768    64                            4 / 3                         4.8914
>>
>> Test2: 1024^3 grid points
>>
>>   Core#    telescope_reduction_factor    MG levels (up/down solver)    Time for KSPSolve (s)
>>   4096     64                            5 / 4                         3.4139
>>   8192     128                           5 / 4                         2.4196
>>   16384    32                            5 / 3                         5.4150
>>   32768    64                            5 / 3                         5.6067
>>   65536    128                           5 / 3                         6.5219
>>
>> I guess I didn't set the MG levels properly. What would be an efficient way to arrange the MG levels?
>> Also, which preconditioner should I use on the coarse mesh of the second communicator to improve the performance?
>>
>> I attached the test code and the PETSc options file for the 1024^3 cube with 32768 cores.
>>
>> Thank you.
>>
>> Regards,
>> Frank
>>
>> On 09/15/2016 03:35 AM, Dave May wrote:
>>> Hi all,
>>>
>>> The only unexpected memory usage I can see is associated with the call to MatPtAP().
>>> Here is something you can try immediately.
>>> Run your code with the additional options
>>>   -matrap 0 -matptap_scalable
>>>
>>> I didn't realize this before, but the default behaviour of MatPtAP in parallel is actually to explicitly form the transpose of P (i.e. assemble R = P^T) and then compute R.A.P.
>>> You don't want to do this. The option -matrap 0 resolves this issue.
>>>
>>> The implementation of P^T.A.P has two variants.
>>> The scalable implementation (with respect to memory usage) is selected via the second option, -matptap_scalable.
>>>
>>> Try it out - I see a significant memory reduction using these options for particular mesh sizes / partitions.
>>>
>>> I've attached a cleaned up version of the code you sent me.
>>> There were a number of memory leaks and other issues.
>>> The main points are:
>>>   * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}.
>>>   * You should call PetscFinalize(); otherwise the option -log_summary (-log_view) will not display anything once the program has completed.
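
(A note mostly for myself while I clean up my code: below is a minimal sketch of the ordering Dave describes. My real code is Fortran and uses DMDAVecGetArrayF90, so the C calls, the grid sizes and the trivial right-hand side here are only placeholders, not the attached code.)

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Vec            b;
  PetscScalar ***f;
  PetscInt       i, j, k, xs, ys, zs, xm, ym, zm;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Small 3D DMDA, 7-point star stencil of width 1; sizes are placeholders */
  ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 16, 16, 16, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, NULL, &da); CHKERRQ(ierr);
  ierr = DMSetUp(da); CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(da, &b); CHKERRQ(ierr);

  /* Fill the vector through the DMDA array access BEFORE the assembly calls */
  ierr = DMDAVecGetArray(da, b, &f); CHKERRQ(ierr);
  ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm); CHKERRQ(ierr);
  for (k = zs; k < zs + zm; k++)
    for (j = ys; j < ys + ym; j++)
      for (i = xs; i < xs + xm; i++) f[k][j][i] = 1.0;  /* placeholder RHS */
  ierr = DMDAVecRestoreArray(da, b, &f); CHKERRQ(ierr);
  ierr = VecAssemblyBegin(b); CHKERRQ(ierr);
  ierr = VecAssemblyEnd(b); CHKERRQ(ierr);

  /* ... create the KSP, call KSPSolve, etc. ... */

  ierr = VecDestroy(&b); CHKERRQ(ierr);
  ierr = DMDestroy(&da); CHKERRQ(ierr);
  /* PetscFinalize() must be called, otherwise -log_view/-log_summary print nothing */
  ierr = PetscFinalize();
  return ierr;
}
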
>>>
>>> Thanks,
>>>   Dave
>>>
>>> On 15 September 2016 at 08:03, Hengjie Wang <hengjiew@uci.edu> wrote:
>>>> Hi Dave,
>>>>
>>>> Sorry, I should have put more comments in to explain the code.
>>>> The number of processes in each dimension is the same: Px = Py = Pz = P. So is the domain size.
>>>> So if you want to run the code for 512^3 grid points on 16^3 cores, you need to set "-N 512 -P 16" on the command line.
>>>> I added more comments and also fixed an error in the attached code. (The error only affects the accuracy of the solution, not the memory usage.)
>>>>
>>>> Thank you.
>>>> Frank
>>>>
>>>> On 9/14/2016 9:05 PM, Dave May wrote:
>>>>>
>>>>> On Thursday, 15 September 2016, Dave May <dave.mayhem23@gmail.com> wrote:
>>>>>
>>>>>> On Thursday, 15 September 2016, frank <hengjiew@uci.edu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I write a simple code to reproduce the error. I hope this can help to diagnose the problem.
>>>>>>> The code just solves a 3D Poisson equation.
>>>>>>
>>>>>> Why is the stencil width a runtime parameter?? And why is the default value 2? For 7-pnt FD Laplace, you only need a stencil width of 1.
>>>>>>
>>>>>> Was this choice made to mimic something in the real application code?
>>>>>
>>>>> Please ignore - I misunderstood your usage of the param set by -P
>>>>>
>>>>>>> I run the code on a 1024^3 mesh. The process partition is 32 * 32 * 32. That's when I reproduce the OOM error. Each core has about 2G memory.
>>>>>>> I also run the code on a 512^3 mesh with 16 * 16 * 16 processes. The ksp solver works fine.
>>>>>>> I attached the code, ksp_view_pre's output and my petsc option file.
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Frank
>>>>>>>
>>>>>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>>>>>> Hi Barry,
>>>>>>>>
>>>>>>>> I checked. On the supercomputer, I had the option "-ksp_view_pre" but it is not in the file I sent you. I am sorry for the confusion.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> On Friday, September 9, 2016, Barry Smith <bsmith@mcs.anl.gov> wrote:

> On Sep 9, 2016, at 3:11 PM, frank <hengjiew@uci.edu> wrote:
>
> Hi Barry,
>
> I think the first KSP view output is from -ksp_view_pre. Before I submitted the test, I was not sure whether there would be an OOM error or not. So I added both -ksp_view_pre and -ksp_view.

  But the options file you sent specifically does NOT list the -ksp_view_pre so how could it be from that?

  Sorry to be pedantic but I've spent too much time in the past trying to debug from incorrect information and want to make sure that the information I have is correct before thinking. Please recheck exactly what happened. Rerun with the exact input file you emailed if that is needed.

  Barry

>
> Frank
>
>
> On 09/09/2016 12:38 PM, Barry Smith wrote:
>>   Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt has only one KSPView in it? Did you run two different solves in the 2 case but not the one?
>>
>>   Barry
>>
>>
>>
>>> On Sep 9, 2016, at 10:56 AM, frank <hengjiew@uci.edu> wrote:
>>>
>>> Hi,
>>>
>>> I want to continue digging into the memory problem here.
>>> I did find a workaround in the past, which is to use fewer cores per node so that each core has 8G memory. However this is deficient and expensive. I hope to locate the place that uses the most memory.
>>>
>>> Here is a brief summary of the tests I did in the past:
>>>> Test1:  Mesh 1536*128*384  |  Process Mesh 48*4*12
>>> Maximum (over computational time) process memory:          total 7.0727e+08
>>> Current process memory:                                    total 7.0727e+08
>>> Maximum (over computational time) space PetscMalloc()ed:   total 6.3908e+11
>>> Current space PetscMalloc()ed:                             total 1.8275e+09
>>>
>>>> Test2:  Mesh 1536*128*384  |  Process Mesh 96*8*24
>>> Maximum (over computational time) process memory:          total 5.9431e+09
>>> Current process memory:                                    total 5.9431e+09
>>> Maximum (over computational time) space PetscMalloc()ed:   total 5.3202e+12
>>> Current space PetscMalloc()ed:                             total 5.4844e+09
>>>
>>>> Test3:  Mesh 3072*256*768  |  Process Mesh 96*8*24
>>>     The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>>
>>> I attached the output of ksp_view (the third test's output is from ksp_view_pre), memory_view and also the petsc options.
>>>
>>> In all the tests, each core can access about 2G memory. In test3, there are 4223139840 non-zeros in the matrix. This will consume about 1.74M, using double precision. Considering some extra memory used to store the integer indices, 2G memory should still be way enough.
>>>
>>> Is there a way to find out which part of KSPSolve uses the most memory?
>>> Thank you so much.
>>>
>>> BTW, there are 4 options that remain unused and I don't understand why they are omitted:
>>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>
>>>
>>> Regards,
>>> Frank
>>>
>>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>
>>>> On 14 July 2016 at 01:07, frank <hengjiew@uci.edu> wrote:
>>>> Hi Dave,
>>>>
>>>> Sorry for the late reply.
>>>> Thank you so much for your detailed reply.
>>>>
>>>> I have a question about the estimation of the memory usage. There are 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is used. So the memory per process is:
>>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
>>>> Did I do something wrong here? Because this seems too small.
>>>>
>>>> No - I totally f***ed it up. You are correct. That'll teach me for fumbling around with my iphone calculator and not using my brain. (Note that to convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert between units correctly....)
>>>>
>>>> From the PETSc objects associated with the solver, it looks like it _should_ run with 2GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge over-allocation (e.g. as per our discussion of MatPtAP); or in your application code there are other objects you have forgotten to log the memory for.
>>>>
>>>>
>>>>
>>>> I am running this job on Bluewater.
>>>> I am using the 7-point FD stencil in 3D.
>>>>
>>>> I thought so on both counts.
>>>>
>>>> I apologize that I made a stupid mistake in computing the memory per core. My settings meant each core could access only 2G memory on average instead of the 8G I mentioned in a previous email. I re-ran the job with 8G memory per core on average and there is no "Out Of Memory" error. I will do more tests to see if there is still some memory issue.
>>>>
>>>> Ok. I'd still like to know where the memory was being used since my estimates were off.
>>>>
>>>>
>>>> Thanks,
>>>>   Dave
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>>
>>>>
>>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>>> Hi Frank,
>>>>>
>>>>>
>>>>> On 11 July 2016 at 19:14, frank <hengjiew@uci.edu> wrote:
>>>>> Hi Dave,
>>>>>
>>>>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc option file is attached.
>>>>> I still got the "Out Of Memory" error. The error occurred before the linear solver finished one step. So I don't have the full info from ksp_view. The info from ksp_view_pre is attached.
>>>>>
>>>>> Okay - that is essentially useless (sorry)
>>>>>
>>>>> It seems to me that the error occurred when the decomposition was going to be changed.
>>>>>
>>>>> Based on what information?
>>>>> Running with -info would give us more clues, but will create a ton of output.
>>>>> Please try running the case which failed with -info
>>>>>
>>>>> I had another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison.
>>>>> Thank you.
>>>>>
>>>>>
>>>>> [3] Here is my crude estimate of your memory usage.
                                              >>>>> I'll
                                              target the biggest memory
                                              hogs only to get an order
                                              of magnitude estimate<br>
                                              >>>>><br>
                                              >>>>> * The
                                              Fine grid operator
                                              contains 4223139840
                                              non-zeros --> 1.8 GB
                                              per MPI rank assuming
                                              double precision.<br>
                                              >>>>> The
                                              indices for the AIJ could
                                              amount to another 0.3 GB
                                              (assuming 32 bit integers)<br>
                                              >>>>><br>
                                              >>>>> * You
                                              use 5 levels of
                                              coarsening, so the other
                                              operators should represent
                                              (collectively)<br>
                                              >>>>> 2.1 /
                                              8 + 2.1/8^2 + 2.1/8^3 +
                                              2.1/8^4  ~ 300 MB per MPI
                                              rank on the communicator
                                              with 18432 ranks.<br>
                                              >>>>> The
                                              coarse grid should consume
                                              ~ 0.5 MB per MPI rank on
                                              the communicator with
                                              18432 ranks.<br>
                                              >>>>><br>
                                              >>>>> * You
                                              use a reduction factor of
                                              64, making the new
                                              communicator with 288 MPI
                                              ranks.<br>
                                              >>>>>
                                              PCTelescope will first
                                              gather a temporary matrix
                                              associated with your
                                              coarse level operator
                                              assuming a comm size of
                                              288 living on the comm
                                              with size 18432.<br>
                                              >>>>> This
                                              matrix will require
                                              approximately 0.5 * 64 =
                                              32 MB per core on the 288
                                              ranks.<br>
                                              >>>>> This
                                              matrix is then used to
                                              form a new MPIAIJ matrix
                                              on the subcomm, thus
                                              require another 32 MB per
                                              rank.<br>
                                              >>>>> The
                                              temporary matrix is now
                                              destroyed.<br>
                                              >>>>><br>
                                              >>>>> *
                                              Because a DMDA is
                                              detected, a permutation
                                              matrix is assembled.<br>
                                              >>>>> This
                                              requires 2 doubles per
                                              point in the DMDA.<br>
                                              >>>>> Your
                                              coarse DMDA contains 92 x
                                              16 x 48 points.<br>
                                              >>>>> Thus
                                              the permutation matrix
                                              will require < 1 MB per
                                              MPI rank on the sub-comm.<br>
                                              >>>>><br>
                                              >>>>> *
                                              Lastly, the matrix is
                                              permuted. This uses
                                              MatPtAP(), but the
                                              resulting operator will
                                              have the same memory
                                              footprint as the
                                              unpermuted matrix (32 MB).
                                              At any stage in
                                              PCTelescope, only 2
                                              operators of size 32 MB
                                              are held in memory when
                                              the DMDA is provided.<br>
                                              >>>>><br>
                                              >>>>> From
                                              my rough estimates, the
                                              worst case memory foot
                                              print for any given core,
                                              given your options is
                                              approximately<br>
                                              >>>>> 2100
                                              MB + 300 MB + 32 MB + 32
                                              MB + 1 MB  = 2465 MB<br>
                                              >>>>> This
                                              is way below 8 GB.<br>
                                              >>>>><br>
                                              >>>>> Note
                                              this estimate completely
                                              ignores:<br>
                                              >>>>> (1)
                                              the memory required for
                                              the restriction operator,<br>
                                              >>>>> (2)
                                              the potential growth in
                                              the number of non-zeros
                                              per row due to Galerkin
                                              coarsening (I wished
                                              -ksp_view_pre reported the
                                              output from MatView so we
                                              could see the number of
                                              non-zeros required by the
                                              coarse level operators)<br>
                                              >>>>> (3)
                                              all temporary vectors
                                              required by the CG solver,
                                              and those required by the
                                              smoothers.<br>
                                              >>>>> (4)
                                              internal memory allocated
                                              by MatPtAP<br>
                                              >>>>> (5)
                                              memory associated with
                                              IS's used within
                                              PCTelescope<br>
                                              >>>>><br>
                                              >>>>> So
                                              either I am completely off
                                              in my estimates, or you
                                              have not carefully
                                              estimated the memory usage
                                              of your application code.
                                              Hopefully others might
                                              examine/correct my rough
                                              estimates<br>
                                              >>>>><br>
                                              >>>>> Since
                                              I don't have your code I
                                              cannot access the latter.<br>
                                              >>>>> Since
                                              I don't have access to the
                                              same machine you are
                                              running on, I think we
                                              need to take a step back.<br>
                                              >>>>><br>
                                              >>>>> [1]
                                              What machine are you
                                              running on? Send me a URL
                                              if its available<br>
                                              >>>>><br>
                                              >>>>> [2]
                                              What discretization are
                                              you using? (I am guessing
                                              a scalar 7 point FD
                                              stencil)<br>
                                              >>>>> If
                                              it's a 7 point FD stencil,
                                              we should be able to
                                              examine the memory usage
                                              of your solver
                                              configuration using a
                                              standard, light weight
                                              existing PETSc example,
                                              run on your machine at the
                                              same scale.<br>
                                              >>>>> This
                                              would hopefully enable us
                                              to correctly evaluate the
                                              actual memory usage
                                              required by the solver
                                              configuration you are
                                              using.<br>
                                              >>>>><br>
                                              >>>>>
                                              Thanks,<br>
                                              >>>>> 
                                               Dave<br>
                                              >>>>><br>
                                              >>>>><br>
                                              >>>>> Frank<br>
                                              >>>>><br>
                                              >>>>><br>
                                              >>>>><br>
                                              >>>>><br>
                                              >>>>> On
                                              07/08/2016 10:38 PM, Dave
                                              May wrote:<br>
                                              >>>>>><br>
                                              >>>>>>
                                              On Saturday, 9 July 2016,
                                              frank <<a
                                                moz-do-not-send="true">hengjiew@uci.edu</a>>
                                              wrote:<br>
                                              >>>>>>
                                              Hi Barry and Dave,<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Thank both of you for the
                                              advice.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              @Barry<br>
                                              >>>>>> I
                                              made a mistake in the file
                                              names in last email. I
                                              attached the correct files
                                              this time.<br>
                                              >>>>>>
                                              For all the three tests,
                                              'Telescope' is used as the
                                              coarse preconditioner.<br>
                                              >>>>>><br>
>>>>>> == Test1:  Grid: 1536*128*384,  Process Mesh: 48*4*12<br>
>>>>>> Part of the memory usage:  Vector   125   124   3971904   0.<br>
>>>>>>                            Matrix   101   101   9462372   0.<br>
                                              >>>>>><br>
>>>>>> == Test2:  Grid: 1536*128*384,  Process Mesh: 96*8*24<br>
>>>>>> Part of the memory usage:  Vector   125   124   681672    0.<br>
>>>>>>                            Matrix   101   101   1462180   0.<br>
                                              >>>>>><br>
>>>>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.<br>
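As a rough sanity check (assuming per-process memory scales inversely with the number of processes for a fixed global grid), the expected and observed factors from the numbers above are:<br>
\[
\frac{96\cdot 8\cdot 24}{48\cdot 4\cdot 12} = 8,
\qquad
\frac{3971904}{681672} \approx 5.8\ (\text{Vector}),
\qquad
\frac{9462372}{1462180} \approx 6.5\ (\text{Matrix}).
\]<br>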
                                              >>>>>><br>
                                              >>>>>>
                                              == Test3: Grid:
                                              3072*256*768,   Process
                                              Mesh: 96*8*24. Sub-domain
                                              per process: 32*32*32<br>
                                              >>>>>>
                                              Here I get the out of
                                              memory error.<br>
                                              >>>>>><br>
                                              >>>>>> I
                                              tried to use -mg_coarse
                                              jacobi. In this way, I
                                              don't need to set
                                              -mg_coarse_ksp_type and
                                              -mg_coarse_pc_type
                                              explicitly, right?<br>
>>>>>> The linear solver didn't work in this case; PETSc printed some errors.<br>
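A minimal sketch of setting the coarse solver explicitly, using the two option names mentioned above (the 'preonly' coarse KSP type is an assumption, not something taken from the attached file):<br>
-mg_coarse_ksp_type preonly<br>
-mg_coarse_pc_type jacobi<br>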
                                              >>>>>><br>
                                              >>>>>>
                                              @Dave<br>
>>>>>> In test3, I used only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.<br>
>>>>>> If I set the levels correctly, then on the last coarse mesh of MG, where it calls 'Telescope', the sub-domain per process is 2*2*2.<br>
                                              >>>>>>
                                              On the last coarse mesh of
                                              'Telescope', there is only
                                              one grid point per
                                              process.<br>
>>>>>> I still got the OOM error. The detailed PETSc options file is attached.<br>
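A hedged sketch of what such a nested MG + Telescope configuration typically looks like as an options file (the option names follow PETSc's prefix convention for the coarse-level Telescope sub-solver; the numeric values are placeholders, not copied from the attached file):<br>
-ksp_type cg<br>
-pc_type mg<br>
-pc_mg_levels 4<br>
-mg_coarse_pc_type telescope<br>
-mg_coarse_pc_telescope_reduction_factor 64<br>
-mg_coarse_telescope_pc_type mg<br>
-mg_coarse_telescope_pc_mg_levels 3<br>
-mg_coarse_telescope_mg_coarse_ksp_type preonly<br>
-mg_coarse_telescope_mg_coarse_pc_type lu<br>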
                                              >>>>>><br>
                                              >>>>>>
                                              Do you understand the
                                              expected memory usage for
                                              the particular parallel LU
                                              implementation you are
                                              using? I don't
                                              (seriously). Replace LU
                                              with bjacobi and re-run
                                              this test. My point about
                                              solver debugging is still
                                              valid.<br>
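In terms of the sketch above (again assuming that option prefix rather than the actual attached file), the suggested swap would be something like:<br>
-mg_coarse_telescope_mg_coarse_pc_type bjacobi<br>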
                                              >>>>>><br>
>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Thanks<br>
                                              >>>>>> 
                                               Dave<br>
                                              >>>>>><br>
                                              >>>>>><br>
                                              >>>>>>
                                              Thank you so much.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Frank<br>
                                              >>>>>><br>
                                              >>>>>><br>
                                              >>>>>><br>
                                              >>>>>>
                                              On 07/06/2016 02:51 PM,
                                              Barry Smith wrote:<br>
                                              >>>>>>
                                              On Jul 6, 2016, at 4:19
                                              PM, frank <<a
                                                moz-do-not-send="true">hengjiew@uci.edu</a>>
                                              wrote:<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Hi Barry,<br>
                                              >>>>>><br>
>>>>>> Thank you for your advice.<br>
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.<br>
>>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.<br>
                                              >>>>>>
                                              The system gives me the
                                              "Out of Memory" error
                                              before the linear system
                                              is completely solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.<br>
                                              >>>>>><br>
>>>>>> The 2nd test uses a grid of 1536*128*384 and a process mesh of 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.<br>
                                              >>>>>> 
                                                 Are you sure this is
                                              right? The total matrix
                                              and vector memory usage
                                              goes from 2nd test<br>
>>>>>>    Vector   384   383   8,193,712    0.<br>
>>>>>>    Matrix   103   103   11,508,688   0.<br>
                                              >>>>>>
                                              to 3rd test<br>
>>>>>>    Vector   384   383   1,590,520   0.<br>
>>>>>>    Matrix   103   103   3,508,664   0.<br>
>>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of 5 something for the vectors and 3 something for the matrices.<br>
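Working those factors out from the numbers quoted just above:<br>
\[
\frac{8{,}193{,}712}{1{,}590{,}520} \approx 5.2\ (\text{Vector}),
\qquad
\frac{11{,}508{,}688}{3{,}508{,}664} \approx 3.3\ (\text{Matrix}).
\]<br>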
                                              >>>>>><br>
                                              >>>>>><br>
>>>>>> The linear solver and PETSc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.<br>
>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case PETSc treated it as an unused option; it output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?<br>
>>>>>>    Sorry, my mistake: the option is -memory_view.<br>
                                              >>>>>><br>
                                              >>>>>> 
                                                Can you run the one case
                                              with -memory_view and
                                              -mg_coarse jacobi
                                              -ksp_max_it 1 (just so it
                                              doesn't iterate forever)
                                              to see how much memory is
                                              used without the
                                              telescope? Also run case 2
                                              the same way.<br>
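Spelled out as options, that run would look something like the following, where '-mg_coarse_pc_type jacobi' (with its default 'preonly' coarse KSP) is an assumed expansion of the shorthand '-mg_coarse jacobi' above:<br>
-mg_coarse_ksp_type preonly<br>
-mg_coarse_pc_type jacobi<br>
-ksp_max_it 1<br>
-memory_view<br>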
                                              >>>>>><br>
                                              >>>>>> 
                                                Barry<br>
                                              >>>>>><br>
                                              >>>>>><br>
                                              >>>>>><br>
                                              >>>>>>
                                              In both tests the memory
                                              usage is not large.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              It seems to me that it
                                              might be the 'telescope' 
                                              preconditioner that
                                              allocated a lot of memory
                                              and caused the error in
                                              the 1st test.<br>
>>>>>> Is there a way to show how much memory it allocated?<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Frank<br>
                                              >>>>>><br>
                                              >>>>>>
                                              On 07/05/2016 03:37 PM,
                                              Barry Smith wrote:<br>
                                              >>>>>> 
                                                Frank,<br>
                                              >>>>>><br>
                                              >>>>>> 
                                                  You can run with
                                              -ksp_view_pre to have it
                                              "view" the KSP before the
                                              solve so hopefully it gets
                                              that far.<br>
                                              >>>>>><br>
>>>>>>     Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.<br>
                                              >>>>>><br>
                                              >>>>>> 
                                                 Barry<br>
                                              >>>>>><br>
                                              >>>>>>
                                              On Jul 5, 2016, at 5:23
                                              PM, frank <<a
                                                moz-do-not-send="true">hengjiew@uci.edu</a>>
                                              wrote:<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Hi,<br>
                                              >>>>>><br>
                                              >>>>>> I
                                              am using the CG ksp solver
                                              and Multigrid
                                              preconditioner  to solve a
                                              linear system in parallel.<br>
>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.<br>
>>>>>> The PETSc options file is attached.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              The domain is a 3d box.<br>
>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and PETSc options, I get an "out of memory" error from the super-cluster I am using.<br>
>>>>>> Each process has access to at least 8 GB of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory, so I suspect there is something wrong with the linear solver.<br>
>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp_view. I am not able to reproduce the error with a smaller problem either.<br>
>>>>>> In addition, I tried to use block Jacobi as the preconditioner with the same grid and the same decomposition. The linear solver runs extremely slowly, but there is no memory error.<br>
                                              >>>>>><br>
>>>>>> How can I diagnose what exactly causes the error?<br>
                                              >>>>>>
                                              Thank you so much.<br>
                                              >>>>>><br>
                                              >>>>>>
                                              Frank<br>
                                              >>>>>>
                                              <petsc_options.txt><br>
                                              >>>>>>
<ksp_view_pre.txt> <memory_test2.txt> <memory_test3.txt> <petsc_options.txt><br>
                                              >>>>>><br>
                                              >>>>><br>
                                              >>>><br>
                                              >>>
<ksp_view1.txt> <ksp_view2.txt> <ksp_view3.txt> <memory1.txt> <memory2.txt> <petsc_options1.txt> <petsc_options2.txt> <petsc_options3.txt><br>
                                              ><br>
                                              <br>
                                            </blockquote>
                                          </div>
                                        </blockquote>
                                        <br>
                                      </div>
                                    </blockquote>
                                    <div> </div>
                                  </blockquote>
                                </div>
                              </blockquote>
                              <br>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                    <br>
                  </div>
                </blockquote>
                <br>
              </div>
            </blockquote>
          </div>
          <br>
          <br clear="all">
          <div><br>
          </div>
          -- <br>
          <div class="gmail_signature">What most experimenters take for
            granted before they begin their experiments is infinitely
            more interesting than any results to which their experiments
            lead.<br>
            -- Norbert Wiener</div>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>