<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
    <div class="moz-cite-prefix">On 10/04/2016 01:20 PM, Matthew Knepley
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAMYG4GkBQtSpUtgdGixwAD86JgZUY0ZCWy=uS6aD9i7Dnqc6Fg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">On Tue, Oct 4, 2016 at 3:09 PM, frank
            <span dir="ltr"><<a moz-do-not-send="true"
                href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000">
                <div class="m_-8767381834923078048moz-cite-prefix">Hi
                  Dave,<br>
                  <br>
                  Thank you for the reply.<br>
                  What do you mean by the "nested calls to KSPSolve"?<br>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>KSPSolve is called again after redistributing the
              computation.</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    I am still confused. There is only one KSPSolve in my code. <br>
    Do you mean KSPSolve is called again in the sub-communicator? If
    that's the case, even if I put two identical KSPSolve calls in the
    code, the sub-communicator is still going to call KSPSolve, right?<br>
    <br>
    <blockquote
cite="mid:CAMYG4GkBQtSpUtgdGixwAD86JgZUY0ZCWy=uS6aD9i7Dnqc6Fg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000">
                <div class="m_-8767381834923078048moz-cite-prefix"> I
                  tried to call KSPSolve twice, but the the second solve
                  converged in 0 iteration. KSPSolve seems to remember
                  the solution. How can I force both solves start from
                  the same initial guess?<br>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Did you zero the solution vector between solves?
              VecSet(x, 0.0);</div>
            <div><br>
            </div>
            <div>  Matt</div>
            <div> </div>
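            <div>(A minimal sketch of what zeroing the guess between two
              identical solves might look like; ksp, b, x and ierr are
              placeholder names, not taken from Frank's actual code.)</div>
            <pre>
  /* first solve: pays the full setup cost */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  /* reset the initial guess so the second solve repeats the same work */
  ierr = VecSet(x, 0.0);CHKERRQ(ierr);

  /* second solve: setup is already done */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
            </pre>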
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000">
                <div class="m_-8767381834923078048moz-cite-prefix">
                  Thank you.<span class="HOEnZb"><font color="#888888"><br>
                      <br>
                      Frank</font></span>
                  <div>
                    <div class="h5"><br>
                      <br>
                      <br>
                      On 10/04/2016 12:56 PM, Dave May wrote:<br>
                    </div>
                  </div>
                </div>
                <div>
                  <div class="h5">
                    <blockquote type="cite"><br>
                      <br>
                      On Tuesday, 4 October 2016, frank <<a
                        moz-do-not-send="true"
                        href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0 0
                        0 .8ex;border-left:1px #ccc
                        solid;padding-left:1ex">
                        <div bgcolor="#FFFFFF" text="#000000">
                          <p>Hi,</p>
                          This question is a follow-up to the thread
                          "Question about memory usage in Multigrid
                          preconditioner".<br>
                          I used to have the "Out of Memory (OOM)"
                          problem when using the CG+Telescope MG solver
                          with 32768 cores. Adding the "-matrap 0
                          -matptap_scalable" options did solve that
                          problem. <br>
                          <br>
                          Then I tested the scalability by solving a 3D
                          Poisson equation for one step. I used one
                          sub-communicator in all the tests. The
                          differences between the PETSc options in those
                          tests are: (1) the
                          pc_telescope_reduction_factor; (2) the number of
                          multigrid levels in the up/down solver. The
                          function "ksp_solve" is timed. It is kind of
                          slow and doesn't scale at all. <br>
                          <br>
                          Test1: 512^3 grid points<br>
                          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
                          512         8       4 / 3     6.2466<br>
                          4096       64       5 / 3     0.9361<br>
                          32768      64       4 / 3     4.8914<br>
                          <br>
                          Test2: 1024^3 grid points<br>
                          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
                          4096       64       5 / 4     3.4139<br>
                          8192      128       5 / 4     2.4196<br>
                          16384      32       5 / 3     5.4150<br>
                          32768      64       5 / 3     5.6067<br>
                          65536     128       5 / 3     6.5219</div>
                      </blockquote>
                      <div><br>
                      </div>
                      <div>You have to be very careful how you interpret
                        these numbers. Your solver contains nested calls
                        to KSPSolve, and unfortunately as a result the
                        numbers you report include setup time. This will
                        remain true even if you call KSPSetUp on the
                        outermost KSP. </div>
                      <div><br>
                      </div>
                      <div>Your email concerns scalability of the solver
                        application, so let's focus on that issue.</div>
                      <div><br>
                      </div>
                      <div>The only way to clearly separate setup from
                        solve time is to perform two identical solves.
                        The second solve will not require any setup. You
                        should monitor the second solve via a new
                        PetscStage.</div>
                      <div><br>
                      </div>
                      <div>This was what I did in the telescope paper.
                        It was the only way to understand the setup cost
                        (and scaling) cf the solve time (and scaling).</div>
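                      <div><br>
                      </div>
                      <div>(A minimal sketch of wrapping the second,
                        setup-free solve in its own logging stage so that
                        -log_view reports it separately; ksp, b, x and
                        ierr are assumed to already exist and are
                        placeholder names.)</div>
                      <pre>
  PetscLogStage stage2;

  ierr = PetscLogStageRegister("Second Solve", &amp;stage2);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage2);CHKERRQ(ierr);  /* events below are logged in this stage */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);        /* the second, setup-free solve */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
                      </pre>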
                      <div><br>
                      </div>
                      <div>Thanks</div>
                      <div>  Dave</div>
                      <div>
                        <div>
                          <div><br>
                          </div>
                          <div> </div>
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex">
                            <div bgcolor="#FFFFFF" text="#000000"> I
                              guess I didn't set the MG levels properly.
                              What would be the efficient way to arrange
                              the MG levels?<br>
                              Also which preconditionr at the coarse
                              mesh of the 2nd communicator should I use
                              to improve the performance? <br>
                              <br>
                              I attached the test code and the PETSc
                              options file for the 1024^3 cube with
                              32768 cores. <br>
                              <br>
                              Thank you.<br>
                              <br>
                              Regards,<br>
                              Frank<br>
                              <br>
                              <br>
                              <br>
                              <br>
                              <br>
                              <br>
                              <div>On 09/15/2016 03:35 AM, Dave May
                                wrote:<br>
                              </div>
                              <blockquote type="cite">
                                <div dir="ltr">
                                  <div>
                                    <div>
                                      <div>
                                        <div>
                                          <div>Hi all,<br>
                                            <br>
                                          </div>
                                          <div>The only unexpected
                                            memory usage I can see is
                                            associated with the call to
                                            MatPtAP().<br>
                                          </div>
                                          <div>Here is something you can
                                            try immediately.<br>
                                          </div>
                                        </div>
                                        Run your code with the
                                        additional options<br>
                                          -matrap 0 -matptap_scalable<br>
                                        <br>
                                      </div>
                                      <div>I didn't realize this before,
                                        but the default behaviour of
                                        MatPtAP in parallel is actually
                                        to explicitly form the
                                        transpose of P (i.e. assemble R
                                        = P^T) and then compute R.A.P. <br>
                                        You don't want to do this. The
                                        option -matrap 0 resolves this
                                        issue.<br>
                                      </div>
                                      <div><br>
                                      </div>
                                      <div>The implementation of P^T.A.P
                                        has two variants. <br>
                                        The scalable implementation
                                        (with respect to memory usage)
                                        is selected via the second
                                        option -matptap_scalable.</div>
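                                      <div>(For reference, the two
                                        options would simply be added to
                                        the command line or to the
                                        existing options file, e.g.)</div>
                                      <pre>
  # appended to the existing PETSc options file for these runs
  -matrap 0
  -matptap_scalable
                                      </pre>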
                                      <div><br>
                                      </div>
                                      <div>Try it out - I see a
                                        significant memory reduction
                                        using these options for
                                        particular mesh sizes /
                                        partitions.<br>
                                      </div>
                                      <div><br>
                                      </div>
                                      I've attached a cleaned up version
                                      of the code you sent me.<br>
                                    </div>
                                    There were a number of memory leaks
                                    and other issues.<br>
                                  </div>
                                  <div>The main points being<br>
                                  </div>
                                    * You should call
                                  DMDAVecGetArrayF90() before
                                  VecAssembly{Begin,End}<br>
                                    * You should call PetscFinalize(),
                                  otherwise the option -log_summary
                                  (-log_view) will not display anything
                                  once the program has completed.<br>
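                                  (The attached code uses the F90
                                  interface, but the ordering requirement
                                  is the same in any binding; a minimal C
                                  sketch with placeholder contents:)<br>
                                  <pre>
  #include &lt;petscsys.h&gt;

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&amp;argc, &amp;argv, NULL, NULL);if (ierr) return ierr;
    /* ... create the DMDA, Mat, Vec and KSP objects, solve, destroy them ... */
    ierr = PetscFinalize();  /* without this, -log_summary / -log_view prints nothing */
    return ierr;
  }
                                  </pre>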
                                  <div>
                                    <div>
                                      <div><br>
                                        <br>
                                      </div>
                                      <div>Thanks,<br>
                                      </div>
                                      <div>  Dave<br>
                                      </div>
                                      <div>
                                        <div>
                                          <div><br>
                                          </div>
                                        </div>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                                <div class="gmail_extra"><br>
                                  <div class="gmail_quote">On 15
                                    September 2016 at 08:03, Hengjie
                                    Wang <span dir="ltr"><<a
                                        moz-do-not-send="true">hengjiew@uci.edu</a>></span>
                                    wrote:<br>
                                    <blockquote class="gmail_quote"
                                      style="margin:0 0 0
                                      .8ex;border-left:1px #ccc
                                      solid;padding-left:1ex">
                                      <div bgcolor="#FFFFFF"
                                        text="#000000"> Hi Dave,<br>
                                        <br>
                                        Sorry, I should have put more
                                        comments in to explain the code.<br>
                                        The number of processes in each
                                        dimension is the same: Px = Py = Pz
                                        = P. So is the domain size.<br>
                                        So if you want to run the code for
                                        512^3 grid points on 16^3 cores,
                                        you need to set "-N 512 -P 16" on
                                        the command line.<br>
                                        I added more comments and also
                                        fixed an error in the attached
                                        code. (The error only affects the
                                        accuracy of the solution, not the
                                        memory usage.)<br>
                                        <div><br>
                                          Thank you.<span><font
                                              color="#888888"><br>
                                              Frank</font></span>
                                          <div>
                                            <div><br>
                                              <br>
                                              On 9/14/2016 9:05 PM, Dave
                                              May wrote:<br>
                                            </div>
                                          </div>
                                        </div>
                                        <div>
                                          <div>
                                            <blockquote type="cite"><br>
                                              <br>
                                              On Thursday, 15 September
                                              2016, Dave May <<a
                                                moz-do-not-send="true">dave.mayhem23@gmail.com</a>>
                                              wrote:<br>
                                              <blockquote
                                                class="gmail_quote"
                                                style="margin:0 0 0
                                                .8ex;border-left:1px
                                                #ccc
                                                solid;padding-left:1ex"><br>
                                                <br>
                                                On Thursday, 15
                                                September 2016, frank
                                                <<a
                                                  moz-do-not-send="true">hengjiew@uci.edu</a>>
                                                wrote:<br>
                                                <blockquote
                                                  class="gmail_quote"
                                                  style="margin:0 0 0
                                                  .8ex;border-left:1px
                                                  #ccc
                                                  solid;padding-left:1ex">
                                                  <div bgcolor="#FFFFFF"
                                                    text="#000000"> Hi,
                                                    <br>
                                                    <br>
                                                    I wrote a simple code
                                                    to reproduce the
                                                    error. I hope this can
                                                    help diagnose the
                                                    problem.<br>
                                                    The code just solves a
                                                    3D Poisson equation. </div>
                                                </blockquote>
                                                <div><br>
                                                </div>
                                                <div>Why is the stencil
                                                  width a runtime
                                                  parameter?? And why is
                                                  the default value 2?
                                                  For 7-pnt FD Laplace,
                                                  you only need
                                                  a stencil width of 1. </div>
                                                <div><br>
                                                </div>
                                                <div>Was this choice
                                                  made to mimic
                                                  something in the
                                                  real application code?</div>
                                              </blockquote>
                                              <div><br>
                                              </div>
                                              Please ignore - I
                                              misunderstood your usage
                                              of the param set by -P
                                              <div>
                                                <div> </div>
                                                <blockquote
                                                  class="gmail_quote"
                                                  style="margin:0 0 0
                                                  .8ex;border-left:1px
                                                  #ccc
                                                  solid;padding-left:1ex">
                                                  <div> </div>
                                                  <blockquote
                                                    class="gmail_quote"
                                                    style="margin:0 0 0
                                                    .8ex;border-left:1px
                                                    #ccc
                                                    solid;padding-left:1ex">
                                                    <div
                                                      bgcolor="#FFFFFF"
                                                      text="#000000"><br>
                                                      I ran the code on a
                                                      1024^3 mesh. The
                                                      process partition is
                                                      32 * 32 * 32. That's
                                                      when I reproduced the
                                                      OOM error. Each core
                                                      has about 2G of
                                                      memory.<br>
                                                      I also ran the code
                                                      on a 512^3 mesh with
                                                      16 * 16 * 16
                                                      processes. The KSP
                                                      solver works fine. <br>
                                                      I attached the code,
                                                      ksp_view_pre's output
                                                      and my PETSc options
                                                      file.<br>
                                                      <br>
                                                      Thank you.<br>
                                                      Frank<br>
                                                      <div><br>
                                                        On 09/09/2016
                                                        06:38 PM,
                                                        Hengjie Wang
                                                        wrote:<br>
                                                      </div>
                                                      <blockquote
                                                        type="cite">Hi
                                                        Barry, 
                                                        <div><br>
                                                        </div>
                                                        <div>I checked. On
                                                          the supercomputer,
                                                          I had the option
                                                          "-ksp_view_pre"
                                                          but it is not in
                                                          the file I sent
                                                          you. I am sorry
                                                          for the
                                                          confusion.</div>
                                                        <div><br>
                                                        </div>
                                                        <div>Regards,</div>
                                                        <div>Frank<span></span><br>
                                                          <br>
                                                          On Friday,
                                                          September 9,
                                                          2016, Barry
                                                          Smith <<a
                                                          moz-do-not-send="true">bsmith@mcs.anl.gov</a>>
                                                          wrote:<br>
                                                          <blockquote
                                                          class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                                                          > On Sep 9,
                                                          2016, at 3:11
                                                          PM, frank <<a
moz-do-not-send="true">hengjiew@uci.edu</a>> wrote:<br>
                                                          ><br>
                                                          > Hi Barry,<br>
                                                          ><br>
                                                          > I think
                                                          the first KSP
                                                          view output is
                                                          from
                                                          -ksp_view_pre.
                                                          Before I
                                                          submitted the
                                                          test, I was
                                                          not sure
                                                          whether there
                                                          would be OOM
                                                          error or not.
                                                          So I added
                                                          both
                                                          -ksp_view_pre
                                                          and -ksp_view.<br>
                                                          <br>
                                                            But the
                                                          options file
                                                          you sent
                                                          specifically
                                                          does NOT list
                                                          the
                                                          -ksp_view_pre
                                                          so how could
                                                          it be from
                                                          that?<br>
                                                          <br>
                                                             Sorry to be
                                                          pedantic but
                                                          I've spent too
                                                          much time in
                                                          the past
                                                          trying to
                                                          debug from
                                                          incorrect
                                                          information
                                                          and want to
                                                          make sure that
                                                          the
                                                          information I
                                                          have is
                                                          correct before
                                                          thinking.
                                                          Please recheck
                                                          exactly what
                                                          happened.
                                                          Rerun with the
                                                          exact input
                                                          file you
                                                          emailed if
                                                          that is
                                                          needed.<br>
                                                          <br>
                                                             Barry<br>
                                                          <br>
                                                          ><br>
                                                          > Frank<br>
                                                          ><br>
                                                          ><br>
                                                          > On
                                                          09/09/2016
                                                          12:38 PM,
                                                          Barry Smith
                                                          wrote:<br>
                                                          >>   Why does
                                                          ksp_view2.txt have
                                                          two KSP views in
                                                          it while
                                                          ksp_view1.txt has
                                                          only one KSPView
                                                          in it? Did you run
                                                          two different
                                                          solves in the
                                                          second case but
                                                          not the first?<br>
                                                          >><br>
                                                          >> 
                                                           Barry<br>
                                                          >><br>
                                                          >><br>
                                                          >><br>
                                                          >>>
                                                          On Sep 9,
                                                          2016, at 10:56
                                                          AM, frank <<a
moz-do-not-send="true">hengjiew@uci.edu</a>> wrote:<br>
                                                          >>><br>
                                                          >>>
                                                          Hi,<br>
                                                          >>><br>
                                                          >>> I want to
                                                          continue digging
                                                          into the memory
                                                          problem here.<br>
                                                          >>> I did find a
                                                          workaround in the
                                                          past, which is to
                                                          use fewer cores
                                                          per node so that
                                                          each core has 8G
                                                          of memory.
                                                          However, this is
                                                          inefficient and
                                                          expensive. I hope
                                                          to locate the
                                                          place that uses
                                                          the most memory.<br>
                                                          >>><br>
                                                          >>> Here is a
                                                          brief summary of
                                                          the tests I did in
                                                          the past:<br>
>>>> Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12<br>
                                                          >>> Maximum (over computational time) process memory:        total 7.0727e+08<br>
                                                          >>> Current process memory:                                  total 7.0727e+08<br>
                                                          >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11<br>
                                                          >>> Current space PetscMalloc()ed:                           total 1.8275e+09<br>
                                                          >>><br>
>>>> Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24<br>
                                                          >>> Maximum (over computational time) process memory:        total 5.9431e+09<br>
                                                          >>> Current process memory:                                  total 5.9431e+09<br>
                                                          >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12<br>
                                                          >>> Current space PetscMalloc()ed:                           total 5.4844e+09<br>
                                                          >>><br>
>>>> Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24<br>
                                                          >>> OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".<br>
                                                          >>><br>
                                                          >>> I
                                                          attached the
                                                          output of
                                                          ksp_view( the
                                                          third test's
                                                          output is from
                                                          ksp_view_pre
                                                          ), memory_view
                                                          and also the
                                                          petsc options.<br>
                                                          >>><br>
                                                          >>> In all the
                                                          tests, each core
                                                          can access about
                                                          2G of memory. In
                                                          test3, there are
                                                          4223139840
                                                          non-zeros in the
                                                          matrix. This will
                                                          consume about
                                                          1.74M per process,
                                                          using double
                                                          precision.
                                                          Considering some
                                                          extra memory used
                                                          to store the
                                                          integer indices,
                                                          2G of memory
                                                          should still be
                                                          more than enough.<br>
                                                          >>><br>
                                                          >>>
                                                          Is there a way
                                                          to find out
                                                          which part of
                                                          KSPSolve uses
                                                          the most
                                                          memory?<br>
                                                          >>>
                                                          Thank you so
                                                          much.<br>
                                                          >>><br>
                                                          >>> BTW, there are
                                                          4 options that
                                                          remain unused and
                                                          I don't understand
                                                          why they are
                                                          omitted:<br>
                                                          >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
                                                          >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
                                                          >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
                                                          >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
                                                          >>><br>
                                                          >>><br>
                                                          >>>
                                                          Regards,<br>
                                                          >>>
                                                          Frank<br>
                                                          >>><br>
                                                          >>>
                                                          On 07/13/2016
                                                          05:47 PM, Dave
                                                          May wrote:<br>
>>>><br>
>>>> On 14 July 2016 at 01:07, frank <<a
                                                          moz-do-not-send="true">hengjiew@uci.edu</a>>
                                                          wrote:<br>
>>>> Hi Dave,<br>
>>>><br>
>>>> Sorry for the late reply.<br>
>>>> Thank you so much for your detailed reply.<br>
>>>><br>
>>>> I have a question about the estimation of the memory
                                                          usage. There
                                                          are 4223139840
                                                          allocated
                                                          non-zeros and
                                                          18432 MPI
                                                          processes.
                                                          Double
                                                          precision is
                                                          used. So the
                                                          memory per
                                                          process is:<br>
>>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?<br>
>>>> Did I do something wrong here? Because this seems too small.<br>
>>>><br>
>>>> No - I totally f***ed it up. You are correct. That'll
                                                          teach me for
                                                          fumbling
                                                          around with my
                                                          iphone
                                                          calculator and
                                                          not using my
                                                          brain. (Note
                                                          that to
                                                          convert to MB
                                                          just divide by
                                                          1e6, not
                                                          1024^2 -
                                                          although I
                                                          apparently
                                                          cannot convert
                                                          between units
                                                          correctly....)<br>
>>>><br>
>>>> From the PETSc objects associated with the solver, it
                                                          looks like it
                                                          _should_ run
                                                          with 2GB per
                                                          MPI rank.
                                                          Sorry for my
                                                          mistake.
                                                          Possibilities
                                                          are: somewhere
                                                          in your usage
                                                          of PETSc
                                                          you've
                                                          introduced a
                                                          memory leak;
                                                          PETSc is doing
                                                          a huge over
                                                          allocation
                                                          (e.g. as per
                                                          our discussion
                                                          of MatPtAP);
                                                          or in your
                                                          application
                                                          code there are
                                                          other objects
                                                          you have
                                                          forgotten to
                                                          log the memory
                                                          for.<br>
>>>><br>
>>>><br>
>>>><br>
>>>> I am running this job on Blue Waters.<br>
>>>> I am using the 7-point FD stencil in 3D.<br>
>>>><br>
>>>> I thought so on both counts.<br>
>>>><br>
>>>> I apologize that I made a stupid mistake in computing
                                                          the memory per
                                                          core. My settings
                                                          meant that each
                                                          core could access
                                                          only 2G of memory
                                                          on average,
                                                          instead of the 8G
                                                          I mentioned in my
                                                          previous email. I
                                                          re-ran the job
                                                          with 8G of memory
                                                          per core on
                                                          average and there
                                                          was no "Out Of
                                                          Memory" error. I
                                                          will do more tests
                                                          to see if there is
                                                          still some memory
                                                          issue.<br>
>>>><br>
>>>> Ok. I'd still like to know where the memory was being
                                                          used since my
                                                          estimates were
                                                          off.<br>
>>>><br>
>>>><br>
>>>> Thanks,<br>
>>>>   Dave<br>
>>>><br>
>>>> Regards,<br>
>>>> Frank<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On 07/11/2016 01:18 PM, Dave May wrote:<br>
>>>>> Hi Frank,<br>
>>>>><br>
>>>>><br>
>>>>> On 11 July 2016 at 19:14, frank <<a
                                                          moz-do-not-send="true">hengjiew@uci.edu</a>>
                                                          wrote:<br>
>>>>> Hi Dave,<br>
>>>>><br>
>>>>> I re-ran the test using bjacobi as the
                                                          preconditioner on
                                                          the coarse mesh of
                                                          telescope. The
                                                          grid is
                                                          3072*256*768 and
                                                          the process mesh
                                                          is 96*8*24. The
                                                          PETSc options file
                                                          is attached.<br>
>>>>> I still got the "Out Of Memory" error. The error
                                                          occurred
                                                          before the
                                                          linear solver
                                                          finished one
                                                          step. So I
                                                          don't have the
                                                          full info from
                                                          ksp_view. The
                                                          info from
                                                          ksp_view_pre
                                                          is attached.<br>
>>>>><br>
>>>>> Okay - that is essentially useless (sorry)<br>
>>>>><br>
>>>>> It seems to me that the error occurred when the
                                                          decomposition
                                                          was going to
                                                          be changed.<br>
>>>>><br>
>>>>> Based on what information?<br>
>>>>> Running with -info would give us more clues, but
                                                          will create a
                                                          ton of output.<br>
>>>>> Please try running the case which failed with -info<br>
>>>>>  I had another test with a grid of 1536*128*384 and
                                                          the same
                                                          process mesh
                                                          as above.
                                                          There was no
                                                          error. The
                                                          ksp_view info
                                                          is attached
                                                          for
                                                          comparison.<br>
>>>>> Thank you.<br>
>>>>><br>
>>>>><br>
>>>>> [3] Here is my crude estimate of your memory usage.
>>>>> I'll target the biggest memory hogs only to get an order of magnitude estimate.
>>>>>
>>>>> * The fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank assuming double precision.
>>>>> The indices for the AIJ could amount to another 0.3 GB (assuming 32-bit integers).
>>>>>
>>>>> * You use 5 levels of coarsening, so the other operators should represent (collectively)
>>>>> 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the communicator with 18432 ranks.
>>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 18432 ranks.
>>>>>
>>>>> * You use a reduction factor of 64, making the new communicator with 288 MPI ranks.
>>>>> PCTelescope will first gather a temporary matrix associated with your coarse level operator, assuming a comm size of 288, living on the comm with size 18432.
>>>>> This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 ranks.
>>>>> This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus requiring another 32 MB per rank.
>>>>> The temporary matrix is then destroyed.
>>>>>
>>>>> * Because a DMDA is detected, a permutation matrix is assembled.
>>>>> This requires 2 doubles per point in the DMDA.
>>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>>>> Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm (92 x 16 x 48 points x 2 doubles x 8 bytes is only about 1.1 MB in total).
>>>>>
>>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting operator will have the same memory footprint as the unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory when the DMDA is provided.
>>>>>
>>>>> From my rough estimates, the worst case memory footprint for any given core, given your options, is approximately
>>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>>>> This is way below 8 GB.
>>>>>
>>>>> Note this estimate completely ignores:
>>>>> (1) the memory required for the restriction operator,
>>>>> (2) the potential growth in the number of non-zeros per row due to Galerkin coarsening (I wish -ksp_view_pre reported the output from MatView so we could see the number of non-zeros required by the coarse level operators),
>>>>> (3) all temporary vectors required by the CG solver, and those required by the smoothers,
>>>>> (4) internal memory allocated by MatPtAP,
>>>>> (5) memory associated with the IS's used within PCTelescope.
>>>>>
>>>>> So either I am completely off in my estimates, or you have not carefully estimated the memory usage of your application code. Hopefully others might examine/correct my rough estimates.
>>>>>
>>>>> Since I don't have your code I cannot assess the latter.
>>>>> Since I don't have access to the same machine you are running on, I think we need to take a step back.
>>>>>
>>>>> [1] What machine are you running on? Send me a URL if it's available.
>>>>>
>>>>> [2] What discretization are you using? (I am guessing a scalar 7-point FD stencil; the fine operator's 4223139840 non-zeros work out to almost exactly 7 per grid point of the 3072*256*768 grid, which is consistent with that guess.)
>>>>> If it's a 7-point FD stencil, we should be able to examine the memory usage of your solver configuration using a standard, lightweight, existing PETSc example, run on your machine at the same scale.
>>>>> This would hopefully enable us to correctly evaluate the actual memory usage required by the solver configuration you are using.
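>>>>> For instance, something along these lines could be run on your machine (a sketch only: it assumes the KSP tutorial ex45, a 3D 7-point Laplacian on a DMDA, that its grid size can be set with -da_grid_x/y/z, and that your attached options file can be reused unchanged):
>>>>>
>>>>>     mpiexec -n 18432 ./ex45 -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 \
>>>>>         -options_file petsc_options.txt -memory_view -log_view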
>>>>>
>>>>> Thanks,
>>>>>   Dave
>>>>>
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>>>>
>>>>>> On Saturday, 9 July 2016, frank <hengjiew@uci.edu> wrote:
>>>>>> Hi Barry and Dave,
>>>>>>
>>>>>> Thank you both for the advice.
>>>>>>
>>>>>> @Barry
>>>>>> I made a mistake in the file names in the last email. I attached the correct files this time.
>>>>>> For all three tests, 'Telescope' is used as the coarse preconditioner.
>>>>>>
>>>>>> == Test1:  Grid: 1536*128*384,  Process Mesh: 48*4*12
>>>>>> Part of the memory usage:  Vector   125   124   3971904   0.
>>>>>>                            Matrix   101   101   9462372   0.
>>>>>>
>>>>>> == Test2:  Grid: 1536*128*384,  Process Mesh: 96*8*24
>>>>>> Part of the memory usage:  Vector   125   124   681672    0.
>>>>>>                            Matrix   101   101   1462180   0.
>>>>>>
>>>>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.
>>>>>>
>>>>>> == Test3:  Grid: 3072*256*768,  Process Mesh: 96*8*24.  Sub-domain per process: 32*32*32
>>>>>> Here I get the out-of-memory error.
>>>>>>
>>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>>>>> The linear solver didn't work in this case. PETSc output some errors.
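>>>>>> (For reference: "-mg_coarse jacobi" is not itself a complete option name; my reading of that shorthand is the explicit pair below, which would replace the telescope-related coarse-level options. Treat this as a guess rather than a confirmed recipe:
>>>>>>     -mg_coarse_ksp_type preonly
>>>>>>     -mg_coarse_pc_type jacobi
>>>>>> )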
>>>>>>
>>>>>> @Dave
>>>>>> In Test3, I use only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.
>>>>>> If I set the levels correctly, then on the last coarse mesh of MG, where it calls 'Telescope', the sub-domain per process is 2*2*2.
>>>>>> On the last coarse mesh of 'Telescope', there is only one grid point per process.
>>>>>> I still got the OOM error. The detailed petsc options file is attached.
>>>>>>
>>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.
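>>>>>> (A sketch of that swap, assuming the attached options file configures multigrid inside telescope with the nested prefix mg_coarse_telescope_; please check the exact prefix against your own file before using it:
>>>>>>     -mg_coarse_telescope_mg_coarse_ksp_type preonly
>>>>>>     -mg_coarse_telescope_mg_coarse_pc_type bjacobi
>>>>>> )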
>>>>>>
>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.
>>>>>>
>>>>>> Thanks,
>>>>>>   Dave
>>>>>>
>>>>>>
>>>>>> Thank you so much.
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengjiew@uci.edu> wrote:
>>>>>>
>>>>>> Hi Barry,
>>>>>>
>>>>>> Thank you for your advice.
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.
>>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.
>>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.
>>>>>>
>>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.
>>>>>>     Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test
>>>>>>               Vector   384   383    8,193,712   0.
>>>>>>               Matrix   103   103   11,508,688   0.
>>>>>> to the 3rd test
>>>>>>               Vector   384   383    1,590,520   0.
>>>>>>               Matrix   103   103    3,508,664   0.
>>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of 5-something for the vectors and 3-something for the matrices.
>>>>>>
>>>>>>
>>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.
>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case PETSc treated it as an unused option and output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?
>>>>>>     Sorry, my mistake: the option is -memory_view.
>>>>>>
>>>>>>    Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>> In both tests the memory usage is not large.
>>>>>>
>>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test.
>>>>>> Is there a way to show how much memory it allocated?
>>>>>>
>>>>>> Frank
>>>>>>
>>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>>>>    Frank,
>>>>>>
>>>>>>      You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.
>>>>>>
>>>>>>       Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc-allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.
>>>>>>
>>>>>>     Barry
>>>>>>
>>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengjiew@uci.edu> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using the CG ksp solver and a multigrid preconditioner to solve a linear system in parallel.
>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.
>>>>>> The petsc options file is attached.
>>>>>>
>>>>>> The domain is a 3d box.
>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using.
>>>>>> Each process has access to at least 8 GB of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory. So I suspect there is something wrong with the linear solver.
>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp_view. I am not able to reproduce the error with a smaller problem either.
>>>>>> In addition, I tried to use block jacobi as the preconditioner with the same grid and the same decomposition. The linear solver runs extremely slowly but there is no memory error.
>>>>>>
>>>>>> How can I diagnose what exactly caused the error?
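>>>>>> (As suggested in the replies quoted above, re-running with a few extra diagnostic options should help narrow this down; this is just a sketch collecting the options discussed in this thread, not a complete recipe:
>>>>>>     -ksp_view_pre -memory_view -log_view -options_left
>>>>>> )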
>>>>>> Thank you so much.
>>>>>>
>>>>>> Frank
>>>>>> <petsc_options.txt>
>>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>>>
>>>>>
>>>>
>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>
-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener