<br><br>On Friday, 7 October 2016, frank <<a href="mailto:hengjiew@uci.edu">hengjiew@uci.edu</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Dear all,</p>
    <p>Thank you so much for the advice. <br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div dir="ltr">
            <div class="gmail_extra">
              <div class="gmail_quote"><span></span>
                <div>All setup is done in the first solve.</div>
                <span>
                  <div> </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div> ** The time for 1st solve does not scale. <br>
                          In practice, I am solving a variable
                      coefficient  Poisson equation. I need to build the
                      matrix every time step. Therefore, each step is
                      similar to the 1st solve which does not scale. Is
                      there a way I can improve the performance? <br>
                    </div>
                  </blockquote>
                </span></div>
            </div>
          </div>
          <div class="gmail_quote">
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">
                <div class="gmail_extra">
                  <div class="gmail_quote"><span>
                      <div><br>
                      </div>
                    </span>
                    <div>You could use rediscretization instead of
                      Galerkin to produce the coarse operators.</div>
                  </div>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>
              <div>Yes, I can think of one option for improved
                performance, but I cannot tell whether it will be
                beneficial because the logging isn't sufficiently
                fine-grained (and there is no easy way to get the info out of
                petsc). <br>
                <br>
                I use PtAP to repartition the matrix; this could be
                consuming most of the setup time in Telescope with your
                run. Such a repartitioning could be avoided if you
                provided a method to create the operator on the coarse
                levels (what Matt is suggesting). However, this requires
                you to be able to define your coefficients on the coarse
                grid. This will most likely reduce setup time, but your
                coarse-grid operators (now re-discretized) are likely to
                be less effective than those generated via Galerkin
                coarsening.<br>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    Please correct me if I understand this incorrectly: I can define
    my own restriction function and pass it to petsc instead of using
    PtAP.<br>
    If so, what's the interface to do that?</div></blockquote><div><br></div><div>You need to provide a method via KSPSetComputeOperators to your outer KSP.</div><div><br></div><div><a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html</a><br></div><div><br></div><div>This method will get propagated through telescope to the KSP running on the sub-comm.</div><div><br></div><div>Note that this functionality is currently not supported for Fortran. I need to make a small modification to telescope to enable Fortran support.</div><div><br></div><div>Thanks</div><div>  Dave</div><div> </div>
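<div>(A minimal illustrative sketch of such a callback in C, assuming a DMDA is attached to the KSP; the function name, stencil comment, and registration line below are placeholders, not code from this thread.)</div>
<div><pre>
#include <petscksp.h>
#include <petscdmda.h>

/* Placeholder callback: assemble the (variable-coefficient) operator on
   whatever grid/communicator the KSP calling us lives on.               */
static PetscErrorCode ComputeOperators(KSP ksp, Mat A, Mat P, void *ctx)
{
  DM             dm;
  DMDALocalInfo  info;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPGetDM(ksp, &dm);CHKERRQ(ierr);          /* DMDA attached to this KSP */
  ierr = DMDAGetLocalInfo(dm, &info);CHKERRQ(ierr); /* local grid extents        */
  /* ... evaluate the coefficients at this resolution and insert the
     7-point stencil rows with MatSetValuesStencil(P, ...) ...           */
  ierr = MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* Registration on the outer KSP (placeholder variable names):
   ierr = KSPSetComputeOperators(ksp, ComputeOperators, NULL);CHKERRQ(ierr); */
</pre></div>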
    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><br>
     <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">
                <div class="gmail_extra"><span></span>
                  <div class="gmail_quote">
                    <div>Also, you use CG/MG when FMG by itself would
                      probably be faster. Your smoother is likely not
                      strong enough, and you</div>
                    <div>should use something like V(2,2). There is a
                      lot of tuning that is possible, but difficult to
                      automate.</div>
                  </div>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Matt's completely correct. <br>
              If we could automate this in a meaningful manner, we would
              have done so.<br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    I am not as familiar with multigrid as you guys. It would be very
    kind if you could be more specific.<br>
    What does V(2,2) stand for? Is there a strong smoother built into
    petsc that I can try?<br>
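    <br>
    (For reference: V(2,2) denotes a multigrid V-cycle with 2 pre- and 2 post-smoothing sweeps on each level. Below is a sketch of the kind of options Matt is pointing at, written with the plain PCMG prefix; in this telescope setup the actual prefixes, e.g. -mg_coarse_telescope_mg_levels_..., will differ.)<br>
    <pre>
# a V(2,2)-style cycle: 2 pre- and 2 post-smoothing sweeps per level
-pc_mg_cycle_type v
-mg_levels_ksp_type chebyshev
-mg_levels_ksp_max_it 2
-mg_levels_pc_type sor

# full multigrid (FMG) instead of plain V-cycles, as Matt suggested
-pc_mg_type full
    </pre>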
    <br>
    <br>
    Another thing: the vector assembly and scatter take more time as I
    increase the number of cores:<br>
    
    
    
    <br>
     cores#                       4096        8192        16384       32768       65536<br>
     VecAssemblyBegin    298      2.91E+00    2.87E+00    8.59E+00    2.75E+01    2.21E+03<br>
     VecAssemblyEnd      298      3.37E-03    1.78E-03    1.78E-03    5.13E-03    1.99E-03<br>
     VecScatterBegin     76303    3.82E+00    3.01E+00    2.54E+00    4.40E+00    1.32E+00<br>
     VecScatterEnd       76303    3.09E+01    1.47E+01    2.23E+01    2.96E+01    2.10E+01<br>
    <br>
    The above data is produced by solving a constant-coefficient
    Poisson equation with a different rhs for 100 steps. <br>
    As you can see, the time of VecAssemblyBegin increases dramatically
    from 32K cores to 65K.  <br>
    With 65K cores, it takes more time to assemble the rhs than to solve
    the equation.   Is there a way to improve this?<br>
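    <br>
    (Side note: VecAssemblyBegin/End communicate the off-process entries stashed by VecSetValues; if the rhs is filled only through the DMDA array interface, as Dave suggested earlier in this thread, there is nothing to communicate. Below is a minimal C sketch with illustrative names; the Fortran analogue is DMDAVecGetArrayF90.)<br>
    <pre>
#include <petscdmda.h>

/* Placeholder routine: fill the rhs through the DMDA array view so every
   write touches only locally owned entries.                              */
static PetscErrorCode FillRHS(DM da, Vec rhs)
{
  DMDALocalInfo  info;
  PetscScalar    ***f;
  PetscInt       i, j, k;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = DMDAGetLocalInfo(da, &info);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(da, rhs, &f);CHKERRQ(ierr);    /* owned window only */
  for (k = info.zs; k < info.zs + info.zm; k++)
    for (j = info.ys; j < info.ys + info.ym; j++)
      for (i = info.xs; i < info.xs + info.xm; i++)
        f[k][j][i] = 1.0;   /* ... evaluate your rhs at (i,j,k) here ... */
  ierr = DMDAVecRestoreArray(da, rhs, &f);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
    </pre>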
    <br>
    <br>
    Thank you.<br>
    <br>
    Regards,<br>
    Frank  <br>
    <br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">
                <div class="gmail_extra">
                  <div class="gmail_quote">
                    <div>
                      <div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF">
                            <div>
                              <div><br>
                                <br>
                                <br>
                                <br>
                                <br>
                                <div>On
                                  10/04/2016 12:56 PM, Dave May wrote:<br>
                                </div>
                                <blockquote type="cite"><br>
                                  <br>
                                  On Tuesday, 4 October 2016, frank <<a href="javascript:_e(%7B%7D,'cvml','hengjiew@uci.edu');" target="_blank">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                    <div bgcolor="#FFFFFF">
                                      <p>Hi,</p>
                                       This question is a follow-up to the
                                      thread "Question about memory
                                      usage in Multigrid
                                      preconditioner".<br>
                                      I used to have the "Out of
                                      Memory(OOM)" problem when using
                                      the CG+Telescope MG solver with
                                       32768 cores. Adding the "-matrap
                                       0 -matptap_scalable" options did
                                       solve that problem. <br>
                                      <br>
                                      Then I test the scalability by
                                      solving a 3d poisson eqn for 1
                                      step. I used one sub-communicator
                                       in all the tests. The differences
                                       between the petsc options in those
                                       tests are: (1) the
                                       pc_telescope_reduction_factor; (2)
                                       the number of multigrid levels in
                                       the up/down solver. The function
                                       "ksp_solve" is timed. It is kind
                                       of slow and doesn't scale at all.
                                      <br>
                                      <br>
                                       Test1: 512^3 grid points<br>
                                       Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
                                       512       8                              4 / 3                             6.2466<br>
                                       4096      64                             5 / 3                             0.9361<br>
                                       32768     64                             4 / 3                             4.8914<br>
                                       <br>
                                       Test2: 1024^3 grid points<br>
                                       Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
                                       4096      64                             5 / 4                             3.4139<br>
                                       8192      128                            5 / 4                             2.4196<br>
                                       16384     32                             5 / 3                             5.4150<br>
                                       32768     64                             5 / 3                             5.6067<br>
                                       65536     128                            5 / 3                             6.5219</div>
                                  </blockquote>
                                  <div><br>
                                  </div>
                                  <div>You have to be very careful how
                                    you interpret these numbers. Your
                                    solver contains nested calls to
                                    KSPSolve, and unfortunately as a
                                    result the numbers you report
                                    include setup time. This will remain
                                    true even if you call KSPSetUp on
                                    the outermost KSP. </div>
                                  <div><br>
                                  </div>
                                   <div>Your email concerns scalability
                                     of the solver application, so let's
                                     focus on that issue.</div>
                                   <div><br>
                                   </div>
                                   <div>The only way to clearly separate
                                     setup from solve time is to perform
                                     two identical solves. The second
                                     solve will not require any setup.
                                     You should monitor the second solve
                                     via a new PetscLogStage.</div>
                                  <div><br>
                                  </div>
                                  <div>This was what I did in the
                                    telescope paper. It was the only way
                                    to understand the setup cost (and
                                    scaling) cf the solve time (and
                                    scaling).</div>
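                                   <div>(A minimal sketch of the logging-stage idea in C, with illustrative names; the stage makes the second solve appear as its own section in the -log_summary / -log_view output.)</div>
                                   <div><pre>
/* Placeholder fragment: ksp, b, x are assumed to be set up already. */
PetscLogStage  stage;
PetscErrorCode ierr;

ierr = PetscLogStageRegister("SecondSolve", &stage);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* 1st solve: pays all setup     */
ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* 2nd solve: setup already done */
ierr = PetscLogStagePop();CHKERRQ(ierr);
                                   </pre></div>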
                                  <div><br>
                                  </div>
                                  <div>Thanks</div>
                                  <div>  Dave</div>
                                  <div>
                                    <div>
                                      <div><br>
                                      </div>
                                      <div> </div>
                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                         <div bgcolor="#FFFFFF"> I guess
                                           I didn't set the MG levels
                                           properly. What would be an
                                           efficient way to arrange the
                                           MG levels?<br>
                                           Also, which preconditioner at
                                           the coarse mesh of the 2nd
                                           communicator should I use to
                                           improve the performance? <br>
                                          <br>
                                          I attached the test code and
                                          the petsc options file for the
                                          1024^3 cube with 32768 cores.
                                          <br>
                                          <br>
                                          Thank you.<br>
                                          <br>
                                          Regards,<br>
                                          Frank<br>
                                          <br>
                                          <div>On 09/15/2016 03:35 AM,
                                            Dave May wrote:<br>
                                          </div>
                                          <blockquote type="cite">
                                            <div dir="ltr">
                                              <div>
                                                <div>
                                                  <div>
                                                    <div>
                                                       <div>Hi all,<br>
                                                        <br>
                                                      </div>
                                                       <div>The only
                                                        unexpected
                                                        memory usage I
                                                        can see is
                                                        associated with
                                                        the call to
                                                        MatPtAP().<br>
                                                      </div>
                                                      <div>Here is
                                                        something you
                                                        can try
                                                        immediately.<br>
                                                      </div>
                                                    </div>
                                                    Run your code with
                                                    the additional
                                                    options<br>
                                                      -matrap 0
                                                    -matptap_scalable<br>
                                                    <br>
                                                  </div>
                                                  <div>I didn't realize
                                                    this before, but the
                                                    default behaviour of
                                                    MatPtAP in parallel
                                                     is actually to
                                                    explicitly form the
                                                    transpose of P (e.g.
                                                    assemble R = P^T)
                                                    and then compute
                                                    R.A.P. <br>
                                                    You don't want to do
                                                    this. The option
                                                    -matrap 0 resolves
                                                    this issue.<br>
                                                  </div>
                                                  <div><br>
                                                  </div>
                                                  <div>The
                                                    implementation of
                                                    P^T.A.P has two
                                                    variants. <br>
                                                    The scalable
                                                    implementation (with
                                                    respect to memory
                                                    usage) is selected
                                                    via the second
                                                    option
                                                    -matptap_scalable.</div>
                                                  <div><br>
                                                  </div>
                                                  <div>Try it out - I
                                                    see a significant
                                                    memory reduction
                                                    using these options
                                                    for particular mesh
                                                    sizes / partitions.<br>
                                                  </div>
                                                  <div><br>
                                                  </div>
                                                  I've attached a
                                                  cleaned up version of
                                                  the code you sent me.<br>
                                                </div>
                                                There were a number of
                                                memory leaks and other
                                                issues.<br>
                                              </div>
                                              <div>The main points being<br>
                                              </div>
                                                * You should call
                                              DMDAVecGetArrayF90()
                                              before
                                              VecAssembly{Begin,End}<br>
                                                * You should call
                                              PetscFinalize(), otherwise
                                              the option -log_summary
                                              (-log_view) will not
                                              display anything once the
                                              program has completed.<br>
                                              <div>
                                                <div>
                                                  <div><br>
                                                    <br>
                                                  </div>
                                                  <div>Thanks,<br>
                                                  </div>
                                                  <div>  Dave<br>
                                                  </div>
                                                  <div>
                                                    <div>
                                                      <div><br>
                                                      </div>
                                                    </div>
                                                  </div>
                                                </div>
                                              </div>
                                            </div>
                                            <div class="gmail_extra"><br>
                                              <div class="gmail_quote">On
                                                15 September 2016 at
                                                08:03, Hengjie Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>
                                                wrote:<br>
                                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                  <div bgcolor="#FFFFFF">
                                                    Hi Dave,<br>
                                                    <br>
                                                     Sorry, I should have put more comments to explain the code.  <br>
                                                     The number of processes in each dimension is the same: Px = Py = Pz = P. So is the domain size.<br>
                                                     So if you want to run the code for a 512^3 grid on 16^3 cores, you need to set "-N 512 -P 16" in the command line.<br>
                                                     I added more comments and also fixed an error in the attached code. (The error only affects the accuracy of the solution, not the memory usage.) <br>
                                                    <div><br>
                                                      Thank you.<span><font color="#888888"><br>
                                                          Frank</font></span>
                                                      <div>
                                                        <div><br>
                                                          <br>
                                                          On 9/14/2016
                                                          9:05 PM, Dave
                                                          May wrote:<br>
                                                        </div>
                                                      </div>
                                                    </div>
                                                    <div>
                                                      <div>
                                                        <blockquote type="cite"><br>
                                                          <br>
                                                          On Thursday,
                                                          15 September
                                                          2016, Dave May
                                                          <<a>dave.mayhem23@gmail.com</a>>
                                                          wrote:<br>
                                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
                                                          <br>
                                                          On Thursday,
                                                          15 September
                                                          2016, frank
                                                          <<a>hengjiew@uci.edu</a>>
                                                          wrote:<br>
                                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                          <div bgcolor="#FFFFFF">
                                                          Hi, <br>
                                                          <br>
                                                           I wrote a simple code to reproduce the error. I hope this can help to diagnose the problem.<br>
                                                           The code just solves a 3d poisson equation. </div>
                                                          </blockquote>
                                                          <div><br>
                                                          </div>
                                                          <div>Why is
                                                          the stencil
                                                          width a
                                                          runtime
                                                          parameter??
                                                          And why is the
                                                          default value
                                                          2? For 7-pnt
                                                          FD Laplace,
                                                          you only need
                                                          a stencil
                                                          width of 1. </div>
                                                          <div><br>
                                                          </div>
                                                          <div>Was this
                                                          choice made to
                                                          mimic
                                                          something in
                                                          the
                                                          real application
                                                          code?</div>
                                                          </blockquote>
                                                          <div><br>
                                                          </div>
                                                          Please ignore
                                                          - I
                                                          misunderstood your
                                                          usage of the
                                                          param set by
                                                          -P
                                                          <div>
                                                          <div> </div>
                                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                          <div> </div>
                                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                          <div bgcolor="#FFFFFF"><br>
                                                          I run the code
                                                          on a 1024^3
                                                          mesh. The
                                                          process
                                                          partition is
                                                          32 * 32 * 32.
                                                          That's when I
                                                           reproduce the
                                                          OOM error.
                                                          Each core has
                                                          about 2G
                                                          memory.<br>
                                                          I also run the
                                                          code on a
                                                          512^3 mesh
                                                          with 16 * 16 *
                                                          16 processes.
                                                          The ksp solver
                                                          works fine. <br>
                                                          I attached the
                                                          code,
                                                          ksp_view_pre's
                                                          output and my
                                                          petsc option
                                                          file.<br>
                                                          <br>
                                                          Thank you.<br>
                                                          Frank<br>
                                                          <div><br>
                                                          On 09/09/2016
                                                          06:38 PM,
                                                          Hengjie Wang
                                                          wrote:<br>
                                                          </div>
                                                          <blockquote type="cite">Hi
                                                          Barry, 
                                                          <div><br>
                                                          </div>
                                                          <div>I
                                                          checked. On
                                                          the
                                                          supercomputer,
                                                          I had the
                                                          option
                                                          "-ksp_view_pre"
                                                          but it is not
                                                           in the file I sent
                                                          you. I am
                                                          sorry for the
                                                          confusion.</div>
                                                          <div><br>
                                                          </div>
                                                          <div>Regards,</div>
                                                          <div>Frank<span></span><br>
                                                          <br>
                                                          On Friday,
                                                          September 9,
                                                          2016, Barry
                                                          Smith <<a>bsmith@mcs.anl.gov</a>>
                                                          wrote:<br>
                                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
                                                          > On Sep 9,
                                                          2016, at 3:11
                                                          PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                                          ><br>
                                                          > Hi Barry,<br>
                                                          ><br>
                                                          > I think
                                                          the first KSP
                                                          view output is
                                                          from
                                                          -ksp_view_pre.
                                                          Before I
                                                          submitted the
                                                          test, I was
                                                          not sure
                                                          whether there
                                                          would be OOM
                                                          error or not.
                                                          So I added
                                                          both
                                                          -ksp_view_pre
                                                          and -ksp_view.<br>
                                                          <br>
                                                            But the
                                                          options file
                                                          you sent
                                                          specifically
                                                          does NOT list
                                                          the
                                                          -ksp_view_pre
                                                          so how could
                                                          it be from
                                                          that?<br>
                                                          <br>
                                                             Sorry to be
                                                          pedantic but
                                                          I've spent too
                                                          much time in
                                                          the past
                                                          trying to
                                                          debug from
                                                          incorrect
                                                          information
                                                          and want to
                                                          make sure that
                                                          the
                                                          information I
                                                          have is
                                                          correct before
                                                          thinking.
                                                          Please recheck
                                                          exactly what
                                                          happened.
                                                          Rerun with the
                                                          exact input
                                                          file you
                                                          emailed if
                                                          that is
                                                          needed.<br>
                                                          <br>
                                                             Barry<br>
                                                          <br>
                                                          ><br>
                                                          > Frank<br>
                                                          ><br>
                                                          ><br>
                                                          > On
                                                          09/09/2016
                                                          12:38 PM,
                                                          Barry Smith
                                                          wrote:<br>
                                                          >>   Why
                                                          does
                                                          ksp_view2.txt
                                                          have two KSP
                                                          views in it
                                                          while
                                                          ksp_view1.txt
                                                          has only one
                                                          KSPView in it?
                                                           Did you run
                                                           two different
                                                           solves in the
                                                           second case but
                                                           not the first?<br>
                                                          >><br>
                                                          >> 
                                                           Barry<br>
                                                          >><br>
                                                          >><br>
                                                          >><br>
                                                          >>>
                                                          On Sep 9,
                                                          2016, at 10:56
                                                          AM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                                          >>><br>
                                                          >>>
                                                          Hi,<br>
                                                          >>><br>
                                                          >>> I
                                                          want to
                                                          continue
                                                          digging into
                                                          the memory
                                                          problem here.<br>
                                                           >>> I did find a workaround in the past, which is to use fewer cores per node so that each core has 8G memory. However, this is inefficient and expensive. I hope to locate the place that uses the most memory.<br>
                                                          >>><br>
                                                           >>> Here is a brief summary of the tests I did in the past:<br>
>>>> Test1:  Mesh 1536*128*384  |  Process Mesh 48*4*12<br>
>>> Maximum (over computational time) process memory:           total 7.0727e+08<br>
>>> Current process memory:                                      total 7.0727e+08<br>
>>> Maximum (over computational time) space PetscMalloc()ed:     total 6.3908e+11<br>
>>> Current space PetscMalloc()ed:                               total 1.8275e+09<br>
>>><br>
>>>> Test2:  Mesh 1536*128*384  |  Process Mesh 96*8*24<br>
>>> Maximum (over computational time) process memory:           total 5.9431e+09<br>
>>> Current process memory:                                      total 5.9431e+09<br>
>>> Maximum (over computational time) space PetscMalloc()ed:     total 5.3202e+12<br>
>>> Current space PetscMalloc()ed:                               total 5.4844e+09<br>
>>><br>
>>>> Test3:  Mesh 3072*256*768  |  Process Mesh 96*8*24<br>
>>>    OOM( Out Of Memory ) killer of the supercomputer terminated the job during "KSPSolve".<br>
                                                          >>><br>
                                                          >>> I
                                                          attached the
                                                          output of
                                                          ksp_view( the
                                                          third test's
                                                          output is from
                                                          ksp_view_pre
                                                          ), memory_view
                                                          and also the
                                                          petsc options.<br>
                                                          >>><br>
                                                          >>>
                                                          In all the
                                                          tests, each
                                                          core can
                                                          access about
                                                          2G memory. In
                                                          test3, there
                                                          are 4223139840
                                                          non-zeros in
                                                          the matrix.
                                                           This will
                                                           consume about
                                                           1.74 MB per
                                                           process, using
                                                           double
                                                           precision.
                                                           Considering
                                                           some extra
                                                           memory used to
                                                           store integer
                                                           indices, 2G
                                                           memory should
                                                           still be more than
                                                           enough.<br>
                                                          >>><br>
                                                          >>>
                                                          Is there a way
                                                          to find out
                                                          which part of
                                                          KSPSolve uses
                                                          the most
                                                          memory?<br>
                                                          >>>
                                                          Thank you so
                                                          much.<br>
                                                          >>><br>
                                                          >>>
                                                           BTW, there are
                                                           4 options that
                                                           remain unused
                                                           and I don't
                                                           understand why
                                                           they are
                                                           omitted:<br>
>>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly<br>
>>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi<br>
>>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1<br>
>>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson<br>
                                                          >>><br>
                                                          >>><br>
                                                          >>>
                                                          Regards,<br>
                                                          >>>
                                                          Frank<br>
                                                          >>><br>
                                                          >>>
                                                          On 07/13/2016
                                                          05:47 PM, Dave
                                                          May wrote:<br>
>>>><br>
>>>> On 14 July 2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>
                                                          wrote:<br>
>>>> Hi Dave,<br>
>>>><br>
>>>> Sorry for the late reply.<br>
>>>> Thank you so much for your detailed reply.<br>
>>>><br>
>>>> I have a question about the estimation of the memory usage. There are 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is used. So the memory per process is:<br>
>>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?<br>
>>>> Did I do something wrong here? Because this seems too small.<br>
>>>><br>
>>>> No - I totally f***ed it up. You are correct. That'll teach me for fumbling around with my iphone calculator and not using my brain. (Note that to convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert between units correctly....)<br>
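(A quick sanity check on the arithmetic above, as a minimal sketch; the non-zero and rank counts are the ones quoted in this thread, and the snippet only covers the matrix values, not the AIJ index arrays or any other objects.)

#include <stdio.h>

/* Per-rank memory for the fine-grid matrix values only:
   4223139840 allocated non-zeros, double precision, 18432 MPI ranks. */
int main(void)
{
  long long nnz   = 4223139840LL;
  int       ranks = 18432;
  double    bytes = (double)nnz * 8.0;

  printf("%.2f MB per rank\n",  bytes / ranks / 1.0e6);             /* ~1.83 MB  */
  printf("%.2f MiB per rank\n", bytes / ranks / (1024.0 * 1024.0)); /* ~1.75 MiB */
  return 0;
}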
>>>><br>
>>>> From the PETSc objects associated with the solver, it looks like it _should_ run with 2 GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge over-allocation (e.g. as per our discussion of MatPtAP); or in your application code there are other objects you have forgotten to log the memory for.<br>
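(To help rule out that last possibility, here is a minimal sketch of how application code could ask PETSc what it thinks is allocated at a given point, e.g. before and after KSPSolve; the helper name ReportMemory is made up, while PetscMemoryGetCurrentUsage and PetscMallocGetCurrentUsage are the standard PETSc calls.)

#include <petscsys.h>

/* Print the OS resident set size and the bytes currently obtained via
   PetscMalloc at a labelled point. PetscPrintf reports rank 0's values;
   use PetscSynchronizedPrintf to see every rank. */
PetscErrorCode ReportMemory(MPI_Comm comm, const char *label)
{
  PetscLogDouble rss, malloced;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&malloced);CHKERRQ(ierr);
  ierr = PetscPrintf(comm, "[%s] rss %.1f MB, PetscMalloc %.1f MB\n",
                     label, rss/1.0e6, malloced/1.0e6);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}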
>>>><br>
>>>><br>
>>>><br>
>>>> I am running this job on Blue Waters.<br>
>>>> I am using the 7-point FD stencil in 3D.<br>
>>>><br>
>>>> I thought so on both counts.<br>
>>>><br>
>>>> I apologize that I made a stupid mistake in computing the memory per core. My settings let each core access only 2 GB of memory on average, instead of the 8 GB I mentioned in my previous email. I re-ran the job with 8 GB of memory per core on average and there was no "Out Of Memory" error. I will do more tests to see if there is still a memory issue.<br>
>>>><br>
>>>> Ok. I'd still like to know where the memory was being used since my estimates were off.<br>
>>>><br>
>>>><br>
>>>> Thanks,<br>
>>>>   Dave<br>
>>>><br>
>>>> Regards,<br>
>>>> Frank<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On 07/11/2016 01:18 PM, Dave May wrote:<br>
>>>>> Hi Frank,<br>
>>>>><br>
>>>>><br>
>>>>> On 11 July 2016 at 19:14, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
>>>>> Hi Dave,<br>
>>>>><br>
>>>>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc option file is attached.<br>
>>>>> I still got the "Out Of Memory" error. The error occurred before the linear solver finished one step. So I don't have the full info from ksp_view. The info from ksp_view_pre is attached.<br>
>>>>><br>
>>>>> Okay - that is essentially useless (sorry)<br>
>>>>><br>
>>>>> It seems to me that the error occurred when the decomposition was going to be changed.<br>
>>>>><br>
>>>>> Based on what information?<br>
>>>>> Running with -info would give us more clues, but will create a ton of output.<br>
>>>>> Please try running the case which failed with -info<br>
>>>>> I ran another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison.<br>
>>>>> Thank you.<br>
>>>>><br>
>>>>><br>
>>>>> [3] Here is my crude estimate of your memory usage.<br>
>>>>> I'll target the biggest memory hogs only to get an order-of-magnitude estimate.<br>
>>>>><br>
>>>>> * The fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank assuming double precision.<br>
>>>>> The indices for the AIJ could amount to another 0.3 GB (assuming 32 bit integers)<br>
>>>>><br>
>>>>> * You use 5 levels of coarsening, so the other operators should represent (collectively)<br>
>>>>> 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the communicator with 18432 ranks.<br>
>>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 18432 ranks.<br>
>>>>><br>
>>>>> * You use a reduction factor of 64, making the new communicator with 288 MPI ranks.<br>
>>>>> PCTelescope will first gather a temporary matrix associated with your coarse level operator assuming a comm size of 288 living on the comm with size 18432.<br>
>>>>> This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 ranks.<br>
>>>>> This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus requiring another 32 MB per rank.<br>
>>>>> The temporary matrix is now destroyed.<br>
>>>>><br>
>>>>> * Because a DMDA is detected, a permutation matrix is assembled.<br>
>>>>> This requires 2 doubles per point in the DMDA.<br>
>>>>> Your coarse DMDA contains 92 x 16 x 48 points.<br>
>>>>> Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm.<br>
>>>>><br>
>>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting operator will have the same memory footprint as the unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory when the DMDA is provided.<br>
>>>>><br>
>>>>> From my rough estimates, the worst case memory footprint for any given core, given your options, is approximately<br>
>>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB<br>
>>>>> This is way below 8 GB.<br>
>>>>><br>
>>>>> Note this estimate completely ignores:<br>
>>>>> (1) the memory required for the restriction operator,<br>
>>>>> (2) the potential growth in the number of non-zeros per row due to Galerkin coarsening (I wish -ksp_view_pre reported the output from MatView so we could see the number of non-zeros required by the coarse level operators; a small sketch for inspecting this from application code follows after this list),<br>
>>>>> (3) all temporary vectors required by the CG solver, and those required by the smoothers.<br>
>>>>> (4) internal memory allocated by MatPtAP<br>
>>>>> (5) memory associated with IS's used within PCTelescope.<br>
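(Regarding point (2): if the application holds the outer PC, the Galerkin coarse operators can be inspected directly once setup has completed. A minimal sketch, assuming the PC is a PCMG for which PCSetUp has already run; the helper name is made up.)

#include <petscksp.h>

/* Report the allocated/used non-zeros of the operator on every PCMG level
   (level 0 is the coarse grid). Assumes pc is a PCMG that is already set up. */
PetscErrorCode ReportMGLevelNonzeros(PC pc)
{
  PetscInt       nlevels, l;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PCMGGetLevels(pc, &nlevels);CHKERRQ(ierr);
  for (l = 0; l < nlevels; l++) {
    KSP     smoother;
    Mat     A;
    MatInfo info;

    ierr = PCMGGetSmoother(pc, l, &smoother);CHKERRQ(ierr);
    ierr = KSPGetOperators(smoother, &A, NULL);CHKERRQ(ierr);
    ierr = MatGetInfo(A, MAT_GLOBAL_SUM, &info);CHKERRQ(ierr);
    ierr = PetscPrintf(PetscObjectComm((PetscObject)pc),
                       "level %D: nz_allocated %g, nz_used %g\n",
                       l, info.nz_allocated, info.nz_used);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}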
>>>>><br>
>>>>> So either I am completely off in my estimates, or you have not carefully estimated the memory usage of your application code. Hopefully others might examine/correct my rough estimates.<br>
>>>>><br>
>>>>> Since I don't have your code I cannot assess the latter.<br>
>>>>> Since I don't have access to the same machine you are running on, I think we need to take a step back.<br>
>>>>><br>
>>>>> [1] What machine are you running on? Send me a URL if it's available.<br>
>>>>><br>
>>>>> [2] What discretization are you using? (I am guessing a scalar 7-point FD stencil)<br>
>>>>> If it's a 7-point FD stencil, we should be able to examine the memory usage of your solver configuration using a standard, lightweight existing PETSc example, run on your machine at the same scale.<br>
>>>>> This would hopefully enable us to correctly evaluate the actual memory usage required by the solver configuration you are using.<br>
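(For what it's worth: KSP tutorial ex45 in the PETSc source tree solves a 3D Laplacian with a 7-point FD stencil on a DMDA, so something along the lines of
mpiexec -n 18432 ./ex45 -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 <solver options from the attached file>
might serve as that lightweight reproducer; the grid sizes and rank count here are just the ones quoted in this thread, and the exact -da_* option names should be checked against the PETSc version installed on the machine.)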
>>>>><br>
>>>>> Thanks,<br>
>>>>>   Dave<br>
>>>>><br>
>>>>><br>
>>>>> Frank<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> On 07/08/2016 10:38 PM, Dave May wrote:<br>
>>>>>><br>
>>>>>> On Saturday, 9 July 2016, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
>>>>>> Hi Barry and Dave,<br>
>>>>>><br>
>>>>>> Thank both of you for the advice.<br>
>>>>>><br>
>>>>>> @Barry<br>
>>>>>> I made a mistake in the file names in the last email. I attached the correct files this time.<br>
>>>>>> For all three tests, 'Telescope' is used as the coarse preconditioner.<br>
>>>>>><br>
>>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12<br>
>>>>>> Part of the memory usage:  Vector   125   124   3971904   0.<br>
>>>>>>                            Matrix   101   101   9462372   0.<br>
>>>>>><br>
>>>>>> == Test2:   Grid: 1536*128*384,   Process Mesh: 96*8*24<br>
>>>>>> Part of the memory usage:  Vector   125   124   681672    0.<br>
>>>>>>                            Matrix   101   101   1462180   0.<br>
>>>>>><br>
>>>>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.<br>
>>>>>><br>
>>>>>> == Test3:   Grid: 3072*256*768,   Process Mesh: 96*8*24.   Sub-domain per process: 32*32*32<br>
>>>>>> Here I get the out of memory error.<br>
>>>>>><br>
>>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?<br>
>>>>>> The linear solver didn't work in this case. PETSc output some errors.<br>
>>>>>><br>
>>>>>> @Dave<br>
>>>>>> In test3, I use only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.<br>
>>>>>> If I set the levels correctly, then on the last coarse mesh of MG where it calls 'Telescope', the sub-domain per process is 2*2*2.<br>
>>>>>> On the last coarse mesh of 'Telescope', there is only one grid point per process.<br>
>>>>>> I still got the OOM error. The detailed petsc option file is attached.<br>
>>>>>><br>
>>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.<br>
>>>>>><br>
>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.<br>
>>>>>><br>
>>>>>> Thanks<br>
>>>>>>   Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
>>>>>><br>
>>>>>><br>
>>>>>><br>
>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:<br>
>>>>>> On Jul 6, 2016, at 4:19 PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
>>>>>><br>
>>>>>> Hi Barry,<br>
>>>>>><br>
>>>>>> Thank you for you advice.<br>
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.<br>
>>>>>> The linear solver is 'cg', the preconditioner is 'mg' and 'telescope' is used as the preconditioner at the coarse mesh.<br>
>>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.<br>
>>>>>><br>
>>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.<br>
>>>>>>     Are you sure this is right? The total matrix and vector memory usage goes from 2nd test<br>
>>>>>>                Vector   384   383   8,193,712    0.<br>
>>>>>>                Matrix   103   103   11,508,688   0.<br>
>>>>>> to 3rd test<br>
>>>>>>                Vector   384   383   1,590,520    0.<br>
>>>>>>                Matrix   103   103   3,508,664    0.<br>
>>>>>> That is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so that still doesn't explain it because the memory usage changed by a factor of 5 something for the vectors and 3 something for the matrices.<br>
>>>>>><br>
>>>>>><br>
>>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.<br>
>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case PETSc treated it as an unused option. It output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?<br>
>>>>>>     Sorry, my mistake - the option is -memory_view<br>
>>>>>><br>
>>>>>>    Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to