<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>Hi,</p>
    This question is a follow-up to the thread "Question about memory
    usage in Multigrid preconditioner".<br>
    I used to have an "Out of Memory" (OOM) problem when using the
    CG + Telescope MG solver with 32768 cores. Adding the options "-matrap 0
    -matptap_scalable" did solve that problem. <br>
    <br>
    Then I tested the scalability by solving a 3D Poisson equation for
    one step. I used one sub-communicator in all the tests. The only
    differences between the PETSc options in those tests are: (1) the
    pc_telescope_reduction_factor; (2) the number of multigrid levels in
    the up/down solvers. The function KSPSolve is timed. It is quite
    slow and doesn't scale at all. <br>
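    (For reference, a sketch of how each test might have been launched;
    the launcher and binary name below are placeholders, and the -N / -P
    flags follow the convention explained further down this thread, where
    -N 1024 -P 32 means a 1024^3 grid on a 32^3 process mesh.)<br>
    <pre>
# hypothetical launch line for the 32768-core, 1024^3 case
aprun -n 32768 ./test_ksp -N 1024 -P 32 \
    -options_file petsc_options.txt -matrap 0 -matptap_scalable -log_view
</pre>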
    <br>
    Test1: 512^3 grid points<br>
    <pre>
Cores    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
  512                 8                       4 / 3                 6.2466
 4096                64                       5 / 3                 0.9361
32768                64                       4 / 3                 4.8914
</pre>
    <br>
    Test2: 1024^3 grid points<br>
    <pre>
Cores    telescope_reduction_factor    MG levels (up/down)    KSPSolve time (s)
 4096                64                       5 / 4                 3.4139
 8192               128                       5 / 4                 2.4196
16384                32                       5 / 3                 5.4150
32768                64                       5 / 3                 5.6067
65536               128                       5 / 3                 6.5219
</pre>
    <br>
    I guess I didn't set the MG levels properly. What would be an
    efficient way to arrange the MG levels?<br>
    Also, which preconditioner should I use on the coarse mesh of the
    2nd communicator to improve the performance? <br>
    <br>
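    For concreteness, a hedged sketch of how these knobs nest in a PETSc
    options file follows (the level counts and the bjacobi choice are
    illustrative placeholders rather than a recommendation; the names
    follow PETSc's options-prefix convention):<br>
    <pre>
-ksp_type cg
-pc_type mg
# levels of the MG "up" solver on the full communicator
-pc_mg_levels 5
# repartition the coarse problem onto fewer ranks
-mg_coarse_pc_type telescope
-mg_coarse_pc_telescope_reduction_factor 64
# levels of the MG "down" solver on the sub-communicator
-mg_coarse_telescope_pc_type mg
-mg_coarse_telescope_pc_mg_levels 3
# coarse-level solve on the 2nd communicator (the choice asked about above)
-mg_coarse_telescope_mg_coarse_ksp_type preonly
-mg_coarse_telescope_mg_coarse_pc_type bjacobi
</pre>
    <br>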
    I attached the test code and the PETSc options file for the 1024^3
    cube with 32768 cores. <br>
    <br>
    Thank you.<br>
    <br>
    Regards,<br>
    Frank<br>
    <br>
    <div class="moz-cite-prefix">On 09/15/2016 03:35 AM, Dave May wrote:<br>
    </div>
    <blockquote
cite="mid:CAJ98EDpYAVvyJQW3bk_QaiJLQhmEgGn6rz8LYPDDodAh1oErcA@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>
                <div>Hi all,<br>
                  <br>
                </div>
                <div>The only unexpected memory usage I can see is
                  associated with the call to MatPtAP().<br>
                </div>
                <div>Here is something you can try immediately.<br>
                </div>
              </div>
              Run your code with the additional options<br>
                -matrap 0 -matptap_scalable<br>
              <br>
            </div>
            <div>I didn't realize this before, but the default behaviour
              of MatPtAP in parallel is actually to explicitly form
              the transpose of P (i.e. assemble R = P^T) and then
              compute R.A.P. <br>
              You don't want to do this. The option -matrap 0 resolves
              this issue.<br>
            </div>
            <div><br>
            </div>
            <div>The implementation of P^T.A.P has two variants. <br>
              The scalable implementation (with respect to memory usage)
              is selected via the second option -matptap_scalable.</div>
            <div><br>
            </div>
            <div>Try it out - I see a significant memory reduction using
              these options for particular mesh sizes / partitions.<br>
            </div>
            <div><br>
            </div>
            I've attached a cleaned up version of the code you sent me.<br>
          </div>
          There were a number of memory leaks and other issues.<br>
        </div>
        <div>The main points being<br>
        </div>
          * You should call DMDAVecGetArrayF90() before
        VecAssembly{Begin,End}.<br>
          * You should call PetscFinalize(); otherwise the option
        -log_summary (-log_view) will not display anything once the
        program has completed.<br>
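        To make the two points above concrete, here is a minimal Fortran
        sketch of the intended ordering (a sketch only, written against a
        recent PETSc Fortran interface; the 8^3 DMDA size and all names
        are illustrative and not taken from the attached code):<br>
        <pre>
program assembly_order
#include <petsc/finclude/petscvec.h>
#include <petsc/finclude/petscdmda.h>
  use petscvec
  use petscdmda
  implicit none
  DM                   :: da
  Vec                  :: x
  PetscScalar, pointer :: xa(:,:,:)
  PetscErrorCode       :: ierr
  PetscInt             :: i, j, k, xs, ys, zs, xm, ym, zm

  call PetscInitialize(PETSC_NULL_CHARACTER, ierr)
  ! Illustrative 8^3 DMDA with a 7-point (star) stencil of width 1
  call DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, &
       DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 8, 8, 8, PETSC_DECIDE,        &
       PETSC_DECIDE, PETSC_DECIDE, 1, 1, PETSC_NULL_INTEGER,              &
       PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, da, ierr)
  call DMSetUp(da, ierr)   ! required by recent PETSc versions
  call DMCreateGlobalVector(da, x, ierr)

  ! Point 1: get/restore the F90 array BEFORE VecAssembly{Begin,End}
  call DMDAVecGetArrayF90(da, x, xa, ierr)
  call DMDAGetCorners(da, xs, ys, zs, xm, ym, zm, ierr)
  do k = zs, zs + zm - 1
    do j = ys, ys + ym - 1
      do i = xs, xs + xm - 1
        xa(i, j, k) = 1.0   ! fill local entries here
      end do
    end do
  end do
  call DMDAVecRestoreArrayF90(da, x, xa, ierr)
  call VecAssemblyBegin(x, ierr)
  call VecAssemblyEnd(x, ierr)

  call VecDestroy(x, ierr)
  call DMDestroy(da, ierr)
  ! Point 2: without PetscFinalize(), -log_summary/-log_view print nothing
  call PetscFinalize(ierr)
end program assembly_order
</pre>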
        <div>
          <div>
            <div><br>
              <br>
            </div>
            <div>Thanks,<br>
            </div>
            <div>  Dave<br>
            </div>
            <div>
              <div>
                <div><br>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On 15 September 2016 at 08:03, Hengjie
          Wang <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
              <br>
              Sorry, I should have put more comments in to explain the
              code.<br>
              The number of processes in each dimension is the same: Px =
              Py = Pz = P. So is the domain size.<br>
              So if you want to run the code for 512^3 grid points on
              16^3 cores, you need to set "-N 512 -P 16" on the command
              line.<br>
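              For instance (assuming a generic MPI launcher and a
              hypothetical binary name):<br>
              <pre>mpiexec -n 4096 ./test_ksp -N 512 -P 16 -options_file petsc_options.txt</pre>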
              I added more comments and also fixed an error in the attached
              code. (The error only affects the accuracy of the solution,
              not the memory usage.) <br>
              <div><br>
                Thank you.<span class="HOEnZb"><font color="#888888"><br>
                    Frank</font></span>
                <div>
                  <div class="h5"><br>
                    <br>
                    On 9/14/2016 9:05 PM, Dave May wrote:<br>
                  </div>
                </div>
              </div>
              <div>
                <div class="h5">
                  <blockquote type="cite"><br>
                    <br>
                    On Thursday, 15 September 2016, Dave May <<a
                      moz-do-not-send="true"
                      href="mailto:dave.mayhem23@gmail.com"
                      target="_blank">dave.mayhem23@gmail.com</a>>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                      <br>
                      On Thursday, 15 September 2016, frank <<a
                        moz-do-not-send="true">hengjiew@uci.edu</a>>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0 0
                        0 .8ex;border-left:1px #ccc
                        solid;padding-left:1ex">
                        <div bgcolor="#FFFFFF" text="#000000"> Hi, <br>
                          <br>
                          I wrote a simple code to reproduce the error.
                          I hope this can help to diagnose the problem.<br>
                          The code just solves a 3D Poisson equation. </div>
                      </blockquote>
                      <div><br>
                      </div>
                      <div>Why is the stencil width a runtime
                        parameter?? And why is the default value 2? For
                        7-pnt FD Laplace, you only need a stencil width
                        of 1. </div>
                      <div><br>
                      </div>
                      <div>Was this choice made to mimic something in
                        the real application code?</div>
                    </blockquote>
                    <div><br>
                    </div>
                    Please ignore - I misunderstood your usage of the
                    param set by -P
                    <div>
                      <div> </div>
                      <blockquote class="gmail_quote" style="margin:0 0
                        0 .8ex;border-left:1px #ccc
                        solid;padding-left:1ex">
                        <div> </div>
                        <blockquote class="gmail_quote" style="margin:0
                          0 0 .8ex;border-left:1px #ccc
                          solid;padding-left:1ex">
                          <div bgcolor="#FFFFFF" text="#000000"><br>
                            I ran the code on a 1024^3 mesh. The process
                            partition is 32 * 32 * 32. That's when I
                            reproduced the OOM error. Each core has
                            about 2G of memory.<br>
                            I also ran the code on a 512^3 mesh with 16
                            * 16 * 16 processes. The KSP solver works
                            fine. <br>
                            I attached the code, ksp_view_pre's output
                            and my petsc option file.<br>
                            <br>
                            Thank you.<br>
                            Frank<br>
                            <div><br>
                              On 09/09/2016 06:38 PM, Hengjie Wang
                              wrote:<br>
                            </div>
                            <blockquote type="cite">Hi Barry, 
                              <div><br>
                              </div>
                              <div>I checked. On the supercomputer, I
                                had the option "-ksp_view_pre" but it is
                                not in the file I sent you. I am sorry for
                                the confusion.</div>
                              <div><br>
                              </div>
                              <div>Regards,</div>
                              <div>Frank<span></span><br>
                                <br>
                                On Friday, September 9, 2016, Barry
                                Smith <<a moz-do-not-send="true">bsmith@mcs.anl.gov</a>>
                                wrote:<br>
                                <blockquote class="gmail_quote"
                                  style="margin:0 0 0
                                  .8ex;border-left:1px #ccc
                                  solid;padding-left:1ex"><br>
                                  > On Sep 9, 2016, at 3:11 PM, frank
                                  <<a moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  ><br>
                                  > Hi Barry,<br>
                                  ><br>
                                  > I think the first KSP view output
                                  is from -ksp_view_pre. Before I
                                  submitted the test, I was not sure
                                  whether there would be an OOM error or
                                  not. So I added both -ksp_view_pre and
                                  -ksp_view.<br>
                                  <br>
                                    But the options file you sent
                                  specifically does NOT list the
                                  -ksp_view_pre so how could it be from
                                  that?<br>
                                  <br>
                                     Sorry to be pedantic but I've spent
                                  too much time in the past trying to
                                  debug from incorrect information and
                                  want to make sure that the information
                                  I have is correct before thinking.
                                  Please recheck exactly what happened.
                                  Rerun with the exact input file you
                                  emailed if that is needed.<br>
                                  <br>
                                     Barry<br>
                                  <br>
                                  ><br>
                                  > Frank<br>
                                  ><br>
                                  ><br>
                                  > On 09/09/2016 12:38 PM, Barry
                                  Smith wrote:<br>
                                  >>   Why does ksp_view2.txt have
                                  two KSP views in it while
                                  ksp_view1.txt has only one KSPView in
                                  it? Did you run two different solves
                                  in the 2nd case but not the 1st?<br>
                                  >><br>
                                  >>   Barry<br>
                                  >><br>
                                  >><br>
                                  >><br>
                                  >>> On Sep 9, 2016, at 10:56
                                  AM, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>><br>
                                  >>> Hi,<br>
                                  >>><br>
                                  >>> I want to continue
                                  digging into the memory problem here.<br>
                                  >>> I did find a workaround
                                  in the past, which is to use fewer
                                  cores per node so that each core has
                                  8G of memory. However, this is inefficient
                                  and expensive. I hope to locate the
                                  place that uses the most memory.<br>
                                  >>><br>
                                  >>> Here is a brief summary
                                  of the tests I did in the past:<br>
                                  >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12<br>
                                  >>> Maximum (over computational time) process memory: total 7.0727e+08<br>
                                  >>> Current process memory: total 7.0727e+08<br>
                                  >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11<br>
                                  >>> Current space PetscMalloc()ed: total 1.8275e+09<br>
                                  >>><br>
                                  >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24<br>
                                  >>> Maximum (over computational time) process memory: total 5.9431e+09<br>
                                  >>> Current process memory: total 5.9431e+09<br>
                                  >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12<br>
                                  >>> Current space PetscMalloc()ed: total 5.4844e+09<br>
                                  >>><br>
                                  >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24<br>
                                  >>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".<br>
                                  >>><br>
                                  >>> I attached the output of
                                  ksp_view (the third test's output is
                                  from ksp_view_pre), memory_view, and
                                  also the PETSc options.<br>
                                  >>><br>
                                  >>> In all the tests, each
                                  core can access about 2G of memory. In
                                  Test3, there are 4223139840 non-zeros
                                  in the matrix. These will consume about
                                  1.74 MB per process, using double
                                  precision. Even allowing some extra
                                  memory to store the integer indices,
                                  2G of memory should still be more than
                                  enough.<br>
                                  >>><br>
                                  >>> Is there a way to find
                                  out which part of KSPSolve uses the
                                  most memory?<br>
                                  >>> Thank you so much.<br>
                                  >>><br>
                                  >>> BTW, there are 4 options
                                  that remain unused and I don't understand
                                  why they are omitted:<br>
                                  >>>
                                  -mg_coarse_telescope_mg_coarse_ksp_type
                                  value: preonly<br>
                                  >>>
                                  -mg_coarse_telescope_mg_coarse_pc_type
                                  value: bjacobi<br>
                                  >>>
                                  -mg_coarse_telescope_mg_levels_ksp_max_it
                                  value: 1<br>
                                  >>>
                                  -mg_coarse_telescope_mg_levels_ksp_type
                                  value: richardson<br>
                                  >>><br>
                                  >>><br>
                                  >>> Regards,<br>
                                  >>> Frank<br>
                                  >>><br>
                                  >>> On 07/13/2016 05:47 PM,
                                  Dave May wrote:<br>
                                  >>>><br>
                                  >>>> On 14 July 2016 at
                                  01:07, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>>> Hi Dave,<br>
                                  >>>><br>
                                  >>>> Sorry for the late
                                  reply.<br>
                                  >>>> Thank you so much for
                                  your detailed reply.<br>
                                  >>>><br>
                                  >>>> I have a question
                                  about the estimation of the memory
                                  usage. There are 4223139840 allocated
                                  non-zeros and 18432 MPI processes.
                                  Double precision is used. So the
                                  memory per process is:<br>
                                  >>>>   4223139840 * 8 bytes
                                  / 18432 / 1024 / 1024 = 1.74 MB?<br>
                                  >>>> Did I do something wrong
                                  here? Because this seems too small.<br>
                                  >>>><br>
                                  >>>> No - I totally f***ed
                                  it up. You are correct. That'll teach
                                  me for fumbling around with my iphone
                                  calculator and not using my brain.
                                  (Note that to convert to MB just
                                  divide by 1e6, not 1024^2 - although I
                                  apparently cannot convert between
                                  units correctly....)<br>
                                  >>>><br>
                                  >>>> From the PETSc
                                  objects associated with the solver, it
                                  looks like it _should_ run with 2GB
                                  per MPI rank. Sorry for my mistake.
                                  Possibilities are: somewhere in your
                                  usage of PETSc you've introduced a
                                  memory leak; PETSc is doing a huge
                                  over-allocation (e.g. as per our
                                  discussion of MatPtAP); or in your
                                  application code there are other
                                  objects you have forgotten to log the
                                  memory for.<br>
                                  >>>><br>
                                  >>>><br>
                                  >>>><br>
                                  >>>> I am running this job
                                  on Blue Waters.<br>
                                  >>>> I am using the 7-point
                                  FD stencil in 3D.<br>
                                  >>>><br>
                                  >>>> I thought so on both
                                  counts.<br>
                                  >>>><br>
                                  >>>> I apologize that I
                                  made a stupid mistake in computing the
                                  memory per core. My settings meant
                                  each core could access only 2G of memory
                                  on average, instead of the 8G I
                                  mentioned in my previous email. I re-ran
                                  the job with 8G of memory per core on
                                  average and there is no "Out Of
                                  Memory" error. I will do more tests to
                                  see if there is still a memory
                                  issue.<br>
                                  >>>><br>
                                  >>>> Ok. I'd still like to
                                  know where the memory was being used
                                  since my estimates were off.<br>
                                  >>>><br>
                                  >>>><br>
                                  >>>> Thanks,<br>
                                  >>>>   Dave<br>
                                  >>>><br>
                                  >>>> Regards,<br>
                                  >>>> Frank<br>
                                  >>>><br>
                                  >>>><br>
                                  >>>><br>
                                  >>>> On 07/11/2016 01:18
                                  PM, Dave May wrote:<br>
                                  >>>>> Hi Frank,<br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>> On 11 July 2016
                                  at 19:14, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>>>> Hi Dave,<br>
                                  >>>>><br>
                                  >>>>> I re-ran the test
                                  using bjacobi as the preconditioner on
                                  the coarse mesh of telescope. The grid
                                  is 3072*256*768 and the process mesh is
                                  96*8*24. The PETSc options file is
                                  attached.<br>
                                  >>>>> I still got the
                                  "Out Of Memory" error. The error
                                  occurred before the linear solver
                                  finished one step, so I don't have the
                                  full info from ksp_view. The info from
                                  ksp_view_pre is attached.<br>
                                  >>>>><br>
                                  >>>>> Okay - that is
                                  essentially useless (sorry)<br>
                                  >>>>><br>
                                  >>>>> It seems to me
                                  that the error occurred when the
                                  decomposition was about to be changed.<br>
                                  >>>>><br>
                                  >>>>> Based on what
                                  information?<br>
                                  >>>>> Running with
                                  -info would give us more clues, but
                                  will create a ton of output.<br>
                                  >>>>> Please try
                                  running the case which failed with
                                  -info<br>
                                  >>>>>  I had another
                                  test with a grid of 1536*128*384 and
                                  the same process mesh as above. There
                                  was no error. The ksp_view info is
                                  attached for comparison.<br>
                                  >>>>> Thank you.<br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>> [3] Here is my
                                  crude estimate of your memory usage.<br>
                                  >>>>> I'll target the
                                  biggest memory hogs only to get an
                                  order-of-magnitude estimate.<br>
                                  >>>>><br>
                                  >>>>> * The Fine grid
                                  operator contains 4223139840 non-zeros
                                  --> 1.8 GB per MPI rank assuming
                                  double precision.<br>
                                  >>>>> The indices for
                                  the AIJ could amount to another 0.3 GB
                                  (assuming 32 bit integers)<br>
                                  >>>>><br>
                                  >>>>> * You use 5
                                  levels of coarsening, so the other
                                  operators should represent
                                  (collectively)<br>
                                  >>>>> (2.1/8 + 2.1/8^2
                                  + 2.1/8^3 + 2.1/8^4) GB ~ 300 MB per MPI
                                  rank on the communicator with 18432
                                  ranks.<br>
                                  >>>>> The coarse grid
                                  should consume ~ 0.5 MB per MPI rank
                                  on the communicator with 18432 ranks.<br>
                                  >>>>><br>
                                  >>>>> * You use a
                                  reduction factor of 64, making the new
                                  communicator with 288 MPI ranks.<br>
                                  >>>>> PCTelescope will
                                  first gather a temporary matrix
                                  associated with your coarse level
                                  operator assuming a comm size of 288
                                  living on the comm with size 18432.<br>
                                  >>>>> This matrix will
                                  require approximately 0.5 * 64 = 32 MB
                                  per core on the 288 ranks.<br>
                                  >>>>> This matrix is
                                  then used to form a new MPIAIJ matrix
                                  on the subcomm, thus require another
                                  32 MB per rank.<br>
                                  >>>>> The temporary
                                  matrix is now destroyed.<br>
                                  >>>>><br>
                                  >>>>> * Because a DMDA
                                  is detected, a permutation matrix is
                                  assembled.<br>
                                  >>>>> This requires 2
                                  doubles per point in the DMDA.<br>
                                  >>>>> Your coarse DMDA
                                  contains 92 x 16 x 48 points.<br>
                                  >>>>> Thus the
                                  permutation matrix will require < 1
                                  MB per MPI rank on the sub-comm.<br>
                                  >>>>><br>
                                  >>>>> * Lastly, the
                                  matrix is permuted. This uses
                                  MatPtAP(), but the resulting operator
                                  will have the same memory footprint as
                                  the unpermuted matrix (32 MB). At any
                                  stage in PCTelescope, only 2 operators
                                  of size 32 MB are held in memory when
                                  the DMDA is provided.<br>
                                  >>>>><br>
                                  >>>>> From my rough
                                  estimates, the worst-case memory
                                  footprint for any given core, given
                                  your options, is approximately<br>
                                  >>>>> 2100 MB + 300 MB
                                  + 32 MB + 32 MB + 1 MB = 2465 MB.<br>
                                  >>>>> This is way below
                                  8 GB.<br>
                                  >>>>><br>
                                  >>>>> Note this
                                  estimate completely ignores:<br>
                                  >>>>> (1) the memory
                                  required for the restriction operator,<br>
                                  >>>>> (2) the potential
                                  growth in the number of non-zeros per
                                  row due to Galerkin coarsening (I
                                  wish -ksp_view_pre reported the
                                  output from MatView so we could see
                                  the number of non-zeros required by
                                  the coarse level operators)<br>
                                  >>>>> (3) all temporary
                                  vectors required by the CG solver, and
                                  those required by the smoothers.<br>
                                  >>>>> (4) internal
                                  memory allocated by MatPtAP<br>
                                  >>>>> (5) memory
                                  associated with IS's used within
                                  PCTelescope<br>
                                  >>>>><br>
                                  >>>>> So either I am
                                  completely off in my estimates, or you
                                  have not carefully estimated the
                                  memory usage of your application code.
                                  Hopefully others might examine/correct
                                  my rough estimates.<br>
                                  >>>>><br>
                                  >>>>> Since I don't
                                  have your code I cannot assess the
                                  latter.<br>
                                  >>>>> Since I don't
                                  have access to the same machine you
                                  are running on, I think we need to
                                  take a step back.<br>
                                  >>>>><br>
                                  >>>>> [1] What machine
                                  are you running on? Send me a URL if
                                  it's available.<br>
                                  >>>>><br>
                                  >>>>> [2] What
                                  discretization are you using? (I am
                                  guessing a scalar 7 point FD stencil)<br>
                                  >>>>> If it's a 7 point
                                  FD stencil, we should be able to
                                  examine the memory usage of your
                                  solver configuration using a standard,
                                  lightweight existing PETSc example,
                                  run on your machine at the same scale.<br>
                                  >>>>> This would
                                  hopefully enable us to correctly
                                  evaluate the actual memory usage
                                  required by the solver configuration
                                  you are using.<br>
                                  >>>>><br>
                                  >>>>> Thanks,<br>
                                  >>>>>   Dave<br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>> Frank<br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>><br>
                                  >>>>> On 07/08/2016
                                  10:38 PM, Dave May wrote:<br>
                                  >>>>>><br>
                                  >>>>>> On Saturday,
                                  9 July 2016, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>>>>> Hi Barry and
                                  Dave,<br>
                                  >>>>>><br>
                                  >>>>>> Thank you both
                                  for the advice.<br>
                                  >>>>>><br>
                                  >>>>>> @Barry<br>
                                  >>>>>> I made a
                                  mistake in the file names in the last
                                  email. I attached the correct files
                                  this time.<br>
                                  >>>>>> For all the
                                  three tests, 'Telescope' is used as
                                  the coarse preconditioner.<br>
                                  >>>>>><br>
                                  >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12<br>
                                  >>>>>> Part of the memory usage: Vector 125 124 3971904 0.<br>
                                  >>>>>> Matrix 101 101 9462372 0.<br>
                                  >>>>>><br>
                                  >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24<br>
                                  >>>>>> Part of the memory usage: Vector 125 124 681672 0.<br>
                                  >>>>>> Matrix 101 101 1462180 0.<br>
                                  >>>>>><br>
                                  >>>>>> In theory,
                                  the memory usage in Test1 should be 8
                                  times that of Test2. In my case, it is
                                  about 6 times.<br>
                                  >>>>>><br>
                                  >>>>>> == Test3:
                                  Grid: 3072*256*768,   Process Mesh:
                                  96*8*24. Sub-domain per process:
                                  32*32*32<br>
                                  >>>>>> Here I got
                                  the out-of-memory error.<br>
                                  >>>>>><br>
                                  >>>>>> I tried to
                                  use -mg_coarse jacobi. In this way, I
                                  don't need to set -mg_coarse_ksp_type
                                  and -mg_coarse_pc_type explicitly,
                                  right?<br>
                                  >>>>>> The linear
                                  solver didn't work in this case; PETSc
                                  output some errors.<br>
                                  >>>>>><br>
                                  >>>>>> @Dave<br>
                                  >>>>>> In Test3, I
                                  used only one instance of 'Telescope'.
                                  On the coarse mesh of 'Telescope', I
                                  used LU as the preconditioner instead
                                  of SVD.<br>
                                  >>>>>> If I set the
                                  levels correctly, then on the last
                                  coarse mesh of MG where it calls
                                  'Telescope', the sub-domain per
                                  process is 2*2*2.<br>
                                  >>>>>> On the last
                                  coarse mesh of 'Telescope', there is
                                  only one grid point per process.<br>
                                  >>>>>> I still got
                                  the OOM error. The detailed PETSc
                                  options file is attached.<br>
                                  >>>>>><br>
                                  >>>>>> Do you
                                  understand the expected memory usage
                                  for the particular parallel LU
                                  implementation you are using? I don't
                                  (seriously). Replace LU with bjacobi
                                  and re-run this test. My point about
                                  solver debugging is still valid.<br>
                                  >>>>>><br>
                                  >>>>>> And please
                                  send the result of KSPView so we can
                                  see what is actually used in the
                                  computations.<br>
                                  >>>>>><br>
                                  >>>>>> Thanks<br>
                                  >>>>>>   Dave<br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>> Thank you so
                                  much.<br>
                                  >>>>>><br>
                                  >>>>>> Frank<br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>> On 07/06/2016
                                  02:51 PM, Barry Smith wrote:<br>
                                  >>>>>> On Jul 6,
                                  2016, at 4:19 PM, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>>>>><br>
                                  >>>>>> Hi Barry,<br>
                                  >>>>>><br>
                                  >>>>>> Thank you for
                                  your advice.<br>
                                  >>>>>> I tried three
                                  tests. In the 1st test, the grid is
                                  3072*256*768 and the process mesh is
                                  96*8*24.<br>
                                  >>>>>> The linear
                                  solver is 'cg', the preconditioner is
                                  'mg', and 'telescope' is used as the
                                  preconditioner at the coarse mesh.<br>
                                  >>>>>> The system
                                  gives me the "Out of Memory" error
                                  before the linear system is completely
                                  solved.<br>
                                  >>>>>> The info from
                                  '-ksp_view_pre' is attached. It seems
                                  to me that the error occurs when it
                                  reaches the coarse mesh.<br>
                                  >>>>>><br>
                                  >>>>>> The 2nd test
                                  uses a grid of 1536*128*384 and the
                                  process mesh is 96*8*24. The 3rd test
                                  uses the same grid but a different
                                  process mesh, 48*4*12.<br>
                                  >>>>>> Are you
                                  sure this is right? The total matrix
                                  and vector memory usage goes from the
                                  2nd test<br>
                                  >>>>>> Vector 384 383 8,193,712 0.<br>
                                  >>>>>> Matrix 103 103 11,508,688 0.<br>
                                  >>>>>> to the 3rd test<br>
                                  >>>>>> Vector 384 383 1,590,520 0.<br>
                                  >>>>>> Matrix 103 103 3,508,664 0.<br>
                                  >>>>>> that is, the
                                  memory usage got smaller, but if you
                                  have only 1/8th the processes and the
                                  same grid it should have gotten about
                                  8 times bigger. Did you maybe cut the
                                  grid by a factor of 8 also? If so, that
                                  still doesn't explain it, because the
                                  memory usage changed by a factor of
                                  5-something for the vectors and
                                  3-something for the matrices.<br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>> The linear
                                  solver and PETSc options in the 2nd and
                                  3rd tests are the same as in the 1st
                                  test. The linear solver works fine in
                                  both tests.<br>
                                  >>>>>> I attached
                                  the memory usage of the 2nd and 3rd
                                  tests. The memory info is from the
                                  option '-log_summary'. I tried to use
                                  '-memory_info' as you suggested, but
                                  in my case PETSc treated it as an
                                  unused option. It output nothing about
                                  the memory. Do I need to add something
                                  to my code so I can use '-memory_info'?<br>
                                  >>>>>>     Sorry, my
                                  mistake: the option is -memory_view.<br>
                                  >>>>>><br>
                                  >>>>>>    Can you
                                  run the one case with -memory_view and
                                  -mg_coarse jacobi -ksp_max_it 1 (just
                                  so it doesn't iterate forever) to see
                                  how much memory is used without the
                                  telescope? Also run case 2 the same
                                  way.<br>
                                  >>>>>><br>
                                  >>>>>>    Barry<br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>><br>
                                  >>>>>> In both tests
                                  the memory usage is not large.<br>
                                  >>>>>><br>
                                  >>>>>> It seems to
                                  me that it might be the 'telescope' 
                                  preconditioner that allocated a lot of
                                  memory and caused the error in the 1st
                                  test.<br>
                                  >>>>>> Is there a
                                  way to show how much memory it
                                  allocated?<br>
                                  >>>>>><br>
                                  >>>>>> Frank<br>
                                  >>>>>><br>
                                  >>>>>> On 07/05/2016
                                  03:37 PM, Barry Smith wrote:<br>
                                  >>>>>>    Frank,<br>
                                  >>>>>><br>
                                  >>>>>>      You can
                                  run with -ksp_view_pre to have it
                                  "view" the KSP before the solve so
                                  hopefully it gets that far.<br>
                                  >>>>>><br>
                                  >>>>>>       Please
                                  run the problem that does fit with
                                  -memory_info; when the problem
                                  completes, it will show the "high water
                                  mark" for PETSc allocated memory and
                                  total memory used. We first want to
                                  look at these numbers to see if it is
                                  using more memory than you expect. You
                                  could also run with say half the grid
                                  spacing to see how the memory usage
                                  scaled with the increase in grid
                                  points. Make the runs also with
                                  -log_view and send all the output from
                                  these options.<br>
                                  >>>>>><br>
                                  >>>>>>     Barry<br>
                                  >>>>>><br>
                                  >>>>>> On Jul 5,
                                  2016, at 5:23 PM, frank <<a
                                    moz-do-not-send="true">hengjiew@uci.edu</a>>
                                  wrote:<br>
                                  >>>>>><br>
                                  >>>>>> Hi,<br>
                                  >>>>>><br>
                                  >>>>>> I am using
                                  the CG KSP solver and a multigrid
                                  preconditioner to solve a linear
                                  system in parallel.<br>
                                  >>>>>> I chose to use
                                  'Telescope' as the preconditioner on
                                  the coarse mesh for its good
                                  performance.<br>
                                  >>>>>> The PETSc
                                  options file is attached.<br>
                                  >>>>>><br>
                                  >>>>>> The domain is
                                  a 3D box.<br>
                                  >>>>>> It works well
                                  when the grid is 1536*128*384 and the
                                  process mesh is 96*8*24. When I double
                                  the size of the grid and keep the same
                                  process mesh and PETSc options, I get
                                  an "out of memory" error from the
                                  super-cluster I am using.<br>
                                  >>>>>> Each process
                                  has access to at least 8G of memory,
                                  which should be more than enough for
                                  my application. I am sure that all the
                                  other parts of my code (except the
                                  linear solver) do not use much
                                  memory. So I suspect there is
                                  something wrong with the linear
                                  solver.<br>
                                  >>>>>> The error
                                  occurs before the linear system is
                                  completely solved, so I don't have the
                                  info from ksp_view. I am not able to
                                  reproduce the error with a smaller
                                  problem either.<br>
                                  >>>>>> In addition,
                                  I tried to use block Jacobi as the
                                  preconditioner with the same grid and
                                  same decomposition. The linear solver
                                  runs extremely slowly but there is no
                                  memory error.<br>
                                  >>>>>><br>
                                  >>>>>> How can I
                                  diagnose what exactly causes the error?<br>
                                  >>>>>> Thank you so
                                  much.<br>
                                  >>>>>><br>
                                  >>>>>> Frank<br>
                                  >>>>>>
                                  <petsc_options.txt><br>
                                  >>>>>>
                                  <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
                                  >>>>>><br>
                                  >>>>><br>
                                  >>>><br>
                                  >>>
                                  <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt><br>
                                  ><br>
                                  <br>
                                </blockquote>
                              </div>
                            </blockquote>
                            <br>
                          </div>
                        </blockquote>
                        <div> </div>
                      </blockquote>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </body>
</html>