<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Oct 6, 2016 at 7:33 PM, frank <span dir="ltr"><<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Dear Dave,</p>
    Following your advice, I solved the identical equation twice and timed
    the two solves separately. The results are below:<br>
    <br>
    Test: 1024^3 grid points<br>
    Cores#    reduction factor    MG levels#    1st solve time (s)    2nd solve time (s)<br>
    4096      64                  6 + 3         3.85                  1.75<br>
    8192      128                 5 + 3         5.52                  0.91<br>
    16384     256                 5 + 3         5.37                  0.52<br>
    32768     512                 5 + 4         3.03                  0.36<br>
    32768     64 | 8              4 | 3 | 3     2.80                  0.43<br>
    65536     1024                5 + 4         3.38                  0.59<br>
    65536     32 | 32             4 | 4 | 3     2.14                  0.22<br>
    <br>
    
    
    
    I also attached the log_view info from all the runs. Each file
    is named by the core count + reduction factor.<br>
    The ksp_view and petsc_options for the 1st run are also included.
    The others are similar; the only differences are the reduction factor
    and the MG levels.<br>
    <br>
    ** The time for the 1st solve is generally much larger than the 2nd. Is this
    because the KSP solver on the sub-communicator is set up during the
    1st solve?<br></div></blockquote><div><br></div><div>All setup is done in the first solve.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    ** The time for the 1st solve does not scale.<br>
        In practice, I am solving a variable-coefficient Poisson
    equation. I need to rebuild the matrix at every time step. Therefore,
    each step is similar to the 1st solve, which does not scale. Is there
    a way I can improve the performance?<br></div></blockquote><div><br></div><div>You could use rediscretization instead of Galerkin to produce the coarse operators.</div>
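<div><br></div><div>For a DMDA-based Poisson problem, rediscretization means letting PCMG call your assembly routine on each level's DMDA instead of forming the coarse operators with Galerkin products (MatPtAP). A rough sketch of that setup in C is below (it follows the pattern of the PETSc KSP tutorials; ComputeMatrix, ComputeRHS and the user context are placeholder names, and you would want to check how this interacts with the repartitioned DMDA that PCTelescope creates on the sub-communicator):</div><div><br></div>
<div>
PetscErrorCode ComputeMatrix(KSP ksp, Mat J, Mat Jpre, void *ctx)<br>
{<br>
  DM da;<br>
  KSPGetDM(ksp, &da);<br>
  /* assemble the variable-coefficient 7-point stencil on whatever DMDA<br>
     this level owns, using the coefficients of the current time step */<br>
  return 0;<br>
}<br>
<br>
/* in the main code, instead of assembling only the fine-grid matrix yourself: */<br>
KSPSetDM(ksp, da);<br>
KSPSetComputeRHS(ksp, ComputeRHS, &user);<br>
KSPSetComputeOperators(ksp, ComputeMatrix, &user);<br>
/* with -pc_type mg, each level is then rediscretized on its own DMDA */<br>
KSPSolve(ksp, NULL, NULL);<br>
</div>
<div><br></div><div>The trade-off is that you re-assemble small coarse matrices at each time step instead of calling MatPtAP, which removes that part of the setup cost from every step.</div>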
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    ** The 2nd solve scales, but not very well beyond 16384
    cores.<br></div></blockquote><div><br></div><div>How much scaling were you looking for? This is strong scaling, which has an Amdahl's Law limit.</div>
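<div><br></div><div>(Concretely, for a fixed problem size Amdahl's Law bounds the speedup on N processes by 1 / (s + (1 - s)/N), where s is the fraction of the work that does not parallelize, e.g. the coarse-grid solve and the global reductions in CG, so the 2nd-solve time has to flatten out at some core count.)</div>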
    It seems to me that the performance depends on the tuning of the MG
    levels on the sub-communicator(s).<br>
        Are there general strategies regarding how to distribute the
    levels, or when to use multiple sub-communicators? <br></div></blockquote><div><br></div><div>Also, you use CG/MG when FMG by itself would probably be faster. Your smoother is likely not strong enough, and you should use something like V(2,2). There is a lot of tuning that is possible, but difficult to automate.</div>
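<div><br></div><div>As a rough sketch of the options this suggests (a starting point to tune rather than a recipe; the exact smoother choice is an assumption, and the prefixed variants reuse the mg_coarse_telescope_ prefixes already present in your options file):</div><div><br></div>
<div>
-ksp_type richardson<br>
-pc_type mg<br>
-pc_mg_type full<br>
-mg_levels_ksp_type chebyshev<br>
-mg_levels_ksp_max_it 2<br>
-mg_levels_pc_type sor<br>
-mg_coarse_telescope_pc_mg_type full<br>
-mg_coarse_telescope_mg_levels_ksp_max_it 2<br>
</div>
<div><br></div><div>-pc_mg_type full turns on full multigrid (FMG), and mg_levels_ksp_max_it 2 gives two pre- and two post-smoothing sweeps per level, i.e. the V(2,2) mentioned above. Check the result against -ksp_monitor_true_residual to see whether FMG alone reaches your tolerance or you still need a few outer Krylov iterations.</div>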
<div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    Thank you.<br>
        <br>
    Regards,<br>
    Frank<div><div class="h5"><br>
    <br>
    <br>
    <br>
    <br>
    <div class="m_-3012109709631955293moz-cite-prefix">On 10/04/2016 12:56 PM, Dave May wrote:<br>
    </div>
    <blockquote type="cite"><br>
      <br>
      On Tuesday, 4 October 2016, frank <<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>> wrote:<br>
      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
        <div bgcolor="#FFFFFF" text="#000000">
          <p>Hi,</p>
          This question is a follow-up to the thread "Question about
          memory usage in Multigrid preconditioner".<br>
          I used to have the "Out of Memory (OOM)" problem when using the
          CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0
          -matptap_scalable" options did solve that problem. <br>
          <br>
          Then I tested the scalability by solving a 3D Poisson equation for 1
          step. I used one sub-communicator in all the tests. The
          differences between the petsc options in those tests are: (1) the
          pc_telescope_reduction_factor; (2) the number of multigrid
          levels in the up/down solver. The function KSPSolve is
          timed. It is kind of slow and doesn't scale at all. <br>
          <br>
          Test1: 512^3 grid points<br>
          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
          512       8                              4 / 3                             6.2466<br>
          4096      64                             5 / 3                             0.9361<br>
          32768     64                             4 / 3                             4.8914<br>
          <br>
          Test2: 1024^3 grid points<br>
          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
          4096      64                             5 / 4                             3.4139<br>
          8192      128                            5 / 4                             2.4196<br>
          16384     32                             5 / 3                             5.4150<br>
          32768     64                             5 / 3                             5.6067<br>
          65536     128                            5 / 3                             6.5219</div>
      </blockquote>
      <div><br>
      </div>
      <div>You have to be very careful how you interpret these numbers.
        Your solver contains nested calls to KSPSolve, and unfortunately
        as a result the numbers you report include setup time. This will
        remain true even if you call KSPSetUp on the outermost KSP. </div>
      <div><br>
      </div>
      <div>Your email concerns scalability of the solver application, so
        let's focus on that issue.</div>
      <div><br>
      </div>
      <div>The only way to clearly separate setup from solve time is
        to perform two identical solves. The second solve will not
        require any setup. You should monitor the second solve via a new
        PetscLogStage.</div>
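      <div><br></div>
      <div>A minimal sketch of that timing pattern in C (assuming the KSP is called ksp
        and the vectors b, x already exist; the stage name is arbitrary and error
        checking is omitted):</div>
      <div><br></div>
      <div>
      PetscLogStage stage2;<br>
      PetscLogStageRegister("Solve 2 (no setup)", &stage2);<br>
      <br>
      KSPSolve(ksp, b, x);     /* 1st solve: includes all setup */<br>
      <br>
      VecSet(x, 0.0);          /* reset the initial guess so the two solves are identical */<br>
      PetscLogStagePush(stage2);<br>
      KSPSolve(ksp, b, x);     /* 2nd solve: pure solve time */<br>
      PetscLogStagePop();<br>
      </div>
      <div><br></div>
      <div>-log_view then reports the second solve in its own stage, separate from the
        setup-heavy first one.</div>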
      <div><br>
      </div>
      <div>This was what I did in the telescope paper. It was the only
        way to understand the setup cost (and scaling) cf the solve time
        (and scaling).</div>
      <div><br>
      </div>
      <div>Thanks</div>
      <div>  Dave</div>
      <div>
        <div>
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> I guess I didn't set
              the MG levels properly. What would be an efficient way to
              arrange the MG levels?<br>
              Also, which preconditioner on the coarse mesh of the 2nd
              communicator should I use to improve the performance? <br>
              <br>
              I attached the test code and the petsc options file for
              the 1024^3 cube with 32768 cores. <br>
              <br>
              Thank you.<br>
              <br>
              Regards,<br>
              Frank<br>
              <br>
              <br>
              <br>
              <br>
              <br>
              <br>
              <div>On 09/15/2016 03:35 AM, Dave May wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div>
                    <div>
                      <div>
                        <div>
                          <div>Hi all,<br>
                            <br>
                          </div>
                          <div>The only unexpected memory usage I can
                            see is associated with the call to
                            MatPtAP().<br>
                          </div>
                          <div>Here is something you can try
                            immediately.<br>
                          </div>
                        </div>
                        Run your code with the additional options<br>
                          -matrap 0 -matptap_scalable<br>
                        <br>
                      </div>
                      <div>I didn't realize this before, but the default
                        behaviour of MatPtAP in parallel is actually to
                        explicitly form the transpose of P (i.e.
                        assemble R = P^T) and then compute R.A.P. <br>
                        You don't want to do this. The option -matrap 0
                        resolves this issue.<br>
                      </div>
                      <div><br>
                      </div>
                      <div>The implementation of P^T.A.P has two
                        variants. <br>
                        The scalable implementation (with respect to
                        memory usage) is selected via the second option
                        -matptap_scalable.</div>
                      <div><br>
                      </div>
                      <div>Try it out - I see a significant memory
                        reduction using these options for particular
                        mesh sizes / partitions.<br>
                      </div>
                      <div><br>
                      </div>
                      I've attached a cleaned up version of the code you
                      sent me.<br>
                    </div>
                    There were a number of memory leaks and other
                    issues.<br>
                  </div>
                  <div>The main points being<br>
                  </div>
                    * You should call DMDAVecGetArrayF90() before
                  VecAssembly{Begin,End}<br>
                    * You should call PetscFinalize(), otherwise the
                  option -log_summary (-log_view) will not display
                  anything once the program has completed.<br>
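                  <div><br></div>
                  <div>For reference, a minimal C skeleton of that ordering (the same ordering
                    applies through the Fortran interface; the solver body is elided):</div>
                  <div><br></div>
                  <div>
                  #include "petscksp.h"<br>
                  int main(int argc, char **argv)<br>
                  {<br>
                    PetscInitialize(&argc, &argv, NULL, NULL);<br>
                    /* ... create the DMDA, assemble, call KSPSolve ... */<br>
                    PetscFinalize();   /* -log_summary / -log_view output is printed here */<br>
                    return 0;<br>
                  }<br>
                  </div>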
                  <div>
                    <div>
                      <div><br>
                        <br>
                      </div>
                      <div>Thanks,<br>
                      </div>
                      <div>  Dave<br>
                      </div>
                      <div>
                        <div>
                          <div><br>
                          </div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
                <div class="gmail_extra"><br>
                  <div class="gmail_quote">On 15 September 2016 at
                    08:03, Hengjie Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
                        <br>
                        Sorry, I should have put more comments in the code
                        to explain it.<br>
                        The number of processes in each dimension is the
                        same: Px = Py = Pz = P. So is the domain size.<br>
                        So if you want to run the code for 512^3
                        grid points on 16^3 cores, you need to set "-N
                        512 -P 16" on the command line.<br>
                        I added more comments and also fixed an error in the
                        attached code. (The error only affects the
                        accuracy of the solution, not the memory usage.)
                        <br>
                        <div><br>
                          Thank you.<span><font color="#888888"><br>
                              Frank</font></span>
                          <div>
                            <div><br>
                              <br>
                              On 9/14/2016 9:05 PM, Dave May wrote:<br>
                            </div>
                          </div>
                        </div>
                        <div>
                          <div>
                            <blockquote type="cite"><br>
                              <br>
                              On Thursday, 15 September 2016, Dave May
                              <<a>dave.mayhem23@gmail.com</a>>
                              wrote:<br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                                <br>
                                On Thursday, 15 September 2016, frank
                                <<a>hengjiew@uci.edu</a>>
                                wrote:<br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <div bgcolor="#FFFFFF" text="#000000">
                                    Hi, <br>
                                    <br>
                                    I wrote a simple code to reproduce
                                    the error. I hope this can help
                                    diagnose the problem.<br>
                                    The code just solves a 3D Poisson
                                    equation. </div>
                                </blockquote>
                                <div><br>
                                </div>
                                <div>Why is the stencil width a runtime
                                  parameter?? And why is the default
                                  value 2? For 7-pnt FD Laplace, you
                                  only need a stencil width of 1. </div>
                                <div><br>
                                </div>
                                <div>Was this choice made to mimic
                                  something in the real application
                                  code?</div>
                              </blockquote>
                              <div><br>
                              </div>
                              Please ignore - I misunderstood your usage
                              of the param set by -P
                              <div>
                                <div> </div>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <div> </div>
                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                    <div bgcolor="#FFFFFF" text="#000000"><br>
                                      I ran the code on a 1024^3 mesh.
                                      The process partition is 32 * 32 *
                                      32. That's when I reproduce the
                                      OOM error. Each core has about 2G
                                      memory.<br>
                                      I also ran the code on a 512^3
                                      mesh with 16 * 16 * 16 processes.
                                      The KSP solver works fine. <br>
                                      I attached the code,
                                      ksp_view_pre's output and my petsc
                                      option file.<br>
                                      <br>
                                      Thank you.<br>
                                      Frank<br>
                                      <div><br>
                                        On 09/09/2016 06:38 PM, Hengjie
                                        Wang wrote:<br>
                                      </div>
                                      <blockquote type="cite">Hi Barry, 
                                        <div><br>
                                        </div>
                                        <div>I checked. On the
                                          supercomputer, I had the
                                          option "-ksp_view_pre" but it
                                          is not in the file I sent you. I
                                          am sorry for the confusion.</div>
                                        <div><br>
                                        </div>
                                        <div>Regards,</div>
                                        <div>Frank<span></span><br>
                                          <br>
                                          On Friday, September 9, 2016,
                                          Barry Smith <<a>bsmith@mcs.anl.gov</a>>
                                          wrote:<br>
                                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                                            > On Sep 9, 2016, at 3:11
                                            PM, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            ><br>
                                            > Hi Barry,<br>
                                            ><br>
                                            > I think the first KSP
                                            view output is from
                                            -ksp_view_pre. Before I
                                            submitted the test, I was
                                            not sure whether there would
                                            be OOM error or not. So I
                                            added both -ksp_view_pre and
                                            -ksp_view.<br>
                                            <br>
                                              But the options file you
                                            sent specifically does NOT
                                            list the -ksp_view_pre so
                                            how could it be from that?<br>
                                            <br>
                                               Sorry to be pedantic but
                                            I've spent too much time in
                                            the past trying to debug
                                            from incorrect information
                                            and want to make sure that
                                            the information I have is
                                            correct before thinking.
                                            Please recheck exactly what
                                            happened. Rerun with the
                                            exact input file you emailed
                                            if that is needed.<br>
                                            <br>
                                               Barry<br>
                                            <br>
                                            ><br>
                                            > Frank<br>
                                            ><br>
                                            ><br>
                                            > On 09/09/2016 12:38 PM,
                                            Barry Smith wrote:<br>
                                            >>   Why does
                                            ksp_view2.txt have two KSP
                                            views in it while
                                            ksp_view1.txt has only one
                                            KSPView in it? Did you run
                                            two different solves in the
                                            2 case but not the one?<br>
                                            >><br>
                                            >>   Barry<br>
                                            >><br>
                                            >><br>
                                            >><br>
                                            >>> On Sep 9, 2016,
                                            at 10:56 AM, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>><br>
                                            >>> Hi,<br>
                                            >>><br>
                                            >>> I want to
                                            continue digging into the
                                            memory problem here.<br>
                                             >>> I did find a
                                             workaround in the past,
                                             which is to use fewer cores
                                             per node so that each core
                                             has 8G memory. However this
                                             is inefficient and expensive.
                                             I hope to locate the place
                                             that uses the most memory.<br>
                                            >>><br>
                                            >>> Here is a brief
                                            summary of the tests I did
                                            in past:<br>
                                             >>>> Test1:  Mesh 1536*128*384  |  Process Mesh 48*4*12<br>
                                             >>> Maximum (over computational time) process memory:          total 7.0727e+08<br>
                                             >>> Current process memory:                                     total 7.0727e+08<br>
                                             >>> Maximum (over computational time) space PetscMalloc()ed:   total 6.3908e+11<br>
                                             >>> Current space PetscMalloc()ed:                              total 1.8275e+09<br>
                                             >>><br>
                                             >>>> Test2:  Mesh 1536*128*384  |  Process Mesh 96*8*24<br>
                                             >>> Maximum (over computational time) process memory:          total 5.9431e+09<br>
                                             >>> Current process memory:                                     total 5.9431e+09<br>
                                             >>> Maximum (over computational time) space PetscMalloc()ed:   total 5.3202e+12<br>
                                             >>> Current space PetscMalloc()ed:                              total 5.4844e+09<br>
                                             >>><br>
                                             >>>> Test3:  Mesh 3072*256*768  |  Process Mesh 96*8*24<br>
                                             >>>     OOM( Out Of Memory ) killer of the supercomputer terminated the job during "KSPSolve".<br>
                                            >>><br>
                                            >>> I attached the
                                            output of ksp_view( the
                                            third test's output is from
                                            ksp_view_pre ), memory_view
                                            and also the petsc options.<br>
                                            >>><br>
                                             >>> In all the
                                             tests, each core can access
                                             about 2G memory. In test3,
                                             there are 4223139840
                                             non-zeros in the matrix.
                                             This will consume about
                                             1.74M per process, using double
                                             precision. Considering some
                                             extra memory used to store
                                             integer indices, 2G memory
                                             should still be more than enough.<br>
                                            >>><br>
                                            >>> Is there a way
                                            to find out which part of
                                            KSPSolve uses the most
                                            memory?<br>
                                            >>> Thank you so
                                            much.<br>
                                            >>><br>
                                             >>> BTW, there are
                                             4 options that remain unused and
                                             I don't understand why they
                                             are omitted:<br>
                                             >>>
                                             -mg_coarse_telescope_mg_coarse_ksp_type
                                             value: preonly<br>
                                             >>>
                                             -mg_coarse_telescope_mg_coarse_pc_type
                                             value: bjacobi<br>
                                             >>>
                                             -mg_coarse_telescope_mg_levels_ksp_max_it
                                             value: 1<br>
                                             >>>
                                             -mg_coarse_telescope_mg_levels_ksp_type
                                             value: richardson<br>
                                            >>><br>
                                            >>><br>
                                            >>> Regards,<br>
                                            >>> Frank<br>
                                            >>><br>
                                            >>> On 07/13/2016
                                            05:47 PM, Dave May wrote:<br>
                                            >>>><br>
                                            >>>> On 14 July
                                            2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>> Hi Dave,<br>
                                            >>>><br>
                                            >>>> Sorry for
                                            the late reply.<br>
                                            >>>> Thank you
                                            so much for your detailed
                                            reply.<br>
                                            >>>><br>
                                            >>>> I have a
                                            question about the
                                            estimation of the memory
                                            usage. There are 4223139840
                                            allocated non-zeros and
                                            18432 MPI processes. Double
                                            precision is used. So the
                                            memory per process is:<br>
                                            >>>> 
                                             4223139840 * 8bytes / 18432
                                            / 1024 / 1024 = 1.74M ?<br>
                                            >>>> Did I do
                                            sth wrong here? Because this
                                            seems too small.<br>
                                            >>>><br>
                                            >>>> No - I
                                            totally f***ed it up. You
                                            are correct. That'll teach
                                            me for fumbling around with
                                            my iphone calculator and not
                                            using my brain. (Note that
                                            to convert to MB just divide
                                            by 1e6, not 1024^2 -
                                            although I apparently cannot
                                            convert between units
                                            correctly....)<br>
                                            >>>><br>
                                            >>>> From the
                                            PETSc objects associated
                                            with the solver, It looks
                                            like it _should_ run with
                                            2GB per MPI rank. Sorry for
                                            my mistake. Possibilities
                                            are: somewhere in your usage
                                            of PETSc you've introduced a
                                            memory leak; PETSc is doing
                                            a huge over allocation (e.g.
                                            as per our discussion of
                                            MatPtAP); or in your
                                            application code there are
                                            other objects you have
                                            forgotten to log the memory
                                            for.<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> I am
                                            running this job on
                                            Bluewater<br>
                                            >>>> I am using
                                            the 7 points FD stencil in
                                            3D.<br>
                                            >>>><br>
                                            >>>> I thought
                                            so on both counts.<br>
                                            >>>><br>
                                             >>>> I apologize
                                             that I made a stupid mistake
                                             in computing the memory per
                                             core. With my settings,
                                             each core can access only 2G
                                             memory on average instead of
                                             the 8G I mentioned in the
                                             previous email. I re-ran the
                                             job with 8G memory per core
                                             on average and there is no
                                             "Out Of Memory" error. I
                                             will do more tests to see if
                                             there is still some memory
                                             issue.<br>
                                            >>>><br>
                                            >>>> Ok. I'd
                                            still like to know where the
                                            memory was being used since
                                            my estimates were off.<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> Thanks,<br>
                                            >>>>   Dave<br>
                                            >>>><br>
                                            >>>> Regards,<br>
                                            >>>> Frank<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> On
                                            07/11/2016 01:18 PM, Dave
                                            May wrote:<br>
                                            >>>>> Hi
                                            Frank,<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> On 11
                                            July 2016 at 19:14, frank
                                            <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>>> Hi
                                            Dave,<br>
                                            >>>>><br>
                                            >>>>> I
                                            re-run the test using
                                            bjacobi as the
                                            preconditioner on the coarse
                                            mesh of telescope. The Grid
                                            is 3072*256*768 and process
                                            mesh is 96*8*24. The petsc
                                            option file is attached.<br>
                                            >>>>> I still
                                            got the "Out Of Memory"
                                            error. The error occurred
                                            before the linear solver
                                            finished one step. So I
                                            don't have the full info
                                            from ksp_view. The info from
                                            ksp_view_pre is attached.<br>
                                            >>>>><br>
                                            >>>>> Okay -
                                            that is essentially useless
                                            (sorry)<br>
                                            >>>>><br>
                                            >>>>> It
                                            seems to me that the error
                                            occurred when the
                                            decomposition was going to
                                            be changed.<br>
                                            >>>>><br>
                                            >>>>> Based
                                            on what information?<br>
                                            >>>>> Running
                                            with -info would give us
                                            more clues, but will create
                                            a ton of output.<br>
                                            >>>>> Please
                                            try running the case which
                                            failed with -info<br>
                                            >>>>>  I had
                                            another test with a grid of
                                            1536*128*384 and the same
                                            process mesh as above. There
                                            was no error. The ksp_view
                                            info is attached for
                                            comparison.<br>
                                            >>>>> Thank
                                            you.<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> [3]
                                            Here is my crude estimate of
                                            your memory usage.<br>
                                            >>>>> I'll
                                            target the biggest memory
                                            hogs only to get an order of
                                            magnitude estimate<br>
                                            >>>>><br>
                                            >>>>> * The
                                            Fine grid operator contains
                                            4223139840 non-zeros -->
                                            1.8 GB per MPI rank assuming
                                            double precision.<br>
                                            >>>>> The
                                            indices for the AIJ could
                                            amount to another 0.3 GB
                                            (assuming 32 bit integers)<br>
                                            >>>>><br>
                                            >>>>> * You
                                            use 5 levels of coarsening,
                                            so the other operators
                                            should represent
                                            (collectively)<br>
                                            >>>>> 2.1 / 8
                                            + 2.1/8^2 + 2.1/8^3 +
                                            2.1/8^4  ~ 300 MB per MPI
                                            rank on the communicator
                                            with 18432 ranks.<br>
                                            >>>>> The
                                            coarse grid should consume ~
                                            0.5 MB per MPI rank on the
                                            communicator with 18432
                                            ranks.<br>
                                            >>>>><br>
                                            >>>>> * You
                                            use a reduction factor of
                                            64, making the new
                                            communicator with 288 MPI
                                            ranks.<br>
                                            >>>>>
                                            PCTelescope will first
                                            gather a temporary matrix
                                            associated with your coarse
                                            level operator assuming a
                                            comm size of 288 living on
                                            the comm with size 18432.<br>
                                            >>>>> This
                                            matrix will require
                                            approximately 0.5 * 64 = 32
                                            MB per core on the 288
                                            ranks.<br>
                                            >>>>> This
                                            matrix is then used to form
                                            a new MPIAIJ matrix on the
                                            subcomm, thus require
                                            another 32 MB per rank.<br>
                                            >>>>> The
                                            temporary matrix is now
                                            destroyed.<br>
                                            >>>>><br>
                                            >>>>> *
                                            Because a DMDA is detected,
                                            a permutation matrix is
                                            assembled.<br>
                                            >>>>> This
                                            requires 2 doubles per point
                                            in the DMDA.<br>
                                            >>>>> Your
                                            coarse DMDA contains 92 x 16
                                            x 48 points.<br>
                                            >>>>> Thus
                                            the permutation matrix will
                                            require < 1 MB per MPI
                                            rank on the sub-comm.<br>
                                            >>>>><br>
                                            >>>>> *
                                            Lastly, the matrix is
                                            permuted. This uses
                                            MatPtAP(), but the resulting
                                            operator will have the same
                                            memory footprint as the
                                            unpermuted matrix (32 MB).
                                            At any stage in PCTelescope,
                                            only 2 operators of size 32
                                            MB are held in memory when
                                            the DMDA is provided.<br>
                                            >>>>><br>
                                            >>>>> From my
                                            rough estimates, the worst
                                            case memory foot print for
                                            any given core, given your
                                            options is approximately<br>
                                            >>>>> 2100 MB
                                            + 300 MB + 32 MB + 32 MB + 1
                                            MB  = 2465 MB<br>
                                            >>>>> This is
                                            way below 8 GB.<br>
                                            >>>>><br>
                                            >>>>> Note
                                            this estimate completely
                                            ignores:<br>
                                            >>>>> (1) the
                                            memory required for the
                                            restriction operator,<br>
                                            >>>>> (2) the
                                            potential growth in the
                                            number of non-zeros per row
                                            due to Galerkin coarsening
                                            (I wished -ksp_view_pre
                                            reported the output from
                                            MatView so we could see the
                                            number of non-zeros required
                                            by the coarse level
                                            operators)<br>
                                            >>>>> (3) all
                                            temporary vectors required
                                            by the CG solver, and those
                                            required by the smoothers.<br>
                                            >>>>> (4)
                                            internal memory allocated by
                                            MatPtAP<br>
                                            >>>>> (5)
                                            memory associated with IS's
                                            used within PCTelescope<br>
                                            >>>>><br>
                                            >>>>> So
                                            either I am completely off
                                            in my estimates, or you have
                                            not carefully estimated the
                                            memory usage of your
                                            application code. Hopefully
                                            others might examine/correct
                                            my rough estimates<br>
                                            >>>>><br>
                                            >>>>> Since I
                                            don't have your code I
                                            cannot access the latter.<br>
                                            >>>>> Since I
                                            don't have access to the
                                            same machine you are running
                                            on, I think we need to take
                                            a step back.<br>
                                            >>>>><br>
                                            >>>>> [1]
                                            What machine are you running
                                            on? Send me a URL if its
                                            available<br>
                                            >>>>><br>
                                            >>>>> [2]
                                            What discretization are you
                                            using? (I am guessing a
                                            scalar 7 point FD stencil)<br>
                                            >>>>> If it's
                                            a 7 point FD stencil, we
                                            should be able to examine
                                            the memory usage of your
                                            solver configuration using a
                                            standard, light weight
                                            existing PETSc example, run
                                            on your machine at the same
                                            scale.<br>
                                            >>>>> This
                                            would hopefully enable us to
                                            correctly evaluate the
                                            actual memory usage required
                                            by the solver configuration
                                            you are using.<br>
                                            >>>>><br>
                                            >>>>> Thanks,<br>
                                            >>>>>   Dave<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> Frank<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> On
                                            07/08/2016 10:38 PM, Dave
                                            May wrote:<br>
                                            >>>>>><br>
                                            >>>>>> On
                                            Saturday, 9 July 2016, frank
                                            <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>>>> Hi
                                            Barry and Dave,<br>
                                            >>>>>><br>
                                            >>>>>>
                                            Thank both of you for the
                                            advice.<br>
                                            >>>>>><br>
                                            >>>>>>
                                            @Barry<br>
                                            >>>>>> I
                                            made a mistake in the file
                                            names in last email. I
                                            attached the correct files
                                            this time.<br>
                                            >>>>>> For
                                            all the three tests,
                                            'Telescope' is used as the
                                            coarse preconditioner.<br>
                                            >>>>>><br>
                                             >>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12<br>
                                             >>>>>> Part of the memory usage:  Vector   125   124   3971904   0.<br>
                                             >>>>>>                            Matrix   101   101   9462372   0<br>
                                             >>>>>><br>
                                             >>>>>> == Test2:   Grid: 1536*128*384,   Process Mesh: 96*8*24<br>
                                             >>>>>> Part of the memory usage:  Vector   125   124    681672   0.<br>
                                             >>>>>>                            Matrix   101   101   1462180   0.<br>
                                            >>>>>><br>
                                            >>>>>> In
                                            theory, the memory usage in
                                            Test1 should be 8 times of
                                            Test2. In my case, it is
                                            about 6 times.<br>
                                            >>>>>><br>
                                            >>>>>> ==
                                            Test3: Grid: 3072*256*768, 
                                             Process Mesh: 96*8*24.
                                            Sub-domain per process:
                                            32*32*32<br>
                                            >>>>>>
                                            Here I get the out of memory
                                            error.<br>
                                            >>>>>><br>
                                            >>>>>> I
                                            tried to use -mg_coarse
                                            jacobi. In this way, I don't
                                            need to set
                                            -mg_coarse_ksp_type and
                                            -mg_coarse_pc_type
                                            explicitly, right?<br>
                                            >>>>>> The
                                            linear solver didn't work in
                                            this case. Petsc output some
                                            errors.<br>
                                            >>>>>><br>
                                            >>>>>>
                                            @Dave<br>
                                            >>>>>> In
                                            test3, I use only one
                                            instance of 'Telescope'. On
                                            the coarse mesh of
                                            'Telescope', I used LU as
                                            the preconditioner instead
                                            of SVD.<br>
                                             >>>>>> If
                                             I set the levels correctly,
                                             then on the last coarse mesh
                                             of MG where it calls
                                             'Telescope', the sub-domain
                                             per process is 2*2*2.<br>
                                            >>>>>> On
                                            the last coarse mesh of
                                            'Telescope', there is only
                                            one grid point per process.<br>
                                            >>>>>> I
                                            still got the OOM error. The
                                            detailed petsc option file
                                            is attached.<br>
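>>>>>> In case it is useful, the structure of that solver is roughly as sketched below; the level counts and reduction factor here are placeholders, and the attached option file has the actual values:<br>
>>>>>> # rough structure of the Test3 solver (numbers are placeholders)<br>
>>>>>> -ksp_type cg<br>
>>>>>> -pc_type mg<br>
>>>>>> # outer MG: coarsen the 32*32*32 sub-domain per process down to 2*2*2<br>
>>>>>> -pc_mg_levels 5<br>
>>>>>> # on the outer coarse level, gather the problem onto a sub-communicator<br>
>>>>>> -mg_coarse_pc_type telescope<br>
>>>>>> -mg_coarse_pc_telescope_reduction_factor 64<br>
>>>>>> # multigrid again on the sub-communicator, down to one grid point per process<br>
>>>>>> -mg_coarse_telescope_pc_type mg<br>
>>>>>> -mg_coarse_telescope_pc_mg_levels 4<br>
>>>>>> # LU instead of SVD on the last coarse mesh of 'Telescope'<br>
>>>>>> -mg_coarse_telescope_mg_coarse_pc_type lu<br>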
                                            >>>>>><br>
>>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.<br>
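>>>>>> (Assuming the nested option prefixes sketched above, that would be a one-line change along the lines of:)<br>
>>>>>> # use block Jacobi instead of the direct solve on the innermost coarse mesh<br>
>>>>>> -mg_coarse_telescope_mg_coarse_pc_type bjacobi<br>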
                                            >>>>>><br>
>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.<br>
                                            >>>>>><br>
>>>>>> Thanks<br>
>>>>>>   Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
                                            >>>>>><br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:<br>
>>>>>> On Jul 6, 2016, at 4:19 PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                            >>>>>><br>
>>>>>> Hi Barry,<br>
>>>>>><br>
>>>>>> Thank you for your advice.<br>
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.<br>
>>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.<br>
>>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.<br>
                                            >>>>>><br>
>>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.<br>
>>>>>>    Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test<br>
>>>>>>              Vector   384        383      8,193,712     0.<br>
>>>>>>              Matrix   103        103     11,508,688     0.<br>
>>>>>> to the 3rd test<br>
>>>>>>              Vector   384        383      1,590,520     0.<br>
>>>>>>              Matrix   103        103      3,508,664     0.<br>
>>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of about 5 for the vectors and about 3 for the matrices.<br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.<br>
>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case petsc treated it as an unused option, and it output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?<br>
>>>>>>    Sorry, my mistake, the option is -memory_view<br>
                                            >>>>>><br>
>>>>>>    Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.<br>
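>>>>>> For example (use whatever launcher your cluster provides; ./your_app stands in for your executable, and -mg_coarse jacobi is written out as the explicit option):<br>
>>>>>> # case 1 (3072*256*768 grid) on the 96*8*24 = 18432-rank mesh, together with the rest of your existing options file<br>
>>>>>> mpiexec -n 18432 ./your_app -ksp_type cg -pc_type mg -mg_coarse_pc_type jacobi -ksp_max_it 1 -memory_view -log_view<br>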
                                            >>>>>><br>
>>>>>>     Barry<br>
                                            >>>>>><br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> In both tests the memory usage is not large.<br>
>>>>>><br>
>>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test.<br>
>>>>>> Is there a way to show how much memory it allocated?<br>
>>>>>><br>
>>>>>> Frank<br>
                                            >>>>>><br>
>>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:<br>
>>>>>>    Frank,<br>
>>>>>><br>
>>>>>>      You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.<br>
>>>>>><br>
>>>>>>       Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.<br>
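>>>>>> For example, something along these lines (launcher and executable name are placeholders):<br>
>>>>>> # run the case that fits (1536*128*384 on the 96*8*24 process mesh) and report the KSP, the memory high-water mark, and the full log<br>
>>>>>> # note: the memory option is spelled -memory_view<br>
>>>>>> mpiexec -n 18432 ./your_app -ksp_view_pre -memory_view -log_view<br>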
                                            >>>>>><br>
>>>>>>     Barry<br>
                                            >>>>>><br>
>>>>>> On Jul 5, 2016, at 5:23 PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                            >>>>>><br>
                                            >>>>>> Hi,<br>
                                            >>>>>><br>
>>>>>> I am using the CG ksp solver and multigrid preconditioner to solve a linear system in parallel.<br>
>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.<br>
>>>>>> The petsc options file is attached.<br>
>>>>>><br>
>>>>>> The domain is a 3d box.<br>
>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using.<br>
>>>>>> Each process has access to at least 8G of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory. So I suspect there is something wrong with the linear solver.<br>
>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp view. I am not able to reproduce the error with a smaller problem either.<br>
>>>>>> In addition, I tried to use block jacobi as the preconditioner with the same grid and same decomposition. The linear solver runs extremely slowly, but there is no memory error.<br>
>>>>>><br>
>>>>>> How can I diagnose what exactly causes the error?<br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
>>>>>> <petsc_options.txt><br>
>>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
                                            >>>>>><br>
                                            >>>>><br>
                                            >>>><br>
>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt><br>
                                            ><br>
                                            <br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>