<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Oct 6, 2016 at 7:33 PM, frank <span dir="ltr"><<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Dear Dave,</p>
    Following your advice, I solved the identical equation twice and timed
    the two solves separately. The results are below:<br>
    <br>
    Test: 1024^3 grid points<br>
    Cores#    reduction factor    MG levels#    1st solve time (s)    2nd solve time (s)<br>
    4096      64                  6 + 3         3.85                  1.75<br>
    8192      128                 5 + 3         5.52                  0.91<br>
    16384     256                 5 + 3         5.37                  0.52<br>
    32768     512                 5 + 4         3.03                  0.36<br>
    32768     64 | 8              4 | 3 | 3     2.80                  0.43<br>
    65536     1024                5 + 4         3.38                  0.59<br>
    65536     32 | 32             4 | 4 | 3     2.14                  0.22<br>
    <br>
    
    
    
    I also attached the log_view info from all the runs. Each file
    is named by the core count + reduction factor.<br>
    The ksp_view and petsc_options for the 1st run are also included.
    The others are similar; the only differences are the reduction factor
    and the MG levels.<br>
    <br>
    ** The time for the 1st solve is generally much larger than the 2nd. Is this
    because the KSP solver on the sub-communicator is set up during the
    1st solve?<br></div></blockquote><div><br></div><div>All setup is done in the first solve.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    ** The time for the 1st solve does not scale.<br>
        In practice, I am solving a variable-coefficient Poisson
    equation. I need to rebuild the matrix at every time step. Therefore,
    each step is similar to the 1st solve, which does not scale. Is there
    a way I can improve the performance?<br></div></blockquote><div><br></div><div>You could use rediscretization instead of Galerkin to produce the coarse operators.</div>
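<div><br></div><div>For a DMDA-based Poisson problem, rediscretization means letting PCMG call your assembly routine on each level's DMDA instead of forming the coarse operators with Galerkin products (MatPtAP). A rough sketch of that setup in C is below (it follows the pattern of the PETSc KSP tutorials; ComputeMatrix, ComputeRHS and the user context are placeholder names, and you would want to check how this interacts with the repartitioned DMDA that PCTelescope creates on the sub-communicator):</div><div><br></div>
<div>
PetscErrorCode ComputeMatrix(KSP ksp, Mat J, Mat Jpre, void *ctx)<br>
{<br>
  DM da;<br>
  KSPGetDM(ksp, &da);<br>
  /* assemble the variable-coefficient 7-point stencil on whatever DMDA<br>
     this level owns, using the coefficients of the current time step */<br>
  return 0;<br>
}<br>
<br>
/* in the main code, instead of assembling only the fine-grid matrix yourself: */<br>
KSPSetDM(ksp, da);<br>
KSPSetComputeRHS(ksp, ComputeRHS, &user);<br>
KSPSetComputeOperators(ksp, ComputeMatrix, &user);<br>
/* with -pc_type mg, each level is then rediscretized on its own DMDA */<br>
KSPSolve(ksp, NULL, NULL);<br>
</div>
<div><br></div><div>The trade-off is that you re-assemble small coarse matrices at each time step instead of calling MatPtAP, which removes that part of the setup cost from every step.</div>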
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    ** The 2nd solve scales, but not very well beyond 16384
    cores.<br></div></blockquote><div><br></div><div>How much scaling were you looking for? This is strong scaling, which has an Amdahl's Law limit.</div>
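<div><br></div><div>(Concretely, for a fixed problem size Amdahl's Law bounds the speedup on N processes by 1 / (s + (1 - s)/N), where s is the fraction of the work that does not parallelize, e.g. the coarse-grid solve and the global reductions in CG, so the 2nd-solve time has to flatten out at some core count.)</div>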
    It seems to me that the performance depends on the tuning of the MG
    levels on the sub-communicator(s).<br>
        Are there general strategies regarding how to distribute the
    levels, or when to use multiple sub-communicators? <br></div></blockquote><div><br></div><div>Also, you use CG/MG when FMG by itself would probably be faster. Your smoother is likely not strong enough, and you should use something like V(2,2). There is a lot of tuning that is possible, but difficult to automate.</div>
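<div><br></div><div>As a rough sketch of the options this suggests (a starting point to tune rather than a recipe; the exact smoother choice is an assumption, and the prefixed variants reuse the mg_coarse_telescope_ prefixes already present in your options file):</div><div><br></div>
<div>
-ksp_type richardson<br>
-pc_type mg<br>
-pc_mg_type full<br>
-mg_levels_ksp_type chebyshev<br>
-mg_levels_ksp_max_it 2<br>
-mg_levels_pc_type sor<br>
-mg_coarse_telescope_pc_mg_type full<br>
-mg_coarse_telescope_mg_levels_ksp_max_it 2<br>
</div>
<div><br></div><div>-pc_mg_type full turns on full multigrid (FMG), and mg_levels_ksp_max_it 2 gives two pre- and two post-smoothing sweeps per level, i.e. the V(2,2) mentioned above. Check the result against -ksp_monitor_true_residual to see whether FMG alone reaches your tolerance or you still need a few outer Krylov iterations.</div>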
<div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    Thank you.<br>
        <br>
    Regards,<br>
    Frank<div><div class="h5"><br>
    <br>
    <br>
    <br>
    <br>
    <div class="m_-3012109709631955293moz-cite-prefix">On 10/04/2016 12:56 PM, Dave May wrote:<br>
    </div>
    <blockquote type="cite"><br>
      <br>
      On Tuesday, 4 October 2016, frank <<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>> wrote:<br>
      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
        <div bgcolor="#FFFFFF" text="#000000">
          <p>Hi,</p>
          This question is a follow-up to the thread "Question about
          memory usage in Multigrid preconditioner".<br>
          I used to have the "Out of Memory (OOM)" problem when using the
          CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0
          -matptap_scalable" options did solve that problem. <br>
          <br>
          Then I tested the scalability by solving a 3D Poisson equation for 1
          step. I used one sub-communicator in all the tests. The
          differences between the petsc options in those tests are: (1) the
          pc_telescope_reduction_factor; (2) the number of multigrid
          levels in the up/down solver. The function KSPSolve is
          timed. It is kind of slow and doesn't scale at all. <br>
          <br>
          Test1: 512^3 grid points<br>
          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
          512       8                              4 / 3                             6.2466<br>
          4096      64                             5 / 3                             0.9361<br>
          32768     64                             4 / 3                             4.8914<br>
          <br>
          Test2: 1024^3 grid points<br>
          Core#     telescope_reduction_factor     MG levels# for up/down solver     Time for KSPSolve (s)<br>
          4096      64                             5 / 4                             3.4139<br>
          8192      128                            5 / 4                             2.4196<br>
          16384     32                             5 / 3                             5.4150<br>
          32768     64                             5 / 3                             5.6067<br>
          65536     128                            5 / 3                             6.5219</div>
      </blockquote>
      <div><br>
      </div>
      <div>You have to be very careful how you interpret these numbers.
        Your solver contains nested calls to KSPSolve, and unfortunately
        as a result the numbers you report include setup time. This will
        remain true even if you call KSPSetUp on the outermost KSP. </div>
      <div><br>
      </div>
      <div>Your email concerns scalability of the solver application, so
        let's focus on that issue.</div>
      <div><br>
      </div>
      <div>The only way to clearly separate setup from solve time is
        to perform two identical solves. The second solve will not
        require any setup. You should monitor the second solve via a new
        PetscLogStage.</div>
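      <div><br></div>
      <div>A minimal sketch of that timing pattern in C (assuming the KSP is called ksp
        and the vectors b, x already exist; the stage name is arbitrary and error
        checking is omitted):</div>
      <div><br></div>
      <div>
      PetscLogStage stage2;<br>
      PetscLogStageRegister("Solve 2 (no setup)", &stage2);<br>
      <br>
      KSPSolve(ksp, b, x);     /* 1st solve: includes all setup */<br>
      <br>
      VecSet(x, 0.0);          /* reset the initial guess so the two solves are identical */<br>
      PetscLogStagePush(stage2);<br>
      KSPSolve(ksp, b, x);     /* 2nd solve: pure solve time */<br>
      PetscLogStagePop();<br>
      </div>
      <div><br></div>
      <div>-log_view then reports the second solve in its own stage, separate from the
        setup-heavy first one.</div>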
      <div><br>
      </div>
      <div>This was what I did in the telescope paper. It was the only
        way to understand the setup cost (and scaling) cf the solve time
        (and scaling).</div>
      <div><br>
      </div>
      <div>Thanks</div>
      <div>  Dave</div>
      <div>
        <div>
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> I guess I didn't set
              the MG levels properly. What would be an efficient way to
              arrange the MG levels?<br>
              Also, which preconditioner on the coarse mesh of the 2nd
              communicator should I use to improve the performance? <br>
              <br>
              I attached the test code and the petsc options file for
              the 1024^3 cube with 32768 cores. <br>
              <br>
              Thank you.<br>
              <br>
              Regards,<br>
              Frank<br>
              <br>
              <br>
              <br>
              <br>
              <br>
              <br>
              <div>On 09/15/2016 03:35 AM, Dave May wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div>
                    <div>
                      <div>
                        <div>
                          <div>Hi all,<br>
                            <br>
                          </div>
                          <div>The only unexpected memory usage I can
                            see is associated with the call to
                            MatPtAP().<br>
                          </div>
                          <div>Here is something you can try
                            immediately.<br>
                          </div>
                        </div>
                        Run your code with the additional options<br>
                          -matrap 0 -matptap_scalable<br>
                        <br>
                      </div>
                      <div>I didn't realize this before, but the default
                        behaviour of MatPtAP in parallel is actually to
                        explicitly form the transpose of P (i.e.
                        assemble R = P^T) and then compute R.A.P. <br>
                        You don't want to do this. The option -matrap 0
                        resolves this issue.<br>
                      </div>
                      <div><br>
                      </div>
                      <div>The implementation of P^T.A.P has two
                        variants. <br>
                        The scalable implementation (with respect to
                        memory usage) is selected via the second option
                        -matptap_scalable.</div>
                      <div><br>
                      </div>
                      <div>Try it out - I see a significant memory
                        reduction using these options for particular
                        mesh sizes / partitions.<br>
                      </div>
                      <div><br>
                      </div>
                      I've attached a cleaned up version of the code you
                      sent me.<br>
                    </div>
                    There were a number of memory leaks and other
                    issues.<br>
                  </div>
                  <div>The main points being<br>
                  </div>
                    * You should call DMDAVecGetArrayF90() before
                  VecAssembly{Begin,End}<br>
                    * You should call PetscFinalize(), otherwise the
                  option -log_summary (-log_view) will not display
                  anything once the program has completed.<br>
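                  <div><br></div>
                  <div>For reference, a minimal C skeleton of that ordering (the same ordering
                    applies through the Fortran interface; the solver body is elided):</div>
                  <div><br></div>
                  <div>
                  #include "petscksp.h"<br>
                  int main(int argc, char **argv)<br>
                  {<br>
                    PetscInitialize(&argc, &argv, NULL, NULL);<br>
                    /* ... create the DMDA, assemble, call KSPSolve ... */<br>
                    PetscFinalize();   /* -log_summary / -log_view output is printed here */<br>
                    return 0;<br>
                  }<br>
                  </div>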
                  <div>
                    <div>
                      <div><br>
                        <br>
                      </div>
                      <div>Thanks,<br>
                      </div>
                      <div>  Dave<br>
                      </div>
                      <div>
                        <div>
                          <div><br>
                          </div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
                <div class="gmail_extra"><br>
                  <div class="gmail_quote">On 15 September 2016 at
                    08:03, Hengjie Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>
                        <br>
                        Sorry, I should have put more comments in the code
                        to explain it.<br>
                        The number of processes in each dimension is the
                        same: Px = Py = Pz = P. So is the domain size.<br>
                        So if you want to run the code for 512^3
                        grid points on 16^3 cores, you need to set "-N
                        512 -P 16" on the command line.<br>
                        I added more comments and also fixed an error in the
                        attached code. (The error only affects the
                        accuracy of the solution, not the memory usage.)
                        <br>
                        <div><br>
                          Thank you.<span><font color="#888888"><br>
                              Frank</font></span>
                          <div>
                            <div><br>
                              <br>
                              On 9/14/2016 9:05 PM, Dave May wrote:<br>
                            </div>
                          </div>
                        </div>
                        <div>
                          <div>
                            <blockquote type="cite"><br>
                              <br>
                              On Thursday, 15 September 2016, Dave May
                              <<a>dave.mayhem23@gmail.com</a>>
                              wrote:<br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                                <br>
                                On Thursday, 15 September 2016, frank
                                <<a>hengjiew@uci.edu</a>>
                                wrote:<br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <div bgcolor="#FFFFFF" text="#000000">
                                    Hi, <br>
                                    <br>
                                    I wrote a simple code to reproduce
                                    the error. I hope this can help
                                    diagnose the problem.<br>
                                    The code just solves a 3D Poisson
                                    equation. </div>
                                </blockquote>
                                <div><br>
                                </div>
                                <div>Why is the stencil width a runtime
                                  parameter?? And why is the default
                                  value 2? For 7-pnt FD Laplace, you
                                  only need a stencil width of 1. </div>
                                <div><br>
                                </div>
                                <div>Was this choice made to mimic
                                  something in the real application
                                  code?</div>
                              </blockquote>
                              <div><br>
                              </div>
                              Please ignore - I misunderstood your usage
                              of the param set by -P
                              <div>
                                <div> </div>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <div> </div>
                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                    <div bgcolor="#FFFFFF" text="#000000"><br>
                                      I ran the code on a 1024^3 mesh.
                                      The process partition is 32 * 32 *
                                      32. That's when I reproduce the
                                      OOM error. Each core has about 2G
                                      memory.<br>
                                      I also ran the code on a 512^3
                                      mesh with 16 * 16 * 16 processes.
                                      The KSP solver works fine. <br>
                                      I attached the code,
                                      ksp_view_pre's output and my petsc
                                      option file.<br>
                                      <br>
                                      Thank you.<br>
                                      Frank<br>
                                      <div><br>
                                        On 09/09/2016 06:38 PM, Hengjie
                                        Wang wrote:<br>
                                      </div>
                                      <blockquote type="cite">Hi Barry, 
                                        <div><br>
                                        </div>
                                        <div>I checked. On the
                                          supercomputer, I had the
                                          option "-ksp_view_pre" but it
                                          is not in the file I sent you. I
                                          am sorry for the confusion.</div>
                                        <div><br>
                                        </div>
                                        <div>Regards,</div>
                                        <div>Frank<span></span><br>
                                          <br>
                                          On Friday, September 9, 2016,
                                          Barry Smith <<a>bsmith@mcs.anl.gov</a>>
                                          wrote:<br>
                                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                                            > On Sep 9, 2016, at 3:11
                                            PM, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            ><br>
                                            > Hi Barry,<br>
                                            ><br>
                                            > I think the first KSP
                                            view output is from
                                            -ksp_view_pre. Before I
                                            submitted the test, I was
                                            not sure whether there would
                                            be OOM error or not. So I
                                            added both -ksp_view_pre and
                                            -ksp_view.<br>
                                            <br>
                                              But the options file you
                                            sent specifically does NOT
                                            list the -ksp_view_pre so
                                            how could it be from that?<br>
                                            <br>
                                               Sorry to be pedantic but
                                            I've spent too much time in
                                            the past trying to debug
                                            from incorrect information
                                            and want to make sure that
                                            the information I have is
                                            correct before thinking.
                                            Please recheck exactly what
                                            happened. Rerun with the
                                            exact input file you emailed
                                            if that is needed.<br>
                                            <br>
                                               Barry<br>
                                            <br>
                                            ><br>
                                            > Frank<br>
                                            ><br>
                                            ><br>
                                            > On 09/09/2016 12:38 PM,
                                            Barry Smith wrote:<br>
                                            >>   Why does
                                            ksp_view2.txt have two KSP
                                            views in it while
                                            ksp_view1.txt has only one
                                            KSPView in it? Did you run
                                            two different solves in the
                                            2 case but not the one?<br>
                                            >><br>
                                            >>   Barry<br>
                                            >><br>
                                            >><br>
                                            >><br>
                                            >>> On Sep 9, 2016,
                                            at 10:56 AM, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>><br>
                                            >>> Hi,<br>
                                            >>><br>
                                            >>> I want to
                                            continue digging into the
                                            memory problem here.<br>
                                             >>> I did find a
                                             workaround in the past,
                                             which is to use fewer cores
                                             per node so that each core
                                             has 8G memory. However this
                                             is inefficient and expensive.
                                             I hope to locate the place
                                             that uses the most memory.<br>
                                            >>><br>
                                            >>> Here is a brief
                                            summary of the tests I did
                                            in past:<br>
                                             >>>> Test1:  Mesh 1536*128*384  |  Process Mesh 48*4*12<br>
                                             >>> Maximum (over computational time) process memory:          total 7.0727e+08<br>
                                             >>> Current process memory:                                     total 7.0727e+08<br>
                                             >>> Maximum (over computational time) space PetscMalloc()ed:   total 6.3908e+11<br>
                                             >>> Current space PetscMalloc()ed:                              total 1.8275e+09<br>
                                             >>><br>
                                             >>>> Test2:  Mesh 1536*128*384  |  Process Mesh 96*8*24<br>
                                             >>> Maximum (over computational time) process memory:          total 5.9431e+09<br>
                                             >>> Current process memory:                                     total 5.9431e+09<br>
                                             >>> Maximum (over computational time) space PetscMalloc()ed:   total 5.3202e+12<br>
                                             >>> Current space PetscMalloc()ed:                              total 5.4844e+09<br>
                                             >>><br>
                                             >>>> Test3:  Mesh 3072*256*768  |  Process Mesh 96*8*24<br>
                                             >>>     OOM( Out Of Memory ) killer of the supercomputer terminated the job during "KSPSolve".<br>
                                            >>><br>
                                            >>> I attached the
                                            output of ksp_view( the
                                            third test's output is from
                                            ksp_view_pre ), memory_view
                                            and also the petsc options.<br>
                                            >>><br>
                                             >>> In all the
                                             tests, each core can access
                                             about 2G memory. In test3,
                                             there are 4223139840
                                             non-zeros in the matrix.
                                             This will consume about
                                             1.74M per process, using double
                                             precision. Considering some
                                             extra memory used to store
                                             integer indices, 2G memory
                                             should still be more than enough.<br>
                                            >>><br>
                                            >>> Is there a way
                                            to find out which part of
                                            KSPSolve uses the most
                                            memory?<br>
                                            >>> Thank you so
                                            much.<br>
                                            >>><br>
                                             >>> BTW, there are
                                             4 options that remain unused and
                                             I don't understand why they
                                             are omitted:<br>
                                             >>>
                                             -mg_coarse_telescope_mg_coarse_ksp_type
                                             value: preonly<br>
                                             >>>
                                             -mg_coarse_telescope_mg_coarse_pc_type
                                             value: bjacobi<br>
                                             >>>
                                             -mg_coarse_telescope_mg_levels_ksp_max_it
                                             value: 1<br>
                                             >>>
                                             -mg_coarse_telescope_mg_levels_ksp_type
                                             value: richardson<br>
                                            >>><br>
                                            >>><br>
                                            >>> Regards,<br>
                                            >>> Frank<br>
                                            >>><br>
                                            >>> On 07/13/2016
                                            05:47 PM, Dave May wrote:<br>
                                            >>>><br>
                                            >>>> On 14 July
                                            2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>> Hi Dave,<br>
                                            >>>><br>
                                            >>>> Sorry for
                                            the late reply.<br>
                                            >>>> Thank you
                                            so much for your detailed
                                            reply.<br>
                                            >>>><br>
                                            >>>> I have a
                                            question about the
                                            estimation of the memory
                                            usage. There are 4223139840
                                            allocated non-zeros and
                                            18432 MPI processes. Double
                                            precision is used. So the
                                            memory per process is:<br>
                                            >>>> 
                                             4223139840 * 8bytes / 18432
                                            / 1024 / 1024 = 1.74M ?<br>
                                            >>>> Did I do
                                            sth wrong here? Because this
                                            seems too small.<br>
                                            >>>><br>
                                            >>>> No - I
                                            totally f***ed it up. You
                                            are correct. That'll teach
                                            me for fumbling around with
                                            my iphone calculator and not
                                            using my brain. (Note that
                                            to convert to MB just divide
                                            by 1e6, not 1024^2 -
                                            although I apparently cannot
                                            convert between units
                                            correctly....)<br>
                                            >>>><br>
                                            >>>> From the
                                            PETSc objects associated
                                            with the solver, It looks
                                            like it _should_ run with
                                            2GB per MPI rank. Sorry for
                                            my mistake. Possibilities
                                            are: somewhere in your usage
                                            of PETSc you've introduced a
                                            memory leak; PETSc is doing
                                            a huge over allocation (e.g.
                                            as per our discussion of
                                            MatPtAP); or in your
                                            application code there are
                                            other objects you have
                                            forgotten to log the memory
                                            for.<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> I am
                                            running this job on
                                            Bluewater<br>
                                            >>>> I am using
                                            the 7 points FD stencil in
                                            3D.<br>
                                            >>>><br>
                                            >>>> I thought
                                            so on both counts.<br>
                                            >>>><br>
                                             >>>> I apologize
                                             that I made a stupid mistake
                                             in computing the memory per
                                             core. With my settings,
                                             each core can access only 2G
                                             memory on average instead of
                                             the 8G I mentioned in the
                                             previous email. I re-ran the
                                             job with 8G memory per core
                                             on average and there is no
                                             "Out Of Memory" error. I
                                             will do more tests to see if
                                             there is still some memory
                                             issue.<br>
                                            >>>><br>
                                            >>>> Ok. I'd
                                            still like to know where the
                                            memory was being used since
                                            my estimates were off.<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> Thanks,<br>
                                            >>>>   Dave<br>
                                            >>>><br>
                                            >>>> Regards,<br>
                                            >>>> Frank<br>
                                            >>>><br>
                                            >>>><br>
                                            >>>><br>
                                            >>>> On
                                            07/11/2016 01:18 PM, Dave
                                            May wrote:<br>
                                            >>>>> Hi
                                            Frank,<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> On 11
                                            July 2016 at 19:14, frank
                                            <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>>> Hi
                                            Dave,<br>
                                            >>>>><br>
                                            >>>>> I
                                            re-run the test using
                                            bjacobi as the
                                            preconditioner on the coarse
                                            mesh of telescope. The Grid
                                            is 3072*256*768 and process
                                            mesh is 96*8*24. The petsc
                                            option file is attached.<br>
                                            >>>>> I still
                                            got the "Out Of Memory"
                                            error. The error occurred
                                            before the linear solver
                                            finished one step. So I
                                            don't have the full info
                                            from ksp_view. The info from
                                            ksp_view_pre is attached.<br>
                                            >>>>><br>
                                            >>>>> Okay -
                                            that is essentially useless
                                            (sorry)<br>
                                            >>>>><br>
                                            >>>>> It
                                            seems to me that the error
                                            occurred when the
                                            decomposition was going to
                                            be changed.<br>
                                            >>>>><br>
                                            >>>>> Based
                                            on what information?<br>
                                            >>>>> Running
                                            with -info would give us
                                            more clues, but will create
                                            a ton of output.<br>
                                            >>>>> Please
                                            try running the case which
                                            failed with -info<br>
                                            >>>>>  I had
                                            another test with a grid of
                                            1536*128*384 and the same
                                            process mesh as above. There
                                            was no error. The ksp_view
                                            info is attached for
                                            comparison.<br>
                                            >>>>> Thank
                                            you.<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> [3]
                                            Here is my crude estimate of
                                            your memory usage.<br>
                                            >>>>> I'll
                                            target the biggest memory
                                            hogs only to get an order of
                                            magnitude estimate<br>
                                            >>>>><br>
                                            >>>>> * The
                                            Fine grid operator contains
                                            4223139840 non-zeros -->
                                            1.8 GB per MPI rank assuming
                                            double precision.<br>
                                            >>>>> The
                                            indices for the AIJ could
                                            amount to another 0.3 GB
                                            (assuming 32 bit integers)<br>
                                            >>>>><br>
                                            >>>>> * You
                                            use 5 levels of coarsening,
                                            so the other operators
                                            should represent
                                            (collectively)<br>
                                            >>>>> 2.1 / 8
                                            + 2.1/8^2 + 2.1/8^3 +
                                            2.1/8^4  ~ 300 MB per MPI
                                            rank on the communicator
                                            with 18432 ranks.<br>
                                            >>>>> The
                                            coarse grid should consume ~
                                            0.5 MB per MPI rank on the
                                            communicator with 18432
                                            ranks.<br>
                                            >>>>><br>
                                            >>>>> * You
                                            use a reduction factor of
                                            64, making the new
                                            communicator with 288 MPI
                                            ranks.<br>
                                            >>>>>
                                            PCTelescope will first
                                            gather a temporary matrix
                                            associated with your coarse
                                            level operator assuming a
                                            comm size of 288 living on
                                            the comm with size 18432.<br>
                                            >>>>> This
                                            matrix will require
                                            approximately 0.5 * 64 = 32
                                            MB per core on the 288
                                            ranks.<br>
                                            >>>>> This
                                            matrix is then used to form
                                            a new MPIAIJ matrix on the
                                            subcomm, thus require
                                            another 32 MB per rank.<br>
                                            >>>>> The
                                            temporary matrix is now
                                            destroyed.<br>
                                            >>>>><br>
                                            >>>>> *
                                            Because a DMDA is detected,
                                            a permutation matrix is
                                            assembled.<br>
                                            >>>>> This
                                            requires 2 doubles per point
                                            in the DMDA.<br>
                                            >>>>> Your
                                            coarse DMDA contains 92 x 16
                                            x 48 points.<br>
                                            >>>>> Thus
                                            the permutation matrix will
                                            require < 1 MB per MPI
                                            rank on the sub-comm.<br>
                                            >>>>><br>
                                            >>>>> *
                                            Lastly, the matrix is
                                            permuted. This uses
                                            MatPtAP(), but the resulting
                                            operator will have the same
                                            memory footprint as the
                                            unpermuted matrix (32 MB).
                                            At any stage in PCTelescope,
                                            only 2 operators of size 32
                                            MB are held in memory when
                                            the DMDA is provided.<br>
                                            >>>>><br>
                                            >>>>> From my
                                            rough estimates, the worst
                                            case memory foot print for
                                            any given core, given your
                                            options is approximately<br>
                                            >>>>> 2100 MB
                                            + 300 MB + 32 MB + 32 MB + 1
                                            MB  = 2465 MB<br>
                                            >>>>> This is
                                            way below 8 GB.<br>
                                            >>>>><br>
                                            >>>>> Note
                                            this estimate completely
                                            ignores:<br>
                                            >>>>> (1) the
                                            memory required for the
                                            restriction operator,<br>
                                            >>>>> (2) the
                                            potential growth in the
                                            number of non-zeros per row
                                            due to Galerkin coarsening
                                            (I wished -ksp_view_pre
                                            reported the output from
                                            MatView so we could see the
                                            number of non-zeros required
                                            by the coarse level
                                            operators)<br>
                                            >>>>> (3) all
                                            temporary vectors required
                                            by the CG solver, and those
                                            required by the smoothers.<br>
                                            >>>>> (4)
                                            internal memory allocated by
                                            MatPtAP<br>
                                            >>>>> (5)
                                            memory associated with IS's
                                            used within PCTelescope<br>
                                            >>>>><br>
                                            >>>>> So
                                            either I am completely off
                                            in my estimates, or you have
                                            not carefully estimated the
                                            memory usage of your
                                            application code. Hopefully
                                            others might examine/correct
                                            my rough estimates<br>
                                            >>>>><br>
                                            >>>>> Since I
                                            don't have your code I
                                            cannot access the latter.<br>
                                            >>>>> Since I
                                            don't have access to the
                                            same machine you are running
                                            on, I think we need to take
                                            a step back.<br>
                                            >>>>><br>
                                            >>>>> [1]
                                            What machine are you running
                                            on? Send me a URL if its
                                            available<br>
                                            >>>>><br>
                                            >>>>> [2]
                                            What discretization are you
                                            using? (I am guessing a
                                            scalar 7 point FD stencil)<br>
                                            >>>>> If it's
                                            a 7 point FD stencil, we
                                            should be able to examine
                                            the memory usage of your
                                            solver configuration using a
                                            standard, light weight
                                            existing PETSc example, run
                                            on your machine at the same
                                            scale.<br>
                                            >>>>> This
                                            would hopefully enable us to
                                            correctly evaluate the
                                            actual memory usage required
                                            by the solver configuration
                                            you are using.<br>
                                            >>>>><br>
                                            >>>>> Thanks,<br>
                                            >>>>>   Dave<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> Frank<br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>><br>
                                            >>>>> On
                                            07/08/2016 10:38 PM, Dave
                                            May wrote:<br>
                                            >>>>>><br>
                                            >>>>>> On
                                            Saturday, 9 July 2016, frank
                                            <<a>hengjiew@uci.edu</a>>
                                            wrote:<br>
                                            >>>>>> Hi
                                            Barry and Dave,<br>
                                            >>>>>><br>
                                            >>>>>>
                                            Thank both of you for the
                                            advice.<br>
                                            >>>>>><br>
                                            >>>>>>
                                            @Barry<br>
                                            >>>>>> I
                                            made a mistake in the file
                                            names in last email. I
                                            attached the correct files
                                            this time.<br>
                                            >>>>>> For
                                            all the three tests,
                                            'Telescope' is used as the
                                            coarse preconditioner.<br>
                                            >>>>>><br>
                                             >>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12<br>
                                             >>>>>> Part of the memory usage:  Vector   125   124   3971904   0.<br>
                                             >>>>>>                            Matrix   101   101   9462372   0<br>
                                             >>>>>><br>
                                             >>>>>> == Test2:   Grid: 1536*128*384,   Process Mesh: 96*8*24<br>
                                             >>>>>> Part of the memory usage:  Vector   125   124    681672   0.<br>
                                             >>>>>>                            Matrix   101   101   1462180   0.<br>
                                            >>>>>><br>
                                            >>>>>> In
                                            theory, the memory usage in
                                            Test1 should be 8 times of
                                            Test2. In my case, it is
                                            about 6 times.<br>
                                            >>>>>><br>
                                            >>>>>> ==
                                            Test3: Grid: 3072*256*768, 
                                             Process Mesh: 96*8*24.
                                            Sub-domain per process:
                                            32*32*32<br>
                                            >>>>>>
                                            Here I get the out of memory
                                            error.<br>
                                            >>>>>><br>
                                            >>>>>> I
                                            tried to use -mg_coarse
                                            jacobi. In this way, I don't
                                            need to set
                                            -mg_coarse_ksp_type and
                                            -mg_coarse_pc_type
                                            explicitly, right?<br>
                                            >>>>>> The
                                            linear solver didn't work in
                                            this case. Petsc output some
                                            errors.<br>
                                            >>>>>><br>
                                            >>>>>>
                                            @Dave<br>
                                            >>>>>> In
                                            test3, I use only one
                                            instance of 'Telescope'. On
                                            the coarse mesh of
                                            'Telescope', I used LU as
                                            the preconditioner instead
                                            of SVD.<br>
                                             >>>>>> If
                                             I set the levels correctly,
                                             then on the last coarse mesh
                                             of MG where it calls
                                             'Telescope', the sub-domain
                                             per process is 2*2*2.<br>
                                            >>>>>> On
                                            the last coarse mesh of
                                            'Telescope', there is only
                                            one grid point per process.<br>
                                            >>>>>> I
                                            still got the OOM error. The
                                            detailed petsc option file
                                            is attached.<br>
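>>>>>> In case it is useful, the structure of that solver is roughly as sketched below; the level counts and reduction factor here are placeholders, and the attached option file has the actual values:<br>
>>>>>> # rough structure of the Test3 solver (numbers are placeholders)<br>
>>>>>> -ksp_type cg<br>
>>>>>> -pc_type mg<br>
>>>>>> # outer MG: coarsen the 32*32*32 sub-domain per process down to 2*2*2<br>
>>>>>> -pc_mg_levels 5<br>
>>>>>> # on the outer coarse level, gather the problem onto a sub-communicator<br>
>>>>>> -mg_coarse_pc_type telescope<br>
>>>>>> -mg_coarse_pc_telescope_reduction_factor 64<br>
>>>>>> # multigrid again on the sub-communicator, down to one grid point per process<br>
>>>>>> -mg_coarse_telescope_pc_type mg<br>
>>>>>> -mg_coarse_telescope_pc_mg_levels 4<br>
>>>>>> # LU instead of SVD on the last coarse mesh of 'Telescope'<br>
>>>>>> -mg_coarse_telescope_mg_coarse_pc_type lu<br>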
                                            >>>>>><br>
>>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.<br>
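>>>>>> (Assuming the nested option prefixes sketched above, that would be a one-line change along the lines of:)<br>
>>>>>> # use block Jacobi instead of the direct solve on the innermost coarse mesh<br>
>>>>>> -mg_coarse_telescope_mg_coarse_pc_type bjacobi<br>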
                                            >>>>>><br>
>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.<br>
                                            >>>>>><br>
>>>>>> Thanks<br>
>>>>>>   Dave<br>
>>>>>><br>
>>>>>><br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
                                            >>>>>><br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:<br>
>>>>>> On Jul 6, 2016, at 4:19 PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                            >>>>>><br>
>>>>>> Hi Barry,<br>
>>>>>><br>
>>>>>> Thank you for your advice.<br>
>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.<br>
>>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.<br>
>>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.<br>
>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.<br>
                                            >>>>>><br>
>>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.<br>
>>>>>>    Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test<br>
>>>>>>              Vector   384        383      8,193,712     0.<br>
>>>>>>              Matrix   103        103     11,508,688     0.<br>
>>>>>> to the 3rd test<br>
>>>>>>              Vector   384        383      1,590,520     0.<br>
>>>>>>              Matrix   103        103      3,508,664     0.<br>
>>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of about 5 for the vectors and about 3 for the matrices.<br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.<br>
>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case petsc treated it as an unused option, and it output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?<br>
>>>>>>    Sorry, my mistake, the option is -memory_view<br>
                                            >>>>>><br>
>>>>>>    Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.<br>
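>>>>>> For example (use whatever launcher your cluster provides; ./your_app stands in for your executable, and -mg_coarse jacobi is written out as the explicit option):<br>
>>>>>> # case 1 (3072*256*768 grid) on the 96*8*24 = 18432-rank mesh, together with the rest of your existing options file<br>
>>>>>> mpiexec -n 18432 ./your_app -ksp_type cg -pc_type mg -mg_coarse_pc_type jacobi -ksp_max_it 1 -memory_view -log_view<br>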
                                            >>>>>><br>
>>>>>>     Barry<br>
                                            >>>>>><br>
                                            >>>>>><br>
                                            >>>>>><br>
>>>>>> In both tests the memory usage is not large.<br>
>>>>>><br>
>>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test.<br>
>>>>>> Is there a way to show how much memory it allocated?<br>
>>>>>><br>
>>>>>> Frank<br>
                                            >>>>>><br>
>>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:<br>
>>>>>>    Frank,<br>
>>>>>><br>
>>>>>>      You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.<br>
>>>>>><br>
>>>>>>       Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.<br>
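>>>>>> For example, something along these lines (launcher and executable name are placeholders):<br>
>>>>>> # run the case that fits (1536*128*384 on the 96*8*24 process mesh) and report the KSP, the memory high-water mark, and the full log<br>
>>>>>> # note: the memory option is spelled -memory_view<br>
>>>>>> mpiexec -n 18432 ./your_app -ksp_view_pre -memory_view -log_view<br>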
                                            >>>>>><br>
>>>>>>     Barry<br>
                                            >>>>>><br>
>>>>>> On Jul 5, 2016, at 5:23 PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>
                                            >>>>>><br>
                                            >>>>>> Hi,<br>
                                            >>>>>><br>
>>>>>> I am using the CG ksp solver and multigrid preconditioner to solve a linear system in parallel.<br>
>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.<br>
>>>>>> The petsc options file is attached.<br>
>>>>>><br>
>>>>>> The domain is a 3d box.<br>
>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using.<br>
>>>>>> Each process has access to at least 8G of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory. So I suspect there is something wrong with the linear solver.<br>
>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp view. I am not able to reproduce the error with a smaller problem either.<br>
>>>>>> In addition, I tried to use block jacobi as the preconditioner with the same grid and same decomposition. The linear solver runs extremely slowly, but there is no memory error.<br>
>>>>>><br>
>>>>>> How can I diagnose what exactly causes the error?<br>
>>>>>> Thank you so much.<br>
>>>>>><br>
>>>>>> Frank<br>
>>>>>> <petsc_options.txt><br>
>>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt><br>
                                            >>>>>><br>
                                            >>>>><br>
                                            >>>><br>
>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt><br>
                                            ><br>
                                            <br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>