<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, Oct 4, 2016 at 3:26 PM, frank <span dir="ltr"><<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000"><span class="">

    <br>

    <div class="m_-6949234570636630927moz-cite-prefix">On 10/04/2016 01:20 PM, Matthew Knepley

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">On Tue, Oct 4, 2016 at 3:09 PM, frank

            <span dir="ltr"><<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>></span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div bgcolor="#FFFFFF" text="#000000">

                <div class="m_-6949234570636630927m_-8767381834923078048moz-cite-prefix">Hi

                  Dave,<br>

                  <br>

                  Thank you for the reply.<br>

                  What do you mean by the "nested calls to KSPSolve"?<br>

                </div>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>KSPSolve is called again after redistributing the

              computation.</div>

          </div>

        </div>

      </div>

    </blockquote>

    <br></span>

    I am still confused. There is only one KSPSolve in my code. <br></div></blockquote><div><br></div><div>That's right. You call it once, but it is called internally again.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

    Do you mean KSPSolve is called again in the sub-communicator? If

    that's the case, even if I put two identical KSPSolve calls in the code,

    the sub-communicator is still going to call KSPSolve, right?</div></blockquote><div><br></div><div>Yes.</div><div><br></div><div>   Matt</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div><div class="h5">

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div bgcolor="#FFFFFF" text="#000000">

                <div class="m_-6949234570636630927m_-8767381834923078048moz-cite-prefix"> I

                  tried to call KSPSolve twice, but the second solve

                  converged in 0 iterations. KSPSolve seems to remember

                  the solution. How can I force both solves to start from

                  the same initial guess?<br>

                </div>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>Did you zero the solution vector between solves?

              VecSet(x, 0.0);</div>
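            <div>To illustrate why a second solve can report 0 iterations, here is a minimal plain-Python sketch (not PETSc). The helper <code>jacobi_solve</code> and the 2x2 system are made up for illustration, and the sketch assumes the solver takes whatever is already stored in the solution vector as its initial guess. Under that assumption, a vector that still holds the previous answer already satisfies the tolerance; zeroing it first, which is what VecSet(x, 0.0) does, restores the original starting point.</div>

```python
def jacobi_solve(A, b, x, tol=1e-10, max_it=1000):
    """Jacobi iteration that starts from the values already stored in x.

    Updates x in place and returns the number of iterations performed.
    """
    n = len(b)

    def residual_norm():
        # Max-norm of b - A x for the current contents of x.
        return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
                   for i in range(n))

    its = 0
    while residual_norm() > tol and its != max_it:
        new_x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        x[:] = new_x
        its += 1
    return its


A = [[4.0, 1.0], [1.0, 3.0]]     # small diagonally dominant system (hypothetical)
b = [1.0, 2.0]
x = [0.0, 0.0]

first = jacobi_solve(A, b, x)    # starts from zero
second = jacobi_solve(A, b, x)   # starts from the converged x: nothing left to do
x[:] = [0.0, 0.0]                # analogue of VecSet(x, 0.0)
third = jacobi_solve(A, b, x)    # same starting point as the first solve

print(first, second, third)
```

            <div>In this sketch the second call returns 0 iterations and the third repeats the work of the first, which is the behaviour you want when timing two identical solves.</div>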

            <div><br>

            </div>

            <div>  Matt</div>

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div bgcolor="#FFFFFF" text="#000000">

                <div class="m_-6949234570636630927m_-8767381834923078048moz-cite-prefix">

                  Thank you.<span class="m_-6949234570636630927HOEnZb"><font color="#888888"><br>

                      <br>

                      Frank</font></span>

                  <div>

                    <div class="m_-6949234570636630927h5"><br>

                      <br>

                      <br>

                      On 10/04/2016 12:56 PM, Dave May wrote:<br>

                    </div>

                  </div>

                </div>

                <div>

                  <div class="m_-6949234570636630927h5">

                    <blockquote type="cite"><br>

                      <br>

                      On Tuesday, 4 October 2016, frank <<a href="mailto:hengjiew@uci.edu" target="_blank">hengjiew@uci.edu</a>>

                      wrote:<br>

                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                        <div bgcolor="#FFFFFF" text="#000000">

                          <p>Hi,</p>

                          This question is a follow-up to the thread

                          "Question about memory usage in Multigrid

                          preconditioner".<br>

                          I used to hit the "Out of Memory (OOM)"

                          problem when using the CG+Telescope MG solver

                          with 32768 cores. Adding the options "-matrap 0

                          -matptap_scalable" did solve that

                          problem. <br>

                          <br>

                          Then I tested the scalability by solving a 3D

                          Poisson equation for 1 step. I used one

                          sub-communicator in all the tests. The

                          PETSc options differ between the

                          tests in two ways: (1) the

                          pc_telescope_reduction_factor; (2) the number of

                          multigrid levels in the up/down solver. The

                          function "ksp_solve" is timed. It is rather

                          slow and doesn't scale at all. <br>

                          <br>

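                          Test1: 512^3 grid points<br>
                          <table>
                            <tr><th>Cores</th><th>telescope_reduction_factor</th><th>MG levels for up/down solver</th><th>Time for KSPSolve (s)</th></tr>
                            <tr><td>512</td><td>8</td><td>4 / 3</td><td>6.2466</td></tr>
                            <tr><td>4096</td><td>64</td><td>5 / 3</td><td>0.9361</td></tr>
                            <tr><td>32768</td><td>64</td><td>4 / 3</td><td>4.8914</td></tr>
                          </table>
                          <br>
                          Test2: 1024^3 grid points<br>
                          <table>
                            <tr><th>Cores</th><th>telescope_reduction_factor</th><th>MG levels for up/down solver</th><th>Time for KSPSolve (s)</th></tr>
                            <tr><td>4096</td><td>64</td><td>5 / 4</td><td>3.4139</td></tr>
                            <tr><td>8192</td><td>128</td><td>5 / 4</td><td>2.4196</td></tr>
                            <tr><td>16384</td><td>32</td><td>5 / 3</td><td>5.4150</td></tr>
                            <tr><td>32768</td><td>64</td><td>5 / 3</td><td>5.6067</td></tr>
                            <tr><td>65536</td><td>128</td><td>5 / 3</td><td>6.5219</td></tr>
                          </table></div>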
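                          <div>As a rough aid to reading the timings reported above, the following Python sketch turns them into strong-scaling speedup and parallel efficiency relative to the smallest run of each test. The function name <code>strong_scaling</code> is made up for illustration, and the sub-communicator size is computed as cores divided by the reduction factor, which is an assumption about how the reduction factor is applied. The timings are used exactly as reported, so they carry whatever each run included.</div>

```python
# (cores, reduction_factor, seconds) as reported in the message above.
test1 = [(512, 8, 6.2466), (4096, 64, 0.9361), (32768, 64, 4.8914)]
test2 = [(4096, 64, 3.4139), (8192, 128, 2.4196), (16384, 32, 5.4150),
         (32768, 64, 5.6067), (65536, 128, 6.5219)]


def strong_scaling(rows):
    """Return (cores, subcomm_size, speedup, efficiency) relative to the first row."""
    base_cores, _, base_time = rows[0]
    out = []
    for cores, factor, seconds in rows:
        speedup = base_time / seconds
        ideal = cores / base_cores          # perfect strong scaling
        # Assumption: the sub-communicator holds cores / reduction_factor ranks.
        out.append((cores, cores // factor, speedup, speedup / ideal))
    return out


for name, rows in (("Test1", test1), ("Test2", test2)):
    print(name)
    for cores, sub, speedup, eff in strong_scaling(rows):
        print(f"  {cores:6d} cores, sub-communicator of {sub:4d}: "
              f"speedup {speedup:5.2f}, efficiency {eff:6.1%}")
```

                          <div>Efficiency here is speedup divided by the increase in core count, so a value near 100% means the time dropped in proportion to the added cores.</div>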

                      </blockquote>

                      <div><br>

                      </div>

                      <div>You have to be very careful how you interpret

                        these numbers. Your solver contains nested calls

                        to KSPSolve, and unfortunately as a result the

                        numbers you report include setup time. This will

                        remain true even if you call KSPSetUp on the

                        outermost KSP. </div>

                      <div><br>

                      </div>

                      <div>Your email concerns scalability of the solver

                        application, so let's focus on that issue.</div>

                      <div><br>

                      </div>

                      <div>The only way to clearly separate setup from

                        solve time is to perform two identical solves.

                        The second solve will not require any setup. You

                        should monitor the second solve via a new

                        PetscStage.</div>
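                      <div>The idea of timing two identical solves can be sketched in plain Python (not PETSc). The toy class <code>LazySolver</code> and the helper <code>timed_solve</code> are hypothetical; the class performs its setup lazily inside the first solve, which mimics setup being hidden inside a solve call. The first timing then covers setup plus solve, the second covers the solve only, and their difference estimates the setup cost.</div>

```python
import time


class LazySolver:
    """Toy diagonal solver whose setup runs lazily inside the first solve."""

    def __init__(self, diag):
        self.diag = diag
        self.inv_diag = None      # built on first use
        self.setup_calls = 0

    def _setup(self):
        time.sleep(0.05)          # stand-in for an expensive setup phase
        self.inv_diag = [1.0 / d for d in self.diag]
        self.setup_calls += 1

    def solve(self, b):
        if self.inv_diag is None:
            self._setup()         # hidden inside the first solve only
        return [bi * di for bi, di in zip(b, self.inv_diag)]


def timed_solve(solver, b):
    """Run one solve and return (solution, elapsed_seconds)."""
    start = time.perf_counter()
    x = solver.solve(b)
    return x, time.perf_counter() - start


solver = LazySolver([2.0, 4.0, 8.0])
b = [2.0, 4.0, 8.0]
x1, t_first = timed_solve(solver, b)    # setup + solve
x2, t_second = timed_solve(solver, b)   # solve only
print(f"first: {t_first:.4f}s  second: {t_second:.4f}s  "
      f"estimated setup: {t_first - t_second:.4f}s")
```

                      <div>In a real PETSc code the second solve would be wrapped in its own log stage (PetscLogStageRegister, PetscLogStagePush, PetscLogStagePop) so that -log_view reports it separately.</div>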

                      <div><br>

                      </div>

                      <div>This was what I did in the telescope paper.

                        It was the only way to understand the setup cost

                        (and its scaling) cf. the solve time (and its scaling).</div>

                      <div><br>

                      </div>

                      <div>Thanks</div>

                      <div>  Dave</div>

                      <div>

                        <div>

                          <div><br>

                          </div>

                          <div> </div>

                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                            <div bgcolor="#FFFFFF" text="#000000"> I

                              guess I didn't set the MG levels properly.

                              What would be an efficient way to arrange

                              the MG levels?<br>

                              Also, which preconditioner should I use on the

                              coarse mesh of the 2nd communicator

                              to improve the performance? <br>

                              <br>

                              I attached the test code and the petsc

                              options file for the 1024^3 cube with

                              32768 cores. <br>

                              <br>

                              Thank you.<br>

                              <br>

                              Regards,<br>

                              Frank<br>

                              <br>

                              <br>

                              <br>

                              <br>

                              <br>

                              <br>

                              <div>On 09/15/2016 03:35 AM, Dave May

                                wrote:<br>

                              </div>

                              <blockquote type="cite">

                                <div dir="ltr">

                                  <div>

                                    <div>

                                      <div>

                                        <div>

                                          <div>Hi all,<br>

                                            <br>

                                          </div>

                                          <div>The only unexpected

                                            memory usage I can see is

                                            associated with the call to

                                            MatPtAP().<br>

                                          </div>

                                          <div>Here is something you can

                                            try immediately.<br>

                                          </div>

                                        </div>

                                        Run your code with the

                                        additional options<br>

                                          -matrap 0 -matptap_scalable<br>

                                        <br>

                                      </div>

                                      <div>I didn't realize this before,

                                        but the default behaviour of

                                        MatPtAP in parallel is actually

                                        to explicitly form the

                                        transpose of P (e.g. assemble R

                                        = P^T) and then compute R.A.P. <br>

                                        You don't want to do this. The

                                        option -matrap 0 resolves this

                                        issue.<br>

                                      </div>

                                      <div><br>

                                      </div>

                                      <div>The implementation of P^T.A.P

                                        has two variants. <br>

                                        The scalable implementation

                                        (with respect to memory usage)

                                        is selected via the second

                                        option -matptap_scalable.</div>

                                      <div><br>

                                      </div>

                                      <div>Try it out - I see a

                                        significant memory reduction

                                        using these options for

                                        particular mesh sizes /

                                        partitions.<br>

                                      </div>

                                      <div><br>

                                      </div>

                                      I've attached a cleaned up version

                                      of the code you sent me.<br>

                                    </div>

                                    There were a number of memory leaks

                                    and other issues.<br>

                                  </div>

                                  <div>The main points being<br>

                                  </div>

                                    * You should call

                                  DMDAVecGetArrayF90() before

                                  VecAssembly{Begin,End}<br>

                                    * You should call PetscFinalize(),

                                  otherwise the option -log_summary

                                  (-log_view) will not display anything

                                  once the program has completed.<br>

                                  <div>

                                    <div>

                                      <div><br>

                                        <br>

                                      </div>

                                      <div>Thanks,<br>

                                      </div>

                                      <div>  Dave<br>

                                      </div>

                                      <div>

                                        <div>

                                          <div><br>

                                          </div>

                                        </div>

                                      </div>

                                    </div>

                                  </div>

                                </div>

                                <div class="gmail_extra"><br>

                                  <div class="gmail_quote">On 15

                                    September 2016 at 08:03, Hengjie

                                    Wang <span dir="ltr"><<a>hengjiew@uci.edu</a>></span>

                                    wrote:<br>

                                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                      <div bgcolor="#FFFFFF" text="#000000"> Hi Dave,<br>

                                        <br>

                                        Sorry, I should have put more

                                        comments in to explain the code.  <br>

                                        The number of processes in each

                                        dimension is the same: Px =

                                        Py = Pz = P. The domain size is likewise the same in each dimension.<br>

                                        So if you want to run the

                                        code for 512^3 grid points on

                                        16^3 cores, you need to set "-N

                                        512 -P 16" on the command line.<br>

                                        I added more comments and also fixed

                                        an error in the attached code.

                                        (The error only affects the

                                        accuracy of the solution, not the

                                        memory usage.) <br>
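                                        <div>For reference, the arithmetic behind the -N and -P options can be sketched in Python. The helper <code>decomposition</code> is hypothetical and assumes N divides evenly by P; the actual distribution of points in the attached code may differ.</div>

```python
def decomposition(n, p):
    """Return (total_cores, points_per_core_per_dim, points_per_core)
    for an n^3 grid on a p x p x p process layout."""
    if n % p != 0:
        raise ValueError("this sketch assumes n is divisible by p")
    local = n // p                 # grid points per core along each dimension
    return p ** 3, local, local ** 3


for n, p in ((512, 16), (1024, 32)):
    cores, local, points = decomposition(n, p)
    print(f"-N {n} -P {p}: {cores} cores, {local}^3 = {points} points per core")
```

                                        <div>Under that assumption, "-N 512 -P 16" and "-N 1024 -P 32" give each core a subdomain of the same size.</div>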

                                        <div><br>

                                          Thank you.<span><font color="#888888"><br>

                                              Frank</font></span>

                                          <div>

                                            <div><br>

                                              <br>

                                              On 9/14/2016 9:05 PM, Dave

                                              May wrote:<br>

                                            </div>

                                          </div>

                                        </div>

                                        <div>

                                          <div>

                                            <blockquote type="cite"><br>

                                              <br>

                                              On Thursday, 15 September

                                              2016, Dave May <<a>dave.mayhem23@gmail.com</a>>

                                              wrote:<br>

                                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

                                                <br>

                                                On Thursday, 15

                                                September 2016, frank

                                                <<a>hengjiew@uci.edu</a>>

                                                wrote:<br>

                                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                                  <div bgcolor="#FFFFFF" text="#000000"> Hi,

                                                    <br>

                                                    <br>

                                                    I wrote a simple

                                                    code to reproduce

                                                    the error. I hope

                                                    this can help to

                                                    diagnose the

                                                    problem.<br>

                                                    The code just solves

                                                    a 3D Poisson

                                                    equation. </div>

                                                </blockquote>

                                                <div><br>

                                                </div>

                                                <div>Why is the stencil

                                                  width a runtime

                                                  parameter? And why is

                                                  the default value 2?

                                                  For a 7-point FD Laplacian,

                                                  you only need

                                                  a stencil width of 1. </div>

                                                <div><br>

                                                </div>

                                                <div>Was this choice

                                                  made to mimic

                                                  something in the

                                                  real application code?</div>

                                              </blockquote>

                                              <div><br>

                                              </div>

                                              Please ignore - I

                                              misunderstood your usage

                                              of the param set by -P

                                              <div>

                                                <div> </div>

                                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                                  <div> </div>

                                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                                                    <div bgcolor="#FFFFFF" text="#000000"><br>

                                                      I ran the code on

                                                      a 1024^3 mesh. The

                                                      process partition

                                                      is 32 * 32 * 32.

                                                      That's when I

                                                      reproduced the OOM

                                                      error. Each core

                                                      has about 2 GB of

                                                      memory.<br>

                                                      I also ran the

                                                      code on a 512^3

                                                      mesh with 16 * 16

                                                      * 16 processes.

                                                      The KSP solver

                                                      works fine. <br>

                                                      I attached the

                                                      code,

                                                      ksp_view_pre's

                                                      output and my

                                                      PETSc options file.<br>

                                                      <br>

                                                      Thank you.<br>

                                                      Frank<br>

                                                      <div><br>

                                                        On 09/09/2016

                                                        06:38 PM,

                                                        Hengjie Wang

                                                        wrote:<br>

                                                      </div>

                                                      <blockquote type="cite">Hi

                                                        Barry, 

                                                        <div><br>

                                                        </div>

                                                        <div>I checked.

                                                          On the

                                                          supercomputer,

                                                          I had the

                                                          option

                                                          "-ksp_view_pre"

                                                          but it is not

                                                          in the file I sent

                                                          you. I am

                                                          sorry for the

                                                          confusion.</div>

                                                        <div><br>

                                                        </div>

                                                        <div>Regards,</div>

                                                        <div>Frank<span></span><br>

                                                          <br>

                                                          On Friday,

                                                          September 9,

                                                          2016, Barry

                                                          Smith <<a>bsmith@mcs.anl.gov</a>>

                                                          wrote:<br>

                                                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

                                                          > On Sep 9,

                                                          2016, at 3:11

                                                          PM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>

                                                          ><br>

                                                          > Hi Barry,<br>

                                                          ><br>

                                                          > I think

                                                          the first KSP

                                                          view output is

                                                          from

                                                          -ksp_view_pre.

                                                          Before I

                                                          submitted the

                                                          test, I was

                                                          not sure

                                                          whether there

                                                          would be OOM

                                                          error or not.

                                                          So I added

                                                          both

                                                          -ksp_view_pre

                                                          and -ksp_view.<br>

                                                          <br>

                                                            But the

                                                          options file

                                                          you sent

                                                          specifically

                                                          does NOT list

                                                          the

                                                          -ksp_view_pre

                                                          so how could

                                                          it be from

                                                          that?<br>

                                                          <br>

                                                             Sorry to be

                                                          pedantic but

                                                          I've spent too

                                                          much time in

                                                          the past

                                                          trying to

                                                          debug from

                                                          incorrect

                                                          information

                                                          and want to

                                                          make sure that

                                                          the

                                                          information I

                                                          have is

                                                          correct before

                                                          thinking.

                                                          Please recheck

                                                          exactly what

                                                          happened.

                                                          Rerun with the

                                                          exact input

                                                          file you

                                                          emailed if

                                                          that is

                                                          needed.<br>

                                                          <br>

                                                             Barry<br>

                                                          <br>

                                                          ><br>

                                                          > Frank<br>

                                                          ><br>

                                                          ><br>

                                                          > On

                                                          09/09/2016

                                                          12:38 PM,

                                                          Barry Smith

                                                          wrote:<br>

                                                          >>   Why

                                                          does

                                                          ksp_view2.txt

                                                          have two KSP

                                                          views in it

                                                          while

                                                          ksp_view1.txt

                                                          has only one

                                                          KSPView in it?

                                                          Did you run

                                                          two different

                                                          solves in the

                                                          2 case but not

                                                          the one?<br>

                                                          >><br>

                                                          >> 

                                                           Barry<br>

                                                          >><br>

                                                          >><br>

                                                          >><br>

                                                          >>>

                                                          On Sep 9,

                                                          2016, at 10:56

                                                          AM, frank <<a>hengjiew@uci.edu</a>> wrote:<br>

                                                          >>><br>

                                                          >>>

                                                          Hi,<br>

                                                          >>><br>

                                                          >>> I

                                                          want to

                                                          continue

                                                          digging into

                                                          the memory

                                                          problem here.<br>

                                                          >>> I

                                                          did find a

                                                          work around in

                                                          the past,

                                                          which is to

                                                          use fewer cores

                                                          per node so

                                                          that each core

                                                          has 8G memory.

                                                          However this

                                                          is inefficient

                                                          and expensive.

                                                          I hope to

                                                          locate the

                                                          place that

                                                          uses the most

                                                          memory.<br>

                                                          >>><br>

                                                          >>>

                                                          Here is a

                                                          brief summary

                                                          of the tests I

                                                          did in the past:<br>

>>>> Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12<br>

                                                          >>>

                                                          Maximum (over

                                                          computational

                                                          time) process

                                                          memory:       

                                                             total

                                                          7.0727e+08<br>

                                                          >>>

                                                          Current

                                                          process

                                                          memory:       

                                                                 total

                                                          7.0727e+08<br>

                                                          >>>

                                                          Maximum (over

                                                          computational

                                                          time) space

                                                          PetscMalloc()ed: 

                                                          total

                                                          6.3908e+11<br>

                                                          >>>

                                                          Current space

PetscMalloc()ed:                                                total

                                                          1.8275e+09<br>

                                                          >>><br>

>>>> Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24<br>

                                                          >>>

                                                          Maximum (over

                                                          computational

                                                          time) process

                                                          memory:       

                                                             total

                                                          5.9431e+09<br>

                                                          >>>

                                                          Current

                                                          process

                                                          memory:       

                                                                 total

                                                          5.9431e+09<br>

                                                          >>>

                                                          Maximum (over

                                                          computational

                                                          time) space

                                                          PetscMalloc()ed: 

                                                          total

                                                          5.3202e+12<br>

                                                          >>>

                                                          Current space

PetscMalloc()ed:                                                 total

                                                          5.4844e+09<br>

                                                          >>><br>

>>>> Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24<br>

                                                          >>> 

                                                             OOM( Out Of

                                                          Memory )

                                                          killer of the

                                                          supercomputer

                                                          terminated the

                                                          job during

                                                          "KSPSolve".<br>

                                                          >>><br>

                                                          >>> I

                                                          attached the

                                                          output of

                                                          ksp_view( the

                                                          third test's

                                                          output is from

                                                          ksp_view_pre

                                                          ), memory_view

                                                          and also the

                                                          petsc options.<br>

                                                          >>><br>

                                                          >>>

                                                          In all the

                                                          tests, each

                                                          core can

                                                          access about

                                                          2G memory. In

                                                          test3, there

                                                          are 4223139840

                                                          non-zeros in

                                                          the matrix.

                                                          This will

                                                          consume about

                                                          1.74 MB per process, using

                                                          double

                                                          precision.

                                                          Considering

                                                          some extra

                                                          memory used to

                                                          store integer

                                                          index, 2G

                                                          memory should

                                                          still be more than

                                                          enough.<br>

                                                          >>><br>

                                                          >>>

                                                          Is there a way

                                                          to find out

                                                          which part of

                                                          KSPSolve uses

                                                          the most

                                                          memory?<br>

                                                          >>>

                                                          Thank you so

                                                          much.<br>

                                                          >>><br>

                                                          >>>

                                                          BTW, there are

                                                          4 options that

                                                          remain unused

                                                          and I don't

                                                          understand why

                                                          they are

                                                          omitted:<br>

                                                          >>>

                                                          -mg_coarse_telescope_mg_coarse<wbr>_ksp_type

                                                          value: preonly<br>

                                                          >>>

                                                          -mg_coarse_telescope_mg_coarse<wbr>_pc_type

                                                          value: bjacobi<br>

                                                          >>>

                                                          -mg_coarse_telescope_mg_levels<wbr>_ksp_max_it

                                                          value: 1<br>

                                                          >>>

                                                          -mg_coarse_telescope_mg_levels<wbr>_ksp_type

                                                          value:

                                                          richardson<br>

                                                          >>><br>

                                                          >>><br>

                                                          >>>

                                                          Regards,<br>

                                                          >>>

                                                          Frank<br>

                                                          >>><br>

                                                          >>>

                                                          On 07/13/2016

                                                          05:47 PM, Dave

                                                          May wrote:<br>

>>>><br>

>>>> On 14 July 2016 at 01:07, frank <<a>hengjiew@uci.edu</a>>

                                                          wrote:<br>

>>>> Hi Dave,<br>

>>>><br>

>>>> Sorry for the late reply.<br>

>>>> Thank you so much for your detailed reply.<br>

>>>><br>

>>>> I have a question about the estimation of the memory

                                                          usage. There

                                                          are 4223139840

                                                          allocated

                                                          non-zeros and

                                                          18432 MPI

                                                          processes.

                                                          Double

                                                          precision is

                                                          used. So the

                                                          memory per

                                                          process is:<br>

>>>>   4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ?<br>

>>>> Did I do something wrong here? Because this seems too small.<br>

>>>><br>

>>>> No - I totally f***ed it up. You are correct. That'll

                                                          teach me for

                                                          fumbling

                                                          around with my

                                                          iphone

                                                          calculator and

                                                          not using my

                                                          brain. (Note

                                                          that to

                                                          convert to MB

                                                          just divide by

                                                          1e6, not

                                                          1024^2 -

                                                          although I

                                                          apparently

                                                          cannot convert

                                                          between units

                                                          correctly....)<br>
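<br>
As a quick sanity check of the per-process figure discussed above, here is a minimal Python sketch. It assumes, as the thread does, that the non-zeros are spread evenly over all ranks, and it counts only the 8-byte values (no integer indices). The function name bytes_per_rank is made up for illustration.<br>
<pre>
```python
NNZ = 4223139840        # allocated non-zeros reported in the thread
RANKS = 96 * 8 * 24     # process mesh 96*8*24 = 18432 MPI processes
BYTES_PER_DOUBLE = 8    # double precision


def bytes_per_rank(nnz, ranks, bytes_per_value=BYTES_PER_DOUBLE):
    """Average bytes of matrix values held by one rank (even split assumed)."""
    return nnz * bytes_per_value / ranks


per_rank = bytes_per_rank(NNZ, RANKS)
mb = per_rank / 1e6          # decimal megabytes
mib = per_rank / 1024**2     # binary mebibytes
print(f"{per_rank:.0f} bytes per rank = {mb:.2f} MB = {mib:.2f} MiB")
```
</pre>
The two unit conventions give 1.83 MB (dividing by 1e6) and 1.75 MiB (dividing by 1024^2). Either way the matrix values amount to megabytes per rank, not gigabytes.<br>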

>>>><br>

>>>> From the PETSc objects associated with the solver, it

                                                          looks like it

                                                          _should_ run

                                                          with 2GB per

                                                          MPI rank.

                                                          Sorry for my

                                                          mistake.

                                                          Possibilities

                                                          are: somewhere

                                                          in your usage

                                                          of PETSc

                                                          you've

                                                          introduced a

                                                          memory leak;

                                                          PETSc is doing

                                                          a huge over

                                                          allocation

                                                          (e.g. as per

                                                          our discussion

                                                          of MatPtAP);

                                                          or in your

                                                          application

                                                          code there are

                                                          other objects

                                                          you have

                                                          forgotten to

                                                          log the memory

                                                          for.<br>

>>>><br>

>>>><br>

>>>><br>

>>>> I am running this job on Blue Waters.<br>

>>>> I am using the 7 points FD stencil in 3D.<br>

>>>><br>

>>>> I thought so on both counts.<br>

>>>><br>

>>>> I apologize that I made a stupid mistake in computing

                                                          the memory per

                                                          core. My

                                                          settings

                                                          mean that each

                                                          core can

                                                          access only 2G

                                                          memory on

                                                          average

                                                          instead of 8G

                                                          which I

                                                          mentioned in

                                                          previous

                                                          email. I

                                                          re-ran the job

                                                          with 8G memory

                                                          per core on

                                                          average and

                                                          there is no

                                                          "Out Of

                                                          Memory" error.

                                                          I will do

                                                          more tests to

                                                          see if there

                                                          is still some

                                                          memory issue.<br>

>>>><br>

>>>> Ok. I'd still like to know where the memory was being

                                                          used since my

                                                          estimates were

                                                          off.<br>

>>>><br>

>>>><br>

>>>> Thanks,<br>

>>>>   Dave<br>

>>>><br>

>>>> Regards,<br>

>>>> Frank<br>

>>>><br>

>>>><br>

>>>><br>

>>>> On 07/11/2016 01:18 PM, Dave May wrote:<br>

>>>>> Hi Frank,<br>

>>>>><br>

>>>>><br>

>>>>> On 11 July 2016 at 19:14, frank <<a>hengjiew@uci.edu</a>>

                                                          wrote:<br>

>>>>> Hi Dave,<br>

>>>>><br>

>>>>> I re-ran the test using bjacobi as the

                                                          preconditioner

                                                          on the coarse

                                                          mesh of

                                                          telescope. The

                                                          Grid is

                                                          3072*256*768

                                                          and process

                                                          mesh is

                                                          96*8*24. The

                                                          petsc option

                                                          file is

                                                          attached.<br>

>>>>> I still got the "Out Of Memory" error. The error

                                                          occurred

                                                          before the

                                                          linear solver

                                                          finished one

                                                          step. So I

                                                          don't have the

                                                          full info from

                                                          ksp_view. The

                                                          info from

                                                          ksp_view_pre

                                                          is attached.<br>

>>>>><br>

>>>>> Okay - that is essentially useless (sorry)<br>

>>>>><br>

>>>>> It seems to me that the error occurred when the

                                                          decomposition

                                                          was going to

                                                          be changed.<br>

>>>>><br>

>>>>> Based on what information?<br>

>>>>> Running with -info would give us more clues, but

                                                          will create a

                                                          ton of output.<br>

>>>>> Please try running the case which failed with -info<br>

>>>>>  I had another test with a grid of 1536*128*384 and

                                                          the same

                                                          process mesh

                                                          as above.

                                                          There was no

                                                          error. The

                                                          ksp_view info

                                                          is attached

                                                          for

                                                          comparison.<br>

>>>>> Thank you.<br>

>>>>><br>

>>>>><br>

>>>>> [3] Here is my crude estimate of your memory usage.<br>

>>>>> I'll target the biggest memory hogs only to get an

                                                          order of

                                                          magnitude

                                                          estimate<br>

>>>>><br>

>>>>> * The Fine grid operator contains 4223139840

                                                          non-zeros

                                                          --> 1.8 GB

                                                          per MPI rank

                                                          assuming

                                                          double

                                                          precision.<br>

>>>>> The indices for the AIJ could amount to another 0.3

                                                          GB (assuming

                                                          32 bit

                                                          integers)<br>

>>>>><br>

>>>>> * You use 5 levels of coarsening, so the other

                                                          operators

                                                          should

                                                          represent

                                                          (collectively)<br>

>>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per

                                                          MPI rank on

                                                          the

                                                          communicator

                                                          with 18432

                                                          ranks.<br>

>>>>> The coarse grid should consume ~ 0.5 MB per MPI

                                                          rank on the

                                                          communicator

                                                          with 18432

                                                          ranks.<br>

>>>>><br>

>>>>> * You use a reduction factor of 64, making the new

                                                          communicator

                                                          with 288 MPI

                                                          ranks.<br>

>>>>> PCTelescope will first gather a temporary matrix

                                                          associated

                                                          with your

                                                          coarse level

                                                          operator

                                                          assuming a

                                                          comm size of

                                                          288 living on

                                                          the comm with

                                                          size 18432.<br>

>>>>> This matrix will require approximately 0.5 * 64 =

                                                          32 MB per core

                                                          on the 288

                                                          ranks.<br>

>>>>> This matrix is then used to form a new MPIAIJ

                                                          matrix on the

                                                          subcomm, thus

                                                          requiring

                                                          another 32 MB

                                                          per rank.<br>

>>>>> The temporary matrix is now destroyed.<br>

>>>>><br>

>>>>> * Because a DMDA is detected, a permutation matrix

                                                          is assembled.<br>

>>>>> This requires 2 doubles per point in the DMDA.<br>

>>>>> Your coarse DMDA contains 92 x 16 x 48 points.<br>

>>>>> Thus the permutation matrix will require < 1 MB

                                                          per MPI rank

                                                          on the

                                                          sub-comm.<br>

>>>>><br>

>>>>> * Lastly, the matrix is permuted. This uses

                                                          MatPtAP(), but

                                                          the resulting

                                                          operator will

                                                          have the same

                                                          memory

                                                          footprint as

                                                          the unpermuted

                                                          matrix (32

                                                          MB). At any

                                                          stage in

                                                          PCTelescope,

                                                          only 2

                                                          operators of

                                                          size 32 MB are

                                                          held in memory

                                                          when the DMDA

                                                          is provided.<br>

>>>>><br>

>>>>> From my rough estimates, the worst case memory

                                                          footprint for any

                                                          given core,

                                                          given your

                                                          options is

                                                          approximately<br>

>>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB<br>

>>>>> This is way below 8 GB.<br>
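<br>
The tally above can be re-added with a few lines of Python. This is only a sketch that repeats the figures exactly as quoted in this message (2.1 GB for the fine grid operator plus indices, a coarsening factor of 8 per level, a reduction factor of 64). It does not check whether those inputs are right, and the fine-grid figure is revised in the later reply in this thread. All variable names are made up for illustration.<br>
<pre>
```python
FINE_GB = 2.1            # fine grid operator: 1.8 GB values + 0.3 GB indices (as quoted)
RANKS = 18432            # size of the full communicator
REDUCTION = 64           # PCTelescope reduction factor

# Coarser multigrid operators: each level is 8x smaller in 3D.
coarser_gb = sum(FINE_GB / 8**k for k in range(1, 5))

# Ranks on the telescope sub-communicator.
sub_ranks = RANKS // REDUCTION

# Coarse operator gathered onto the sub-communicator: 0.5 MB * 64 per rank.
gathered_mb = 0.5 * REDUCTION

# Worst-case sum quoted in the message, in MB:
# fine + coarser levels + temporary matrix + MPIAIJ on subcomm + permutation.
total_mb = 2100 + 300 + gathered_mb + gathered_mb + 1

print(f"coarser levels: {coarser_gb * 1000:.0f} MB, sub-comm ranks: {sub_ranks}")
print(f"gathered coarse operator: {gathered_mb:.0f} MB, total: {total_mb:.0f} MB")
```
</pre>
The sum reproduces the 2465 MB quoted above, so the addition itself is consistent; any error has to come from the per-term inputs.<br>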

>>>>><br>

>>>>> Note this estimate completely ignores:<br>

>>>>> (1) the memory required for the restriction

                                                          operator,<br>

>>>>> (2) the potential growth in the number of non-zeros

                                                          per row due to

                                                          Galerkin

                                                          coarsening (I

                                                          wish

                                                          -ksp_view_pre

                                                          reported the

                                                          output from

                                                          MatView so we

                                                          could see the

                                                          number of

                                                          non-zeros

                                                          required by

                                                          the coarse

                                                          level

                                                          operators)<br>

>>>>> (3) all temporary vectors required by the CG

                                                          solver, and

                                                          those required

                                                          by the

                                                          smoothers.<br>

>>>>> (4) internal memory allocated by MatPtAP<br>

>>>>> (5) memory associated with IS's used within

                                                          PCTelescope<br>

>>>>><br>

>>>>> So either I am completely off in my estimates, or

                                                          you have not

                                                          carefully

                                                          estimated the

                                                          memory usage

                                                          of your

                                                          application

                                                          code.

                                                          Hopefully

                                                          others might

                                                          examine/correct

                                                          my rough

                                                          estimates<br>

>>>>><br>

>>>>> Since I don't have your code I cannot assess the

                                                          latter.<br>

>>>>> Since I don't have access to the same machine you

                                                          are running

                                                          on, I think we

                                                          need to take a

                                                          step back.<br>

>>>>><br>

>>>>> [1] What machine are you running on? Send me a URL

                                                          if it's

                                                          available<br>

>>>>><br>

>>>>> [2] What discretization are you using? (I am

                                                          guessing a

                                                          scalar 7 point

                                                          FD stencil)<br>

>>>>> If it's a 7 point FD stencil, we should be able to

                                                          examine the

                                                          memory usage

                                                          of your solver

                                                          configuration

                                                          using a

                                                          standard,

                                                          lightweight

                                                          existing PETSc

                                                          example, run

                                                          on your

                                                          machine at the

                                                          same scale.<br>

>>>>> This would hopefully enable us to correctly

                                                          evaluate the

                                                          actual memory

                                                          usage required

                                                          by the solver

                                                          configuration

                                                          you are using.<br>

>>>>><br>

>>>>> Thanks,<br>

>>>>>   Dave<br>

>>>>><br>

>>>>><br>

>>>>> Frank<br>

>>>>><br>

>>>>><br>

>>>>><br>

>>>>><br>

>>>>> On 07/08/2016 10:38 PM, Dave May wrote:<br>

>>>>>><br>

>>>>>> On Saturday, 9 July 2016, frank <<a>hengjiew@uci.edu</a>>

                                                          wrote:<br>

>>>>>> Hi Barry and Dave,<br>

>>>>>><br>

>>>>>> Thank both of you for the advice.<br>

>>>>>><br>

>>>>>> @Barry<br>

>>>>>> I made a mistake in the file names in the last

                                                          email. I

                                                          attached the

                                                          correct files

                                                          this time.<br>

>>>>>> For all the three tests, 'Telescope' is used as

                                                          the coarse

                                                          preconditioner.<br>

>>>>>><br>

>>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh:

                                                          48*4*12<br>

>>>>>> Part of the memory usage:  Vector   125       

                                                              124

                                                          3971904     0.<br>

>>>>>>                                             

                                                          Matrix   101

                                                          101     

                                                          9462372     0<br>

>>>>>><br>

>>>>>> == Test2: Grid: 1536*128*384,   Process Mesh:

                                                          96*8*24<br>

>>>>>> Part of the memory usage:  Vector   125       

                                                              124

                                                          681672     0.<br>

>>>>>>                                             

                                                          Matrix   101

                                                          101     

                                                          1462180     0.<br>

>>>>>><br>

>>>>>> In theory, the memory usage in Test1 should be

                                                          8 times that of

                                                          Test2. In my

                                                          case, it is

                                                          about 6 times.<br>

>>>>>><br>

>>>>>> == Test3: Grid: 3072*256*768,   Process Mesh:

                                                          96*8*24.

                                                          Sub-domain per

                                                          process:

                                                          32*32*32<br>

>>>>>> Here I get the out of memory error.<br>
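As a rough cross-check of the numbers quoted for Test1 and Test2, the sketch below recomputes the sub-domain per process from each grid and process mesh and compares the expected memory ratio with the logged one. It assumes the grid divides evenly over the process mesh and that the third number in each log excerpt is the memory in bytes; the helper name local_subdomain is only illustrative and is not part of PETSc.<br>

```python
from math import prod


def local_subdomain(grid, mesh):
    """Grid points per process along each axis (assumes even division)."""
    return tuple(g // p for g, p in zip(grid, mesh))


# (grid, process mesh) for the three tests quoted in the message
tests = {
    "Test1": ((1536, 128, 384), (48, 4, 12)),
    "Test2": ((1536, 128, 384), (96, 8, 24)),
    "Test3": ((3072, 256, 768), (96, 8, 24)),
}

# Total grid points owned by one process in each test
local_points = {
    name: prod(local_subdomain(grid, mesh)) for name, (grid, mesh) in tests.items()
}

# Memory figures copied from the log excerpts (assumed to be bytes)
logged = {
    "Test1": {"Vector": 3971904, "Matrix": 9462372},
    "Test2": {"Vector": 681672, "Matrix": 1462180},
}

# If memory scaled purely with local problem size, this is the Test1/Test2 ratio
expected_ratio = local_points["Test1"] / local_points["Test2"]

# What the logs actually show
observed_ratio = {
    kind: logged["Test1"][kind] / logged["Test2"][kind]
    for kind in ("Vector", "Matrix")
}

for name, (grid, mesh) in tests.items():
    print(name, "sub-domain per process:", local_subdomain(grid, mesh))
print("expected Test1/Test2 ratio:", expected_ratio)
for kind, ratio in observed_ratio.items():
    print(kind, "observed Test1/Test2 ratio:", round(ratio, 2))
```

Under these assumptions Test1 and Test3 have the same 32*32*32 sub-domain per process, so a simple per-process size argument alone would not separate the run that fits from the one that fails.<br>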

>>>>>><br>

>>>>>> I tried to use -mg_coarse jacobi. In this way,

                                                          I don't need

                                                          to set

                                                          -mg_coarse_ksp_type

                                                          and

                                                          -mg_coarse_pc_type

                                                          explicitly,

                                                          right?<br>

>>>>>> The linear solver didn't work in this case.

                                                          Petsc output

                                                          some errors.<br>

>>>>>><br>

>>>>>> @Dave<br>

>>>>>> In test3, I use only one instance of

                                                          'Telescope'.

                                                          On the coarse

                                                          mesh of

                                                          'Telescope', I

                                                          used LU as the

                                                          preconditioner

                                                          instead of

                                                          SVD.<br>

>>>>>> If I set the levels correctly, then on the

                                                          last coarse

                                                          mesh of MG

                                                          where it calls

                                                          'Telescope',

                                                          the sub-domain

                                                          per process is

                                                          2*2*2.<br>

>>>>>> On the last coarse mesh of 'Telescope', there

                                                          is only one

                                                          grid point per

                                                          process.<br>

>>>>>> I still got the OOM error. The detailed petsc

                                                          option file is

                                                          attached.<br>

>>>>>><br>

>>>>>> Do you understand the expected memory usage for

                                                          the particular

                                                          parallel LU

                                                          implementation

                                                          you are using?

                                                          I don't

                                                          (seriously).

                                                          Replace LU

                                                          with bjacobi

                                                          and re-run

                                                          this test. My

                                                          point about

                                                          solver

                                                          debugging is

                                                          still valid.<br>

>>>>>><br>

>>>>>> And please send the result of KSPView so we can

                                                          see what is

                                                          actually used

                                                          in the

                                                          computations<br>

>>>>>><br>

>>>>>> Thanks<br>

>>>>>>   Dave<br>

>>>>>><br>

>>>>>><br>

>>>>>> Thank you so much.<br>

>>>>>><br>

>>>>>> Frank<br>

>>>>>><br>

>>>>>><br>

>>>>>><br>

>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:<br>

>>>>>> On Jul 6, 2016, at 4:19 PM, frank <<a>hengjiew@uci.edu</a>>

                                                          wrote:<br>

>>>>>><br>

>>>>>> Hi Barry,<br>

>>>>>><br>

>>>>>> Thank you for your advice.<br>

>>>>>> I tried three tests. In the 1st test, the grid

                                                          is

                                                          3072*256*768

                                                          and the

                                                          process mesh

                                                          is 96*8*24.<br>

>>>>>> The linear solver is 'cg', the preconditioner is

                                                          'mg' and

                                                          'telescope' is

                                                          used as the

                                                          preconditioner

                                                          at the coarse

                                                          mesh.<br>

>>>>>> The system gives me the "Out of Memory" error

                                                          before the

                                                          linear system

                                                          is completely

                                                          solved.<br>

>>>>>> The info from '-ksp_view_pre' is attached. It

                                                          seems to me

                                                          that the error

                                                          occurs when it

                                                          reaches the

                                                          coarse mesh.<br>

>>>>>><br>

>>>>>> The 2nd test uses a grid of 1536*128*384 and

                                                          process mesh

                                                          is 96*8*24.

                                                          The 3rd       

                                                                   test

                                                          uses the same

                                                          grid but a

                                                          different

                                                          process mesh

                                                          48*4*12.<br>

>>>>>>     Are you sure this is right? The total

                                                          matrix and

                                                          vector memory

                                                          usage goes

                                                          from 2nd test<br>

>>>>>>                Vector   384            383     

                                                          8,193,712   

                                                           0.<br>

>>>>>>                Matrix   103            103   

                                                           11,508,688   

                                                           0.<br>

>>>>>> to 3rd test<br>

>>>>>>               Vector   384            383     

                                                          1,590,520   

                                                           0.<br>

>>>>>>                Matrix   103            103     

                                                          3,508,664   

                                                           0.<br>

>>>>>> that is the memory usage got smaller but if you

                                                          have only

                                                          1/8th the

                                                          processes and

                                                          the same grid

                                                          it should have

                                                          gotten about 8

                                                          times bigger.

                                                          Did you maybe

                                                          cut the grid

                                                          by a factor of

                                                          8 also? If so

                                                          that still

                                                          doesn't

                                                          explain it

                                                          because the

                                                          memory usage

                                                          changed by a

                                                          factor of 5

                                                          something for

                                                          the vectors

                                                          and 3

                                                          something for

                                                          the matrices.<br>

>>>>>><br>

>>>>>><br>

>>>>>> The linear solver and petsc options in 2nd and

                                                          3rd tests are

                                                          the same as in

                                                          the 1st test. The

                                                          linear solver

                                                          works fine in

                                                          both tests.<br>

>>>>>> I attached the memory usage of the 2nd and 3rd

                                                          tests. The

                                                          memory info is

                                                          from the

                                                          option

                                                          '-log_summary'.

                                                          I tried to use

                                                          '-memory_info'

                                                          as you

                                                          suggested, but

                                                          in my case

                                                          petsc treated

                                                          it as an

                                                          unused option.

                                                          It output

                                                          nothing about

                                                          the memory. Do

                                                          I need to add

                                                          something to my code

                                                          so I can use

                                                          '-memory_info'?<br>

>>>>>>     Sorry, my mistake the option is

                                                          -memory_view<br>

>>>>>><br>

>>>>>>    Can you run the one case with -memory_view

                                                          and -mg_coarse

                                                          jacobi

                                                          -ksp_max_it 1

                                                          (just so it

                                                          doesn't

                                                          iterate

                                                          forever) to

                                                          see how much

                                                          memory is used

                                                          without the

                                                          telescope?

                                                          Also run case

                                                          2 the same

                                                          way.<br>

>>>>>><br>

>>>>>>    Barry<br>

>>>>>><br>

>>>>>><br>

>>>>>><br>

>>>>>> In both tests the memory usage is not large.<br>

>>>>>><br>

>>>>>> It seems to me that it might be the

                                                          'telescope' 

                                                          preconditioner

                                                          that allocated

                                                          a lot of

                                                          memory and

                                                          caused the

                                                          error in the

                                                          1st test.<br>

>>>>>> Is there a way to show how much memory it

                                                          allocated?<br>

>>>>>><br>

>>>>>> Frank<br>

>>>>>><br>

>>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:<br>

>>>>>>    Frank,<br>

>>>>>><br>

>>>>>>      You can run with -ksp_view_pre to have it

                                                          "view" the KSP

                                                          before the

                                                          solve so

                                                          hopefully it

                                                          gets that far.<br>

>>>>>><br>

>>>>>>       Please run the problem that does fit with

                                                          -memory_info

                                                          when the

                                                          problem

                                                          completes it

                                                          will show the

                                                          "high water

                                                          mark" for

                                                          PETSc

                                                          allocated

                                                          memory and

                                                          total memory

                                                          used. We first

                                                          want to look

                                                          at these

                                                          numbers to see

                                                          if it is using

                                                          more memory

                                                          than you

                                                          expect. You

                                                          could also run

                                                          with say half

                                                          the grid

                                                          spacing to see

                                                          how the memory

                                                          usage scaled

                                                          with the

                                                          increase in

                                                          grid points.

                                                          Make the runs

                                                          also with

                                                          -log_view and

                                                          send all the

                                                          output from

                                                          these options.<br>

>>>>>><br>

>>>>>>     Barry<br>

>>>>>><br>

>>>>>> On Jul 5, 2016, at 5:23 PM, frank <<a>hengjiew@uci.edu</a>>

                                                          wrote:<br>

>>>>>><br>

>>>>>> Hi,<br>

>>>>>><br>

>>>>>> I am using the CG ksp solver and Multigrid

                                                          preconditioner 

                                                          to solve a

                                                          linear system

                                                          in parallel.<br>

>>>>>> I chose to use the 'Telescope' as the

                                                          preconditioner

                                                          on the coarse

                                                          mesh for its

                                                          good

                                                          performance.<br>

>>>>>> The petsc options file is attached.<br>

>>>>>><br>

>>>>>> The domain is a 3d box.<br>

>>>>>> It works well when the grid is  1536*128*384

                                                          and the

                                                          process mesh

                                                          is 96*8*24.

                                                          When I double

                                                          the size of

                                                          the grid and

                                                           keep the same

                                                          process mesh

                                                          and petsc

                                                          options, I get

                                                          an "out of

                                                          memory" error

                                                          from the

                                                          super-cluster

                                                          I am using.<br>

>>>>>> Each process has access to at least 8G memory,

                                                          which should

                                                          be more than

                                                          enough for my

                                                          application. I

                                                          am sure that

                                                          all the other

                                                          parts of my

                                                          code (except

                                                          the linear

                                                          solver) do

                                                          not use much

                                                          memory. So I

                                                          suspect there

                                                          is something

                                                          wrong with the

                                                          linear solver.<br>

>>>>>> The error occurs before the linear system is

                                                          completely

                                                          solved so I

                                                          don't have the

                                                          info from ksp

                                                          view. I am not

                                                          able to

                                                          re-produce the

                                                          error with a

                                                          smaller

                                                          problem

                                                          either.<br>

>>>>>> In addition,  I tried to use the block jacobi

                                                          as the

                                                          preconditioner

                                                          with the same

                                                          grid and same

                                                          decomposition.

                                                          The linear

                                                          solver runs

                                                          extremely slowly

                                                          but there is

                                                          no memory

                                                          error.<br>

>>>>>><br>

>>>>>> How can I diagnose what exactly causes the

                                                          error?<br>

>>>>>> Thank you so much.<br>

>>>>>><br>

>>>>>> Frank<br>

>>>>>> <petsc_options.txt><br>

>>>>>> <ksp_view_pre.txt><memory_test<wbr>2.txt><memory_test3.txt><petsc<wbr>_options.txt><br>

>>>>>><br>

>>>>><br>

>>>><br>

                                                          >>>

                                                          <ksp_view1.txt><ksp_view2.txt><wbr><ksp_view3.txt><memory1.txt><m<wbr>emory2.txt><petsc_options1.txt<wbr>><petsc_options2.txt><petsc_op<wbr>tions3.txt><br>

                                                          ><br>

                                                          <br>

                                                          </blockquote>

                                                        </div>

                                                      </blockquote>

                                                      <br>

                                                    </div>

                                                  </blockquote>

                                                  <div> </div>

                                                </blockquote>

                                              </div>

                                            </blockquote>

                                            <br>

                                          </div>

                                        </div>

                                      </div>

                                    </blockquote>

                                  </div>

                                  <br>

                                </div>

                              </blockquote>

                              <br>

                            </div>

                          </blockquote>

                          <div> </div>

                          <div> </div>

                          <div> </div>

                        </div>

                      </div>

                    </blockquote>

                    <br>

                  </div>

                </div>

              </div>

            </blockquote>

          </div>

          <br>

          <br clear="all">

          <div><br>

          </div>

          -- <br>

          <div class="m_-6949234570636630927gmail_signature" data-smartmail="gmail_signature">What

            most experimenters take for granted before they begin their

            experiments is infinitely more interesting than any results

            to which their experiments lead.<br>

            -- Norbert Wiener</div>

        </div>

      </div>

    </blockquote>

    <br>

  </div></div></div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>