<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Junchao,<br>
<br>
Attached is a graph of total RSS from my Mac using openmpi and
mpich (installed with --download-openmpi and --download-mpich).<br>
<br>
The difference is pretty stark! The WaitAll( ) in my part of the
code fixed the runaway memory<br>
problem using openmpi but definitely not with mpich.<br>
<br>
Tomorrow I hope to get my linux box set up; unfortunately it needs
an OS update :(<br>
Then I can try to run there and reproduce the same behavior (or find out it
is a Mac quirk, though the<br>
reason I started looking at this was that a user on an HPC system
pointed it out to me).<br>
<br>
-sanjay<br>
<br>
PS: To generate the data, all I did was place a call to
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by
an MPI_Allreduce( ) to sum across the job (4 processors).<br>
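<br>
A minimal sketch of that instrumentation (written here in C with
placeholder names for the solver objects; the actual code and call
sites differ) is:<br>
<pre>
/* Hedged sketch: ksp, b, x are placeholders, not the names used in
   the actual application code. */
#include <petscksp.h>

PetscErrorCode ReportTotalRSS(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss, rss_total;

  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);   /* per-rank RSS in bytes */
  ierr = MPI_Allreduce(&rss, &rss_total, 1, MPI_DOUBLE, MPI_SUM,
                       PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "Total RSS after KSPSolve: %g\n",
                     rss_total);CHKERRQ(ierr);
  return 0;
}
</pre>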
<pre class="moz-signature" cols="72">
</pre>
<div class="moz-cite-prefix">On 6/4/19 4:27 PM, Zhang, Junchao
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+MQGp-GEq1D1tykJ1L6VVy6K0fuzUh3eJP=ZFafbPnBh8DPVg@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr">Hi, Sanjay,
<div> I managed to use Valgrind massif + MPICH master + PETSc
master. I ran ex5 for 500 time steps with "mpirun -n 4 valgrind
--tool=massif --max-snapshots=200 --detailed-freq=1 ./ex5
-da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps
500 -malloc"</div>
<div> I visualized the output with massif-visualizer. From the
attached picture, we can see that the total heap size stays
constant most of the time and is NOT
monotonically increasing. We can also see that MPI only allocated
memory at initialization time and then kept it. So it is unlikely
that MPICH keeps allocating memory in each KSPSolve call.</div>
<div> From the graphs you sent, I can only see that RSS randomly
increases after KSPSolve, but that does not mean the heap size
keeps increasing. I recommend you also profile your code with
valgrind massif and visualize the result. I failed to install
massif-visualizer on MacBook and CentOS, but I easily got it
installed on Ubuntu.</div>
<div> I want you to confirm that with the MPI_Waitall fix, you
still run out of memory with MPICH (but not OpenMPI). If
needed, I can hack MPICH to get its current memory usage so
that we can calculate its difference after each KSPSolve call.</div>
<div> </div>
<div>
<div><img src="cid:part1.B91F8C7C.750093BF@berkeley.edu"
alt="massif-ex5.png" class="" width="562" height="382"><br>
</div>
</div>
<div><br>
</div>
<div><br clear="all">
<div>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jun 3, 2019 at 6:36 PM
Sanjay Govindjee <<a href="mailto:s_g@berkeley.edu"
moz-do-not-send="true">s_g@berkeley.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">Junchao,<br>
It won't be feasible to share the code, but I will run a
similar test to the one you have done (large problem); I will<br>
try with both MPICH and OpenMPI. I also agree that deltas
are not ideal, as they do not account for latency in
the freeing of memory,<br>
etc. But I will note that when we have the memory growth issue,
latency associated with free( ) appears not to be in play,
since the total<br>
memory footprint grows monotonically.<br>
<br>
I'll also have a look at massif. If you figure out the
interface and can send me the lines to instrument the code
with, that will save me<br>
some time.<br>
-sanjay<br>
<div class="gmail-m_3580431406949325625moz-cite-prefix">On
6/3/19 3:17 PM, Zhang, Junchao wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Sanjay & Barry,
<div> Sorry, I made a mistake when I said I could
reproduce Sanjay's experiments. I found that 1) to
correctly use PetscMallocGetCurrentUsage() when PETSc
is configured without debugging, I have to add -malloc
when running the program; 2) I have to instrument the code
outside of KSPSolve(). In my case, that is
in SNESSolve_NEWTONLS. In the old experiments, I did it
inside KSPSolve. Since KSPSolve can recursively call
KSPSolve, the old results were misleading.</div>
<div> With these fixes, I measured the differences in RSS
and PETSc malloc'd memory before/after KSPSolve. I ran the
experiments on my MacBook
using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
with commands like mpirun -n 4 ./ex5 -da_grid_x 64
-da_grid_y 64 -ts_type beuler -ts_max_steps 500
-malloc.</div>
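<div> (For illustration only, the before/after measurement can be
sketched roughly as below, with made-up variable names; the real
instrumentation sits inside SNESSolve_NEWTONLS rather than in user
code.)</div>
<pre>
/* Rough sketch of the delta measurement described above. */
#include <petscksp.h>

PetscErrorCode MeasureKSPSolveDeltas(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss0, rss1, mal0, mal1, drss, dmal;

  ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);  /* needs -malloc in optimized builds */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
  drss = rss1 - rss0;
  dmal = mal1 - mal0;
  /* sum the per-rank deltas across the job */
  ierr = MPI_Allreduce(MPI_IN_PLACE, &drss, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = MPI_Allreduce(MPI_IN_PLACE, &dmal, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g, RSS End=%g\n",
                     drss, dmal, rss1);CHKERRQ(ierr);
  return 0;
}
</pre>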
<div> I find that if the grid size is small, I see a
non-zero RSS delta randomly, with either one MPI rank
or multiple ranks, and with MPICH or OpenMPI. If I
increase the grid size, e.g., -da_grid_x 256 -da_grid_y
256, I only see non-zero RSS deltas randomly in the
first few iterations (with MPICH or OpenMPI). When the
computer's workload is high because ex5-openmpi and
ex5-mpich are running simultaneously, the MPICH run shows
many more non-zero RSS deltas. But the "Malloc Delta" behavior
is stable across all runs: there is only one nonzero
malloc delta value, in the first KSPSolve call, and all the
remaining ones are zero. Something like this:</div>
<blockquote style="margin:0px 0px 0px
40px;border:none;padding:0px">
<div><font face="courier new, monospace">mpirun -n 4
./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type
beuler -ts_max_steps 500 -malloc</font></div>
<div><font face="courier new, monospace">RSS Delta=
32489472, Malloc Delta= 26290304, RSS
End= 136114176</font></div>
<div><font face="courier new, monospace">RSS Delta=
32768, Malloc Delta= 0, RSS
End= 138510336</font></div>
<div><font face="courier new, monospace">RSS Delta=
0, Malloc Delta= 0, RSS
End= 138522624</font></div>
<div><font face="courier new, monospace">RSS Delta=
0, Malloc Delta= 0, RSS
End= 138539008</font></div>
</blockquote>
<div>So I think I can conclude that there is no unfreed
memory allocated by PETSc in KSPSolve(). Has MPICH
allocated unfreed memory in KSPSolve? That is possible,
and I am trying to find a way, like
PetscMallocGetCurrentUsage(), to measure that. Also, I
think the RSS delta is not a good way to measure memory
allocation. It is dynamic and depends on the state of the
computer (swap, shared libraries loaded, etc.) when
running the code. We should focus on malloc instead.
If there were a valgrind tool, like the performance
profiling tools, that could let users measure memory
allocated but not freed in a user-specified code
segment, it would be very helpful in this case. But
I have not found one.<br>
</div>
<div><br>
</div>
<div>Sanjay, did you say that you can currently run with
OpenMPI without running out of memory, but that with MPICH you run
out of memory? Is it feasible to share your code so
that I can test with it? Thanks.</div>
<div><br>
</div>
<div>--Junchao Zhang<br>
</div>
<div><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Jun 1, 2019
at 3:21 AM Sanjay Govindjee <<a
href="mailto:s_g@berkeley.edu" target="_blank"
moz-do-not-send="true">s_g@berkeley.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
Barry,<br>
<br>
If you look at the graphs I generated (on my Mac),
you will see that <br>
OpenMPI and MPICH have very different values (along
with the fact that <br>
MPICH does not seem to adhere<br>
to the standard for releasing MPI_Isend resources
following an MPI_Wait).<br>
<br>
-sanjay<br>
<br>
PS: I agree with Barry's assessment; this is really
not acceptable.<br>
<br>
On 6/1/19 1:00 AM, Smith, Barry F. wrote:<br>
> Junchao,<br>
><br>
> This is insane. Either the OpenMPI
library or something in the OS underneath related to
sockets and interprocess communication is grabbing
additional space for each round of MPI
communication! Does MPICH have the same values or
different values than OpenMPI? When you run on Linux,
do you get the same values as on the Mac, or different?
Same values would indicate the issue is inside
OpenMPI/MPICH; different values would indicate the problem is
more likely at the OS level. Does this happen only
with the default VecScatter that uses blocking MPI?
What happens with PetscSF under Vec? Is it somehow
related to PETSc's use of nonblocking sends and
receives? One could presumably use valgrind to see
exactly what lines in what code are causing these
increases. I don't think we can just shrug and say
this is the way it is; we need to track down and
understand the cause (and fix it if possible).<br>
><br>
> Barry<br>
><br>
><br>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao
<<a href="mailto:jczhang@mcs.anl.gov"
target="_blank" moz-do-not-send="true">jczhang@mcs.anl.gov</a>>
wrote:<br>
>><br>
>> Sanjay,<br>
>> I tried PETSc with MPICH and OpenMPI on my
MacBook. I inserted
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage
at the beginning and end of KSPSolve and then
computed the delta and summed over processes. Then I
tested with
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c<br>
>> With OpenMPI,<br>
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y
128 -ts_type beuler -ts_max_steps 500 > 128.log<br>
>> grep -n -v "RSS Delta= 0, Malloc
Delta= 0" 128.log<br>
>> 1:RSS Delta= 69632, Malloc Delta=
0<br>
>> 2:RSS Delta= 69632, Malloc Delta=
0<br>
>> 3:RSS Delta= 69632, Malloc Delta=
0<br>
>> 4:RSS Delta= 69632, Malloc Delta=
0<br>
>> 9:RSS Delta=9.25286e+06, Malloc Delta=
0<br>
>> 22:RSS Delta= 49152, Malloc Delta=
0<br>
>> 44:RSS Delta= 20480, Malloc Delta=
0<br>
>> 53:RSS Delta= 49152, Malloc Delta=
0<br>
>> 66:RSS Delta= 4096, Malloc Delta=
0<br>
>> 97:RSS Delta= 16384, Malloc Delta=
0<br>
>> 119:RSS Delta= 20480, Malloc Delta=
0<br>
>> 141:RSS Delta= 53248, Malloc Delta=
0<br>
>> 176:RSS Delta= 16384, Malloc Delta=
0<br>
>> 308:RSS Delta= 16384, Malloc Delta=
0<br>
>> 352:RSS Delta= 16384, Malloc Delta=
0<br>
>> 550:RSS Delta= 16384, Malloc Delta=
0<br>
>> 572:RSS Delta= 16384, Malloc Delta=
0<br>
>> 669:RSS Delta= 40960, Malloc Delta=
0<br>
>> 924:RSS Delta= 32768, Malloc Delta=
0<br>
>> 1694:RSS Delta= 20480, Malloc Delta=
0<br>
>> 2099:RSS Delta= 16384, Malloc Delta=
0<br>
>> 2244:RSS Delta= 20480, Malloc Delta=
0<br>
>> 3001:RSS Delta= 16384, Malloc Delta=
0<br>
>> 5883:RSS Delta= 16384, Malloc Delta=
0<br>
>><br>
>> If I increased the grid<br>
>> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y
512 -ts_type beuler -ts_max_steps 500 -malloc_test
>512.log<br>
>> grep -n -v "RSS Delta= 0, Malloc
Delta= 0" 512.log<br>
>> 1:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 2:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 3:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 4:RSS Delta=1.05267e+06, Malloc Delta=
0<br>
>> 13:RSS Delta=1.24932e+08, Malloc Delta=
0<br>
>><br>
>> So we did see RSS increase in 4k-page increments
after KSPSolve. As long as there are no memory leaks, why do
you care about it? Is it because you run out of
memory?<br>
>><br>
>> On Thu, May 30, 2019 at 1:59 PM Smith,
Barry F. <<a href="mailto:bsmith@mcs.anl.gov"
target="_blank" moz-do-not-send="true">bsmith@mcs.anl.gov</a>>
wrote:<br>
>><br>
>> Thanks for the update. So the current
conclusions are that using the Waitall in your code<br>
>><br>
>> 1) solves the memory issue with OpenMPI in
your code<br>
>><br>
>> 2) does not solve the memory issue with
PETSc KSPSolve<br>
>><br>
>> 3) MPICH has memory issues both for your
code and PETSc KSPSolve (despite the Waitall fix)?<br>
>><br>
>> If you literally just comment out the call
to KSPSolve(), is there no growth in
memory usage with OpenMPI?<br>
>><br>
>><br>
>> Both 2 and 3 are concerning; they indicate
possible memory leak bugs in MPICH and/or a failure to free
all MPI resources in KSPSolve().<br>
>><br>
>> Junchao, can you please investigate 2 and 3
with, for example, a TS example that uses the linear
solver (like with -ts_type beuler)? Thanks<br>
>><br>
>><br>
>> Barry<br>
>><br>
>><br>
>><br>
>>> On May 30, 2019, at 1:47 PM, Sanjay
Govindjee <<a href="mailto:s_g@berkeley.edu"
target="_blank" moz-do-not-send="true">s_g@berkeley.edu</a>>
wrote:<br>
>>><br>
>>> Lawrence,<br>
>>> Thanks for taking a look! This is what
I had been wondering about -- my knowledge of MPI is
pretty minimal and<br>
>>> the origins of the routine lie with a
programmer we hired a decade+ back from NERSC. I'll
have to look into<br>
>>> VecScatter. It will be great to
dispense with our roll-your-own routines (we even
have our own reduceALL scattered around the code).<br>
>>><br>
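>>> As an illustration, a minimal VecScatter-based global-to-local
exchange (with hypothetical vector and index-set names) could look
like the following sketch:<br>
<pre>
/* Hedged sketch: gvec, lvec, and the index sets stand in for whatever
   global/ghosted layout the application actually uses. */
#include <petscvec.h>

PetscErrorCode GlobalToLocalExchange(Vec gvec, Vec lvec, IS from, IS to)
{
  PetscErrorCode ierr;
  VecScatter     ctx;

  ierr = VecScatterCreate(gvec, from, lvec, to, &ctx);CHKERRQ(ierr);
  ierr = VecScatterBegin(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(ctx, gvec, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
  return 0;
}
</pre>
>>> (In practice the scatter context would be created once during setup
and reused for every exchange rather than rebuilt each time.)<br>
>>><br>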
>>> Interestingly, the MPI_Waitall has
solved the problem when using OpenMPI, but it still
persists with MPICH. Graphs attached.<br>
>>> I'm going to run with openmpi for now
(but I guess I really still need to figure out what
is wrong with MPICH and WaitALL;<br>
>>> I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).<br>
>>><br>
>>> Regarding MPI_Barrier, it was put in
due to a problem where some processes were finishing up
their sends and receives and exiting the subroutine<br>
>>> before the receiving processes had
completed (which resulted in data loss, as the
buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed<br>
>>> to us. I don't think I can dispense
with it, but I will think about it some more.<br>
>>><br>
>>> I'm not so sure about using MPI_Irecv,
as it will require a bit of rewriting since right
now I process the received<br>
>>> data sequentially after each blocking
MPI_Recv -- clearly slower but easier to code.<br>
>>><br>
>>> Thanks again for the help.<br>
>>><br>
>>> -sanjay<br>
>>><br>
>>> On 5/30/19 4:48 AM, Lawrence Mitchell
wrote:<br>
>>>> Hi Sanjay,<br>
>>>><br>
>>>>> On 30 May 2019, at 08:58,
Sanjay Govindjee via petsc-users <<a
href="mailto:petsc-users@mcs.anl.gov"
target="_blank" moz-do-not-send="true">petsc-users@mcs.anl.gov</a>>
wrote:<br>
>>>>><br>
>>>>> The problem seems to persist
but with a different signature. Graphs attached as
before.<br>
>>>>><br>
>>>>> Totals with MPICH (NB: single
run)<br>
>>>>><br>
>>>>> For the CG/Jacobi
data_exchange_total = 41,385,984; kspsolve_total =
38,289,408<br>
>>>>> For the GMRES/BJACOBI
data_exchange_total = 41,324,544; kspsolve_total =
41,324,544<br>
>>>>><br>
>>>>> Just reading the MPI docs I am
wondering if I need some sort of
MPI_Wait/MPI_Waitall before my MPI_Barrier in the
data exchange routine?<br>
>>>>> I would have thought that with
the blocking receives and the MPI_Barrier,
everything would have fully completed and been cleaned up
before<br>
>>>>> all processes exited the
routine, but perhaps I am wrong about that.<br>
>>>> Skimming the Fortran code you sent,
you do:<br>
>>>><br>
>>>> for i in ...:<br>
>>>> call MPI_Isend(..., req, ierr)<br>
>>>><br>
>>>> for i in ...:<br>
>>>> call MPI_Recv(..., ierr)<br>
>>>><br>
>>>> But you never call MPI_Wait on the
request you got back from the Isend. So the MPI
library will never free the data structures it
created.<br>
>>>><br>
>>>> The usual pattern for these
non-blocking communications is to allocate an array
for the requests of length nsend+nrecv and then do:<br>
>>>><br>
>>>> for i in nsend:<br>
>>>> call MPI_Isend(..., req[i],
ierr)<br>
>>>> for j in nrecv:<br>
>>>> call MPI_Irecv(...,
req[nsend+j], ierr)<br>
>>>><br>
>>>> call MPI_Waitall(req, ..., ierr)<br>
>>>><br>
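>>>> Filled out a little (shown with the C bindings purely for
illustration; the buffers, counts, tags, and neighbour ranks are
placeholders), that pattern is roughly:<br>
<pre>
/* Hedged sketch of the Isend/Irecv/Waitall pattern; the buffers, counts,
   tag, and neighbour ranks are placeholders for the application's own. */
#include <mpi.h>
#include <stdlib.h>

void exchange(MPI_Comm comm, int nsend, int nrecv,
              double **sendbuf, const int *sendcount, const int *senddest,
              double **recvbuf, const int *recvcount, const int *recvsrc,
              int tag)
{
  MPI_Request *reqs = malloc((size_t)(nsend + nrecv) * sizeof(*reqs));
  int i, j;

  for (i = 0; i < nsend; i++)
    MPI_Isend(sendbuf[i], sendcount[i], MPI_DOUBLE, senddest[i], tag, comm, &reqs[i]);
  for (j = 0; j < nrecv; j++)
    MPI_Irecv(recvbuf[j], recvcount[j], MPI_DOUBLE, recvsrc[j], tag, comm, &reqs[nsend + j]);

  /* Completing every request lets the MPI library free the internal
     state it created for the Isends and Irecvs. */
  MPI_Waitall(nsend + nrecv, reqs, MPI_STATUSES_IGNORE);
  free(reqs);
}
</pre>
>>>><br>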
>>>> I note also that there's no need for the
Barrier at the end of the routine; this kind of
communication does neighbourwise synchronisation, so there is no
need to add (unnecessary) global synchronisation
too.<br>
>>>><br>
>>>> As an aside, is there a reason you
don't use PETSc's VecScatter to manage this global
to local exchange?<br>
>>>><br>
>>>> Cheers,<br>
>>>><br>
>>>> Lawrence<br>
>>>
<cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png><br>
<br>
</blockquote>
</div>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>