[petsc-users] Memory growth issue
Sanjay Govindjee
s_g at berkeley.edu
Wed Jun 5 01:18:49 CDT 2019
Junchao,
Attached is a graph of total RSS from my Mac using OpenMPI and MPICH
(installed with --download-openmpi and --download-mpich).
The difference is pretty stark! The MPI_Waitall() in my part of the
code fixed the runaway memory problem with OpenMPI, but definitely not
with MPICH.
Tomorrow I hope to get my Linux box set up; unfortunately it needs an
OS update :(
Then I can try to reproduce the behavior there (or find out it is a
Mac quirk, though the reason I started looking at this was that a user
on an HPC system pointed it out to me).
-sanjay
PS: To generate the data, all I did was place a call to
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an
MPI_Allreduce( ) to sum across the job (4 processors).
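
For reference, a minimal sketch of that instrumentation in C (the ksp, b, x
objects and the error-checking style are placeholders, not the actual code):

    PetscLogDouble rss, rss_total;

    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
    /* resident set size of this rank, in bytes */
    ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);
    /* sum over all ranks in the job */
    ierr = MPI_Allreduce(&rss,&rss_total,1,MPI_DOUBLE,MPI_SUM,PETSC_COMM_WORLD);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,"Total RSS after KSPSolve: %g bytes\n",rss_total);CHKERRQ(ierr);
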
On 6/4/19 4:27 PM, Zhang, Junchao wrote:
> Hi, Sanjay,
> I managed to use Valgrind massif + MPICH master + PETSc master. I
> ran ex5 for 500 time steps with "mpirun -n 4 valgrind --tool=massif
> --max-snapshots=200 --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y
> 512 -ts_type beuler -ts_max_steps 500 -malloc"
> I visualized the output with massif-visualizer. From the attached
> picture, we can see that the total heap size stays constant most of the
> time and is NOT monotonically increasing. We can also see that MPI only
> allocated memory at initialization time and kept it afterwards. So it is
> unlikely that MPICH keeps allocating memory in each KSPSolve call.
> From the graphs you sent, I can only see that RSS increases randomly
> after KSPSolve, but that does not mean the heap size keeps increasing. I
> recommend you also profile your code with valgrind massif and
> visualize it. I failed to install massif-visualizer on my MacBook and
> on CentOS, but I easily got it installed on Ubuntu.
> I want you to confirm that with the MPI_Waitall fix, you still run
> out of memory with MPICH (but not OpenMPI). If needed, I can hack
> MPICH to get its current memory usage so that we can calculate its
> difference after each KSPSolve call.
> massif-ex5.png
>
>
> --Junchao Zhang
>
>
> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee <s_g at berkeley.edu> wrote:
>
> Junchao,
> It won't be feasible to share the code, but I will run a similar
> test to the one you have done (a large problem); I will
> try with both MPICH and OpenMPI. I also agree that deltas are not
> ideal, as they do not account for latency in the freeing of memory,
> etc. But I will note that when we have the memory growth issue, latency
> associated with free( ) appears not to be in play, since the total
> memory footprint grows monotonically.
>
> I'll also have a look at massif. If you figure out the
> interface and can send me the lines to instrument the code with,
> that will save me some time.
> -sanjay
> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>> Sanjay & Barry,
>> Sorry, I made a mistake when I said I could reproduce Sanjay's
>> experiments. I found that 1) to correctly use
>> PetscMallocGetCurrentUsage() when PETSc is configured without
>> debugging, I have to run the program with -malloc, and 2) I have to
>> instrument the code outside of KSPSolve(); in my case, that is
>> in SNESSolve_NEWTONLS. In the old experiments, I did it inside
>> KSPSolve. Since KSPSolve can call itself recursively, the old
>> results were misleading.
>> With these fixes, I measured the differences in RSS and PETSc malloc
>> before/after KSPSolve. I did experiments on my MacBook
>> using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64
>> -ts_type beuler -ts_max_steps 500 -malloc.
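>>
>> Roughly, the instrumentation amounts to the following sketch placed
>> around the KSPSolve call (ksp, b, x are placeholders); again, -malloc
>> is needed for PetscMallocGetCurrentUsage() in a non-debug build:
>>
>>    PetscLogDouble rss0,rss1,mal0,mal1;
>>
>>    ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
>>    ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);
>>    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>>    ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
>>    ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
>>    ierr = PetscPrintf(PETSC_COMM_SELF,"RSS Delta= %g, Malloc Delta= %g, RSS End= %g\n",rss1-rss0,mal1-mal0,rss1);CHKERRQ(ierr);
>>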
>> I find that if the grid size is small, I see a non-zero RSS delta
>> randomly, either with one MPI rank or with multiple ranks, with MPICH
>> or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256
>> -da_grid_y 256, I only see non-zero RSS deltas randomly in the first
>> few iterations (with MPICH or OpenMPI). When the computer's workload
>> is high, e.g., when running ex5-openmpi and ex5-mpich simultaneously,
>> the MPICH run shows many more non-zero RSS deltas. But the "Malloc
>> Delta" behavior is stable across all runs: there is only one nonzero
>> malloc delta, in the first KSPSolve call, and all the rest are zero.
>> Something like this:
>>
>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500 -malloc
>> RSS Delta= 32489472, Malloc Delta= 26290304, RSS End= 136114176
>> RSS Delta=    32768, Malloc Delta=        0, RSS End= 138510336
>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138522624
>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138539008
>>
>> So I think I can conclude there is no unfreed memory allocated by
>> PETSc in KSPSolve(). Has MPICH allocated unfreed memory in KSPSolve?
>> That is possible, and I am trying to find a way, analogous to
>> PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS
>> delta is not a good way to measure memory allocation. It is dynamic
>> and depends on the state of the computer (swap, shared libraries
>> loaded, etc.) when running the code. We should focus on malloc
>> instead. If there were a valgrind tool, analogous to the performance
>> profiling tools, that let users measure memory allocated but not
>> freed in a user-specified code segment, that would be very helpful in
>> this case. But I have not found one.
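>>
>> (One rough, Linux-only idea: the glibc allocator exposes its own
>> bookkeeping, which counts every heap allocation in the process,
>> including the MPI library's. A sketch, not usable on macOS:
>>
>>    #include <malloc.h>   /* glibc-specific */
>>
>>    struct mallinfo mi0 = mallinfo();   /* use mallinfo2() on glibc >= 2.33 to avoid int overflow */
>>    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>>    struct mallinfo mi1 = mallinfo();
>>    /* unlike PetscMallocGetCurrentUsage(), this delta includes the MPI library's own allocations */
>>    printf("heap delta = %ld bytes\n",(long)mi1.uordblks - (long)mi0.uordblks);
>>
>> This is only a sketch under those assumptions, not something I have run.)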
>>
>> Sanjay, did you say that currently you can run with OpenMPI without
>> running out of memory, but with MPICH you run out of memory? Is it
>> feasible to share your code so that I can test with it? Thanks.
>>
>> --Junchao Zhang
>>
>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s_g at berkeley.edu> wrote:
>>
>> Barry,
>>
>> If you look at the graphs I generated (on my Mac), you will see that
>> OpenMPI and MPICH have very different values, along with the fact that
>> MPICH does not seem to adhere to the standard for releasing MPI_Isend
>> resources following an MPI_Wait.
>>
>> -sanjay
>>
>> PS: I agree with Barry's assessment; this is really not acceptable.
>>
>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>> > Junchao,
>> >
>> > This is insane. Either the OpenMPI library or something in the OS
>> > underneath related to sockets and interprocess communication is
>> > grabbing additional space for each round of MPI communication! Does
>> > MPICH have the same values or different values than OpenMPI? When you
>> > run on Linux, do you get the same values as on Apple or different
>> > ones? Same values would indicate the issue is inside OpenMPI/MPICH;
>> > different values would indicate the problem is more likely at the OS
>> > level. Does this happen only with the default VecScatter that uses
>> > blocking MPI? What happens with PetscSF under Vec? Is it somehow
>> > related to PETSc's use of nonblocking sends and receives? One could
>> > presumably use valgrind to see exactly what lines in what code are
>> > causing these increases. I don't think we can just shrug and say this
>> > is the way it is; we need to track down and understand the cause (and
>> > if possible fix it).
>> >
>> > Barry
>> >
>> >
>> >> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>> >>
>> >> Sanjay,
>> >> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted
>> >> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the
>> >> beginning and end of KSPSolve, computed the deltas, and
>> >> summed them over processes. Then I tested with
>> >> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> >> With OpenMPI,
>> >> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 > 128.log
>> >> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
>> >> 1:RSS Delta= 69632, Malloc Delta= 0
>> >> 2:RSS Delta= 69632, Malloc Delta= 0
>> >> 3:RSS Delta= 69632, Malloc Delta= 0
>> >> 4:RSS Delta= 69632, Malloc Delta= 0
>> >> 9:RSS Delta=9.25286e+06, Malloc Delta= 0
>> >> 22:RSS Delta= 49152, Malloc Delta= 0
>> >> 44:RSS Delta= 20480, Malloc Delta= 0
>> >> 53:RSS Delta= 49152, Malloc Delta= 0
>> >> 66:RSS Delta= 4096, Malloc Delta= 0
>> >> 97:RSS Delta= 16384, Malloc Delta= 0
>> >> 119:RSS Delta= 20480, Malloc Delta= 0
>> >> 141:RSS Delta= 53248, Malloc Delta= 0
>> >> 176:RSS Delta= 16384, Malloc Delta= 0
>> >> 308:RSS Delta= 16384, Malloc Delta= 0
>> >> 352:RSS Delta= 16384, Malloc Delta= 0
>> >> 550:RSS Delta= 16384, Malloc Delta= 0
>> >> 572:RSS Delta= 16384, Malloc Delta= 0
>> >> 669:RSS Delta= 40960, Malloc Delta= 0
>> >> 924:RSS Delta= 32768, Malloc Delta= 0
>> >> 1694:RSS Delta= 20480, Malloc Delta= 0
>> >> 2099:RSS Delta= 16384, Malloc Delta= 0
>> >> 2244:RSS Delta= 20480, Malloc Delta= 0
>> >> 3001:RSS Delta= 16384, Malloc Delta= 0
>> >> 5883:RSS Delta= 16384, Malloc Delta= 0
>> >>
>> >> If I increased the grid
>> >> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc_test >512.log
>> >> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
>> >> 1:RSS Delta=1.05267e+06, Malloc Delta= 0
>> >> 2:RSS Delta=1.05267e+06, Malloc Delta= 0
>> >> 3:RSS Delta=1.05267e+06, Malloc Delta= 0
>> >> 4:RSS Delta=1.05267e+06, Malloc Delta= 0
>> >> 13:RSS Delta=1.24932e+08, Malloc Delta= 0
>> >>
>> >> So we did see the RSS increase in 4K-page-sized increments after
>> >> KSPSolve. As long as there are no memory leaks, why do you care
>> >> about it? Is it because you run out of memory?
>> >>
>> >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> >>
>> >> Thanks for the update. So the current conclusions are
>> that using the Waitall in your code
>> >>
>> >> 1) solves the memory issue with OpenMPI in your code
>> >>
>> >> 2) does not solve the memory issue with PETSc KSPSolve
>> >>
>> >> 3) MPICH has memory issues both for your code and PETSc
>> >> KSPSolve (despite the Waitall fix)?
>> >>
>> >> If you literally just comment out the call to KSPSolve(), with
>> >> OpenMPI is there no growth in memory usage?
>> >>
>> >>
>> >> Both 2 and 3 are concerning; they indicate possible memory-leak
>> >> bugs in MPICH and MPI resources not being freed in KSPSolve().
>> >>
>> >> Junchao, can you please investigate 2 and 3 with, for
>> example, a TS example that uses the linear solver (like with
>> -ts_type beuler)? Thanks
>> >>
>> >>
>> >> Barry
>> >>
>> >>
>> >>
>> >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s_g at berkeley.edu> wrote:
>> >>>
>> >>> Lawrence,
>> >>> Thanks for taking a look! This is what I had been
>> >>> wondering about -- my knowledge of MPI is pretty minimal, and
>> >>> the origins of the routine go back to a programmer we
>> >>> hired a decade+ ago from NERSC. I'll have to look into
>> >>> VecScatter. It will be great to dispense with our
>> >>> roll-your-own routines (we even have our own reduceALL
>> >>> scattered around the code).
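>> >>>
>> >>> (For what it's worth, a VecScatter-based version of such a
>> >>> global-to-local exchange would look roughly like the sketch below;
>> >>> xglobal, xlocal and the index sets are placeholders for our own
>> >>> numbering, not actual code.
>> >>>
>> >>>    VecScatter ctx;
>> >>>    ierr = VecScatterCreate(xglobal,isglobal,xlocal,islocal,&ctx);CHKERRQ(ierr);
>> >>>    ierr = VecScatterBegin(ctx,xglobal,xlocal,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>> >>>    ierr = VecScatterEnd(ctx,xglobal,xlocal,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>> >>>    ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
>> >>>
>> >>> The scatter context could be created once and reused for every
>> >>> exchange, which would also avoid repeated setup cost.)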
>> >>>
>> >>> Interestingly, the MPI_Waitall has solved the problem
>> >>> when using OpenMPI, but it still persists with MPICH. Graphs
>> >>> attached.
>> >>> I'm going to run with OpenMPI for now (but I guess I
>> >>> really still need to figure out what is wrong with MPICH and
>> >>> Waitall;
>> >>> I'll try Barry's suggestion of
>> >>> --download-mpich-configure-arguments="--enable-error-messages=all
>> >>> --enable-g" later today and report back).
>> >>>
>> >>> Regarding MPI_Barrier, it was put in due to a problem where
>> >>> some processes were finishing their sends and receives and
>> >>> exiting the subroutine
>> >>> before the receiving processes had completed (which
>> >>> resulted in data loss, as the buffers are freed after the call
>> >>> to the routine). MPI_Barrier was the solution proposed
>> >>> to us. I don't think I can dispense with it, but I will
>> >>> think about it some more.
>> >>>
>> >>> I'm not so sure about using MPI_Irecv, as it will require
>> >>> a bit of rewriting, since right now I process the received
>> >>> data sequentially after each blocking MPI_Recv -- clearly
>> >>> slower but easier to code.
>> >>>
>> >>> Thanks again for the help.
>> >>>
>> >>> -sanjay
>> >>>
>> >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>> >>>> Hi Sanjay,
>> >>>>
>> >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via
>> >>>>> petsc-users <petsc-users at mcs.anl.gov> wrote:
>> >>>>>
>> >>>>> The problem seems to persist but with a different
>> signature. Graphs attached as before.
>> >>>>>
>> >>>>> Totals with MPICH (NB: single run)
>> >>>>>
>> >>>>> For the CG/Jacobi data_exchange_total = 41,385,984;
>> kspsolve_total = 38,289,408
>> >>>>> For the GMRES/BJACOBI data_exchange_total = 41,324,544;
>> kspsolve_total = 41,324,544
>> >>>>>
>> >>>>> Just reading the MPI docs, I am wondering if I need some
>> >>>>> sort of MPI_Wait/MPI_Waitall before my MPI_Barrier in the
>> >>>>> data exchange routine?
>> >>>>> I would have thought that with the blocking receives
>> >>>>> and the MPI_Barrier everything would have fully completed
>> >>>>> and been cleaned up before
>> >>>>> all processes exited the routine, but perhaps I am
>> >>>>> wrong about that.
>> >>>> Skimming the Fortran code you sent, you do:
>> >>>>
>> >>>> for i in ...:
>> >>>> call MPI_Isend(..., req, ierr)
>> >>>>
>> >>>> for i in ...:
>> >>>> call MPI_Recv(..., ierr)
>> >>>>
>> >>>> But you never call MPI_Wait on the request you got back
>> from the Isend. So the MPI library will never free the data
>> structures it created.
>> >>>>
>> >>>> The usual pattern for these non-blocking communications
>> is to allocate an array for the requests of length
>> nsend+nrecv and then do:
>> >>>>
>> >>>> for i in nsend:
>> >>>> call MPI_Isend(..., req[i], ierr)
>> >>>> for j in nrecv:
>> >>>> call MPI_Irecv(..., req[nsend+j], ierr)
>> >>>>
>> >>>> call MPI_Waitall(req, ..., ierr)
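>> >>>>
>> >>>> In C the same pattern would look roughly like this sketch (your
>> >>>> routine is Fortran, but the calls map one-to-one; the buffers,
>> >>>> counts, ranks and tag are placeholders):
>> >>>>
>> >>>>    MPI_Request *req = (MPI_Request*)malloc((nsend+nrecv)*sizeof(MPI_Request));
>> >>>>    for (i = 0; i < nsend; i++)
>> >>>>      MPI_Isend(sendbuf[i],scount[i],MPI_DOUBLE,sdest[i],tag,comm,&req[i]);
>> >>>>    for (j = 0; j < nrecv; j++)
>> >>>>      MPI_Irecv(recvbuf[j],rcount[j],MPI_DOUBLE,rsrc[j],tag,comm,&req[nsend+j]);
>> >>>>    /* Waitall completes every send and receive and lets MPI free its request state */
>> >>>>    MPI_Waitall(nsend+nrecv,req,MPI_STATUSES_IGNORE);
>> >>>>    free(req);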
>> >>>>
>> >>>> I note also that there's no need for the Barrier at the end
>> >>>> of the routine: this kind of communication does neighbourwise
>> >>>> synchronisation, so there is no need to add (unnecessary) global
>> >>>> synchronisation too.
>> >>>>
>> >>>> As an aside, is there a reason you don't use PETSc's
>> VecScatter to manage this global to local exchange?
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Lawrence
>> >>>
>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
>>
>