[petsc-users] Memory growth issue

Sanjay Govindjee s_g at berkeley.edu
Mon Jun 3 18:35:20 CDT 2019


Junchao,
   It won't be feasible to share the code, but I will run a similar test
to the one you have done (large problem); I will try with both MPICH and
OpenMPI.  I also agree that deltas are not ideal, since they do not
account for latency in the freeing of memory, etc.  But I will note that
when we hit the memory growth issue, latency associated with free()
appears not to be in play, since the total memory footprint grows
monotonically.

   I'll also have a look at massif.  If you figure out the interface and
can send me the lines to instrument the code with, that will save me
some time.
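   (From a first look at the valgrind documentation, my impression is
that massif does not actually need any lines added to the code; one
just runs the executable under valgrind and post-processes its output,
roughly like

     mpirun -n 4 valgrind --tool=massif ./executable <usual options>
     ms_print massif.out.<pid>

with one massif.out.<pid> file produced per rank ("./executable" here
is just a stand-in for whichever binary we test).  I have not tried
this yet, so the exact invocation may need adjusting.)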
-sanjay

On 6/3/19 3:17 PM, Zhang, Junchao wrote:
> Sanjay & Barry,
>   Sorry, I made a mistake when I said I could reproduce Sanjay's
> experiments. I found that 1) to correctly use
> PetscMallocGetCurrentUsage() when PETSc is configured without
> debugging, I have to add -malloc when running the program; 2) I have
> to instrument the code outside of KSPSolve(). In my case, that is in
> SNESSolve_NEWTONLS. In the old experiments I did it inside KSPSolve();
> since KSPSolve() can recursively call KSPSolve(), the old results were
> misleading.
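>  (Concretely, the instrumentation is just a small sketch like the
> following wrapped around the outer solve inside the solver routine;
> the solve call's arguments and the reduction/print are only
> illustrative here, not the exact code I used, and I omit the "RSS
> End" column.)
>
>     PetscLogDouble rss0, rss1, mal0, mal1, delta[2];
>     PetscErrorCode ierr;
>     ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
>     ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);
>     ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* the existing solve call */
>     ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
>     ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
>     delta[0] = rss1 - rss0;  /* RSS delta on this rank    */
>     delta[1] = mal1 - mal0;  /* malloc delta on this rank */
>     /* PetscLogDouble is a double, so sum the per-rank deltas */
>     ierr = MPI_Allreduce(MPI_IN_PLACE, delta, 2, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
>     ierr = PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g\n", delta[0], delta[1]);CHKERRQ(ierr);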
>  With these fixes, I measured the differences in RSS and PETSc malloc
> before/after KSPSolve. I ran the experiments on a MacBook
> using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
> with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64
> -ts_type beuler -ts_max_steps 500 -malloc.
>  I find that if the grid size is small, I see a non-zero RSS delta
> randomly, with either one MPI rank or multiple ranks, and with either
> MPICH or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256
> -da_grid_y 256, I only see non-zero RSS deltas randomly in the first
> few iterations (with MPICH or OpenMPI). When the computer's workload
> is high because ex5-openmpi and ex5-mpich run simultaneously, the
> MPICH run shows many more non-zero RSS deltas. But the "Malloc Delta"
> behavior is stable across all runs: there is only one non-zero malloc
> delta value, in the first KSPSolve call; all the remaining ones are
> zero. Something like this:
>
>     mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type
>     beuler -ts_max_steps 500 -malloc
>     RSS Delta= 32489472, Malloc Delta= 26290304, RSS End= 136114176
>     RSS Delta=    32768, Malloc Delta=        0, RSS End= 138510336
>     RSS Delta=        0, Malloc Delta=        0, RSS End= 138522624
>     RSS Delta=        0, Malloc Delta=        0, RSS End= 138539008
>
> So I think I can conclude there is no unfreed memory allocated by
> PETSc in KSPSolve().  Has MPICH allocated unfreed memory in KSPSolve()?
> That is possible, and I am trying to find a way, analogous to
> PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS
> delta is not a good way to measure memory allocation. It is dynamic
> and depends on the state of the computer (swap, shared libraries
> loaded, etc.) when running the code. We should focus on malloc
> instead.  If there were a valgrind-style tool, akin to the performance
> profiling tools, that let users measure memory allocated but not freed
> in a user-specified code segment, that would be very helpful in this
> case. But I have not found one.
>
> Sanjay, did you say that currently you can run with OpenMPI without
> running out of memory, but with MPICH you run out of memory? Is it
> feasible to share your code so that I can test with it? Thanks.
>
> --Junchao Zhang
>
> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s_g at berkeley.edu 
> <mailto:s_g at berkeley.edu>> wrote:
>
>     Barry,
>
>     If you look at the graphs I generated (on my Mac), you will see that
>     OpenMPI and MPICH have very different values, along with the fact
>     that MPICH does not seem to adhere to the standard for releasing
>     MPI_Isend resources following an MPI_Wait.
>
>     -sanjay
>
>     PS: I agree with Barry's assessment; this is really not acceptable.
>
>     On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>     >    Junchao,
>     >
>     >       This is insane. Either the OpenMPI library or something in
>     the OS underneath it related to sockets and interprocess
>     communication is grabbing additional space for each round of MPI
>     communication!  Does MPICH have the same values as OpenMPI, or
>     different ones? When you run on Linux, do you get the same values
>     as on Apple, or different ones? --- The same values would seem to
>     indicate the issue is inside OpenMPI/MPICH; different values would
>     indicate the problem is more likely at the OS level. Does this
>     happen only with the default VecScatter that uses blocking MPI?
>     What happens with PetscSF under Vec? Is it somehow related to
>     PETSc's use of nonblocking sends and receives? One could
>     presumably use valgrind to see exactly which lines in which code
>     are causing these increases. I don't think we can just shrug and
>     say this is the way it is; we need to track down and understand
>     the cause (and if possible fix it).
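>     (For instance, something roughly along the lines of
>
>     mpiexec -n 2 valgrind --leak-check=full --show-leak-kinds=all ./ex5 <options>
>
>     would at least attribute whatever is still allocated at exit to
>     the code that allocated it; I have not run this for this case, so
>     treat the exact valgrind options as a suggestion only.)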
>     >
>     >    Barry
>     >
>     >
>     >> On May 31, 2019, at 2:53 PM, Zhang, Junchao
>     <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>> wrote:
>     >>
>     >> Sanjay,
>     >> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted
>     PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the
>     beginning and end of KSPSolve, computed the deltas, and summed
>     them over the processes. Then I tested with
>     src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>     >> With OpenMPI,
>     >> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler
>     -ts_max_steps 500 > 128.log
>     >> grep -n -v "RSS Delta=         0, Malloc Delta=      0" 128.log
>     >> 1:RSS Delta=     69632, Malloc Delta=         0
>     >> 2:RSS Delta=     69632, Malloc Delta=         0
>     >> 3:RSS Delta=     69632, Malloc Delta=         0
>     >> 4:RSS Delta=     69632, Malloc Delta=         0
>     >> 9:RSS Delta=9.25286e+06, Malloc Delta=         0
>     >> 22:RSS Delta=     49152, Malloc Delta=         0
>     >> 44:RSS Delta=     20480, Malloc Delta=         0
>     >> 53:RSS Delta=     49152, Malloc Delta=         0
>     >> 66:RSS Delta=      4096, Malloc Delta=         0
>     >> 97:RSS Delta=     16384, Malloc Delta=         0
>     >> 119:RSS Delta=     20480, Malloc Delta=         0
>     >> 141:RSS Delta=     53248, Malloc Delta=         0
>     >> 176:RSS Delta=     16384, Malloc Delta=         0
>     >> 308:RSS Delta=     16384, Malloc Delta=         0
>     >> 352:RSS Delta=     16384, Malloc Delta=         0
>     >> 550:RSS Delta=     16384, Malloc Delta=         0
>     >> 572:RSS Delta=     16384, Malloc Delta=         0
>     >> 669:RSS Delta=     40960, Malloc Delta=         0
>     >> 924:RSS Delta=     32768, Malloc Delta=         0
>     >> 1694:RSS Delta=     20480, Malloc Delta=         0
>     >> 2099:RSS Delta=     16384, Malloc Delta=         0
>     >> 2244:RSS Delta=     20480, Malloc Delta=         0
>     >> 3001:RSS Delta=     16384, Malloc Delta=         0
>     >> 5883:RSS Delta=     16384, Malloc Delta=         0
>     >>
>     >> If I increase the grid size,
>     >> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler
>     -ts_max_steps 500 -malloc_test >512.log
>     >> grep -n -v "RSS Delta=         0, Malloc Delta=      0" 512.log
>     >> 1:RSS Delta=1.05267e+06, Malloc Delta=         0
>     >> 2:RSS Delta=1.05267e+06, Malloc Delta=         0
>     >> 3:RSS Delta=1.05267e+06, Malloc Delta=         0
>     >> 4:RSS Delta=1.05267e+06, Malloc Delta=         0
>     >> 13:RSS Delta=1.24932e+08, Malloc Delta=         0
>     >>
>     >> So we did see the RSS increase in 4K-page-sized increments
>     after KSPSolve. As long as there are no memory leaks, why do you
>     care about it? Is it because you run out of memory?
>     >>
>     >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F.
>     <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>> wrote:
>     >>
>     >>     Thanks for the update. So the current conclusions are that
>     using the Waitall in your code
>     >>
>     >> 1) solves the memory issue with OpenMPI in your code,
>     >>
>     >> 2) does not solve the memory issue with PETSc KSPSolve, and
>     >>
>     >> 3) MPICH has memory issues both for your code and for PETSc
>     KSPSolve, despite the Waitall fix?
>     >>
>     >> If you literally just comment out the call to KSPSolve() with
>     OpenMPI, is there no growth in memory usage?
>     >>
>     >>
>     >> Both 2 and 3 are concerning; they indicate possible memory leak
>     bugs in MPICH, and possibly that KSPSolve() is not freeing all of
>     its MPI resources.
>     >>
>     >> Junchao, can you please investigate 2 and 3 with, for example,
>     a TS example that uses the linear solver (like with -ts_type
>     beuler)? Thanks
>     >>
>     >>
>     >>    Barry
>     >>
>     >>
>     >>
>     >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee
>     <s_g at berkeley.edu <mailto:s_g at berkeley.edu>> wrote:
>     >>>
>     >>> Lawrence,
>     >>> Thanks for taking a look!  This is what I had been wondering
>     about -- my knowledge of MPI is pretty minimal, and
>     >>> the routine originated with a programmer we hired a
>     decade-plus back from NERSC.  I'll have to look into
>     >>> VecScatter.  It will be great to dispense with our
>     roll-your-own routines (we even have our own reduceALL scattered
>     around the code).
>     >>>
>     >>> Interestingly, the MPI_Waitall has solved the problem when
>     using OpenMPI, but it still persists with MPICH.  Graphs attached.
>     >>> I'm going to run with OpenMPI for now (but I guess I really
>     still need to figure out what is wrong with MPICH and Waitall;
>     >>> I'll try Barry's suggestion of
>     --download-mpich-configure-arguments="--enable-error-messages=all
>     --enable-g" later today and report back).
>     >>>
>     >>> Regarding MPI_Barrier, it was put in due to a problem where
>     some processes were finishing their sends and receives and exiting
>     the subroutine
>     >>> before the receiving processes had completed (which resulted
>     in data loss, as the buffers are freed after the call to the
>     routine). MPI_Barrier was the solution proposed
>     >>> to us.  I don't think I can dispense with it, but I will think
>     about it some more.
>     >>>
>     >>> I'm not so sure about using MPI_Irecv, as it will require a bit
>     of rewriting, since right now I process the received
>     >>> data sequentially after each blocking MPI_Recv -- clearly
>     slower but easier to code.
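>     >>> (If I did rewrite it, I suppose I could post all of the
>     >>> MPI_Irecv calls up front and then process each message as it
>     >>> completes, roughly like the C-style sketch below -- the buffer,
>     >>> count, and source arrays are just stand-ins for whatever my
>     >>> routine actually uses:
>     >>>
>     >>>   MPI_Request reqs[NRECV];
>     >>>   int         i, k, idx;
>     >>>   for (i = 0; i < NRECV; i++)    /* post all receives at once */
>     >>>     MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE, sources[i],
>     >>>               tag, comm, &reqs[i]);
>     >>>   for (k = 0; k < NRECV; k++) {  /* handle them in completion order */
>     >>>     MPI_Waitany(NRECV, reqs, &idx, MPI_STATUS_IGNORE);
>     >>>     /* process recvbuf[idx] here */
>     >>>   }
>     >>>
>     >>> but that is exactly the kind of restructuring I was referring to.)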
>     >>>
>     >>> Thanks again for the help.
>     >>>
>     >>> -sanjay
>     >>>
>     >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>     >>>> Hi Sanjay,
>     >>>>
>     >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users
>     <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>> wrote:
>     >>>>>
>     >>>>> The problem seems to persist but with a different
>     signature.  Graphs attached as before.
>     >>>>>
>     >>>>> Totals with MPICH (NB: single run)
>     >>>>>
>     >>>>> For the CG/Jacobi data_exchange_total = 41,385,984;
>     kspsolve_total = 38,289,408
>     >>>>> For the GMRES/BJACOBI data_exchange_total = 41,324,544;
>     kspsolve_total = 41,324,544
>     >>>>>
>     >>>>> Just reading the MPI docs, I am wondering if I need some sort
>     of MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange
>     routine?
>     >>>>> I would have thought that with the blocking receives and the
>     MPI_Barrier everything would have fully completed and been cleaned
>     up before
>     >>>>> all processes exited the routine, but perhaps I am wrong
>     about that.
>     >>>> Skimming the Fortran code you sent, you do:
>     >>>>
>     >>>> for i in ...:
>     >>>>     call MPI_Isend(..., req, ierr)
>     >>>>
>     >>>> for i in ...:
>     >>>>     call MPI_Recv(..., ierr)
>     >>>>
>     >>>> But you never call MPI_Wait on the request you got back from
>     the Isend. So the MPI library will never free the data structures
>     it created.
>     >>>>
>     >>>> The usual pattern for these non-blocking communications is to
>     allocate an array for the requests of length nsend+nrecv and then do:
>     >>>>
>     >>>> for i in nsend:
>     >>>>     call MPI_Isend(..., req[i], ierr)
>     >>>> for j in nrecv:
>     >>>>     call MPI_Irecv(..., req[nsend+j], ierr)
>     >>>>
>     >>>> call MPI_Waitall(nsend+nrecv, req, MPI_STATUSES_IGNORE, ierr)
>     >>>>
>     >>>> I note also that there's no need for the Barrier at the end of
>     the routine; this kind of communication does neighbourwise
>     synchronisation, so there is no need to add (unnecessary) global
>     synchronisation on top of it.
>     >>>>
>     >>>> As an aside, is there a reason you don't use PETSc's
>     VecScatter to manage this global-to-local exchange?
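>     >>>> (A minimal C sketch of what I mean, with made-up names for the
>     >>>> global vector, the local work vector, and the ghost index list;
>     >>>> the scatter is created once and reused for every exchange:
>     >>>>
>     >>>>   IS         from, to;
>     >>>>   VecScatter ctx;
>     >>>>   /* global indices we need, placed into local positions 0..nghost-1 */
>     >>>>   ISCreateGeneral(PETSC_COMM_WORLD, nghost, ghost_global_idx, PETSC_COPY_VALUES, &from);
>     >>>>   ISCreateStride(PETSC_COMM_SELF, nghost, 0, 1, &to);
>     >>>>   VecScatterCreate(xglobal, from, xlocal, to, &ctx);
>     >>>>   /* every exchange is then just: */
>     >>>>   VecScatterBegin(ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);
>     >>>>   VecScatterEnd(ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);
>     >>>>   /* ... and VecScatterDestroy(&ctx) once at the end. */
>     >>>>
>     >>>> PETSc then manages the underlying MPI requests for you.)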
>     >>>>
>     >>>> Cheers,
>     >>>>
>     >>>> Lawrence
>     >>>
>     <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
>
