[petsc-users] Memory growth issue

Matthew Knepley knepley at gmail.com
Mon Jun 3 18:17:18 CDT 2019


On Mon, Jun 3, 2019 at 6:56 PM Zhang, Junchao via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> On Mon, Jun 3, 2019 at 5:23 PM Stefano Zampini <stefano.zampini at gmail.com>
> wrote:
>
>>
>>
>> On Jun 4, 2019, at 1:17 AM, Zhang, Junchao via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>> Sanjay & Barry,
>>   Sorry, I made a mistake when I said I could reproduce Sanjay's
>> experiments. I found that 1) to correctly use PetscMallocGetCurrentUsage()
>> when PETSc is configured without debugging, I have to run the program
>> with -malloc, and 2) I have to instrument the code outside of KSPSolve();
>> in my case, in SNESSolve_NEWTONLS. In the old experiments I did it inside
>> KSPSolve, and since KSPSolve can call KSPSolve recursively, those results
>> were misleading.
>>  With these fixes, I measured the differences in RSS and PETSc malloc'd
>> memory before/after KSPSolve. I ran experiments on a MacBook
>> using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with
>> commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler
>> -ts_max_steps 500 -malloc.
>>  I find that if the grid size is small, I see a non-zero RSS delta at
>> random, with one MPI rank or with multiple ranks, and with either MPICH or
>> OpenMPI. If I increase the grid size, e.g., -da_grid_x 256 -da_grid_y 256,
>> I only see non-zero RSS deltas, again at random, in the first few
>> iterations (with MPICH or OpenMPI). When the machine is heavily loaded,
>> e.g., by running ex5-openmpi and ex5-mpich simultaneously, the MPICH run
>> shows many more non-zero RSS deltas. But the "Malloc Delta" behavior is
>> stable across all runs: there is only one nonzero malloc delta value, in
>> the first KSPSolve call, and all the remaining ones are zero. Something
>> like this:
>>
>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler
>> -ts_max_steps 500 -malloc
>> RSS Delta=       32489472, Malloc Delta=       26290304, RSS End=  136114176
>> RSS Delta=          32768, Malloc Delta=              0, RSS End=  138510336
>> RSS Delta=              0, Malloc Delta=              0, RSS End=  138522624
>> RSS Delta=              0, Malloc Delta=              0, RSS End=  138539008
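>>
>> The instrumentation is essentially the following sketch (error checking
>> omitted, and the variable names are illustrative rather than the exact
>> code I used): bracket the KSPSolve call, take the differences, and sum
>> them over processes.
>>
>> PetscLogDouble rss0, rss1, mal0, mal1, rssDelta, malDelta;
>> PetscLogDouble rssSum, malSum, rssEnd;
>> PetscMemoryGetCurrentUsage(&rss0);  /* resident set size (bytes) before    */
>> PetscMallocGetCurrentUsage(&mal0);  /* bytes currently held by PetscMalloc */
>> KSPSolve(ksp, b, x);
>> PetscMemoryGetCurrentUsage(&rss1);
>> PetscMallocGetCurrentUsage(&mal1);
>> rssDelta = rss1 - rss0;
>> malDelta = mal1 - mal0;
>> /* sum the per-process deltas (and the final RSS) onto rank 0 and print */
>> MPI_Reduce(&rssDelta, &rssSum, 1, MPI_DOUBLE, MPI_SUM, 0, PETSC_COMM_WORLD);
>> MPI_Reduce(&malDelta, &malSum, 1, MPI_DOUBLE, MPI_SUM, 0, PETSC_COMM_WORLD);
>> MPI_Reduce(&rss1, &rssEnd, 1, MPI_DOUBLE, MPI_SUM, 0, PETSC_COMM_WORLD);
>> PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%15.0f, Malloc Delta=%15.0f, "
>>             "RSS End=%15.0f\n", rssSum, malSum, rssEnd);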
>>
>> So I think I can conclude that there is no unfreed memory allocated by
>> PETSc in KSPSolve(). Has MPICH allocated unfreed memory in KSPSolve? That
>> is possible, and I am trying to find a way, analogous to
>> PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS delta
>> is not a good way to measure memory allocation: it is dynamic and depends
>> on the state of the computer (swap, shared libraries loaded, etc.) when
>> running the code. We should focus on malloc instead. If there were a
>> valgrind tool, like the performance-profiling tools, that let users
>> measure memory allocated but not freed in a user-specified code segment,
>> that would be very helpful here, but I have not found one.
>>
>>
>> Junchao
>>
>> Have you ever tried Massif?
>> http://valgrind.org/docs/manual/ms-manual.html
>>
>
> No. I came across it but am not familiar with it. I did not find APIs to
> call to get the current memory usage. I will look at it further. Thanks.
>

This is definitely the correct tool. It intercepts all calls to
malloc()/free(), so it can give you a complete picture of the allocated
memory at any time. It will draw a graph of this over time, labeled by
the routine that does each allocation.
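
For example, something like

  mpirun -n 4 valgrind --tool=massif ./ex5 -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500
  ms_print massif.out.<pid>

should produce one heap profile per rank (Massif writes massif.out.<pid>
for each process), and ms_print turns it into the graph plus a
per-allocation-site breakdown.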

   Matt

>> Sanjay, did you say that currently you can run with OpenMPI without
>> running out of memory, but with MPICH you run out of memory?  Is it
>> feasible to share your code so that I can test with it? Thanks.
>>
>> --Junchao Zhang
>>
>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s_g at berkeley.edu> wrote:
>>
>>> Barry,
>>>
>>> If you look at the graphs I generated (on my Mac), you will see that
>>> OpenMPI and MPICH have very different values, along with the fact that
>>> MPICH does not seem to adhere to the standard for releasing MPI_Isend
>>> resources following an MPI_Wait.
>>>
>>> -sanjay
>>>
>>> PS: I agree with Barry's assessment; this is really not acceptable.
>>>
>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>> >    Junchao,
>>> >
>>> >       This is insane. Either the OpenMPI library or something in the
>>> > OS underneath, related to sockets and interprocess communication, is
>>> > grabbing additional space for each round of MPI communication!  Does
>>> > MPICH have the same values as OpenMPI, or different ones? When you run
>>> > on Linux, do you get the same values as on Apple, or different ones?
>>> > Same values would indicate the issue is inside OpenMPI/MPICH; different
>>> > values would indicate the problem is more likely at the OS level. Does
>>> > this happen only with the default VecScatter that uses blocking MPI?
>>> > What happens with PetscSF under Vec? Is it somehow related to PETSc's
>>> > use of nonblocking sends and receives? One could presumably use
>>> > valgrind to see exactly which lines in which code are causing these
>>> > increases. I don't think we can just shrug and say this is the way it
>>> > is; we need to track down and understand the cause (and, if possible,
>>> > fix it).
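>>> > For instance, something along the lines of
>>> >   mpirun -n 4 valgrind --leak-check=full --show-reachable=yes ./ex5 \
>>> >     -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500
>>> > should report, per allocation call stack, memory that is leaked or
>>> > still reachable at exit.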
>>> >
>>> >    Barry
>>> >
>>> >
>>> >> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczhang at mcs.anl.gov>
>>> wrote:
>>> >>
>>> >> Sanjay,
>>> >> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted
>>> >> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning
>>> >> and end of KSPSolve, then computed the deltas and summed them over
>>> >> processes. Then I tested with
>>> >> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>>> >> With OpenMPI,
>>> >> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler
>>> -ts_max_steps 500 > 128.log
>>> >> grep -n -v "RSS Delta=         0, Malloc Delta=         0" 128.log
>>> >> 1:RSS Delta=     69632, Malloc Delta=         0
>>> >> 2:RSS Delta=     69632, Malloc Delta=         0
>>> >> 3:RSS Delta=     69632, Malloc Delta=         0
>>> >> 4:RSS Delta=     69632, Malloc Delta=         0
>>> >> 9:RSS Delta=9.25286e+06, Malloc Delta=         0
>>> >> 22:RSS Delta=     49152, Malloc Delta=         0
>>> >> 44:RSS Delta=     20480, Malloc Delta=         0
>>> >> 53:RSS Delta=     49152, Malloc Delta=         0
>>> >> 66:RSS Delta=      4096, Malloc Delta=         0
>>> >> 97:RSS Delta=     16384, Malloc Delta=         0
>>> >> 119:RSS Delta=     20480, Malloc Delta=         0
>>> >> 141:RSS Delta=     53248, Malloc Delta=         0
>>> >> 176:RSS Delta=     16384, Malloc Delta=         0
>>> >> 308:RSS Delta=     16384, Malloc Delta=         0
>>> >> 352:RSS Delta=     16384, Malloc Delta=         0
>>> >> 550:RSS Delta=     16384, Malloc Delta=         0
>>> >> 572:RSS Delta=     16384, Malloc Delta=         0
>>> >> 669:RSS Delta=     40960, Malloc Delta=         0
>>> >> 924:RSS Delta=     32768, Malloc Delta=         0
>>> >> 1694:RSS Delta=     20480, Malloc Delta=         0
>>> >> 2099:RSS Delta=     16384, Malloc Delta=         0
>>> >> 2244:RSS Delta=     20480, Malloc Delta=         0
>>> >> 3001:RSS Delta=     16384, Malloc Delta=         0
>>> >> 5883:RSS Delta=     16384, Malloc Delta=         0
>>> >>
>>> >> If I increase the grid size:
>>> >> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler
>>> -ts_max_steps 500 -malloc_test >512.log
>>> >> grep -n -v "RSS Delta=         0, Malloc Delta=         0" 512.log
>>> >> 1:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >> 2:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >> 3:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >> 4:RSS Delta=1.05267e+06, Malloc Delta=         0
>>> >> 13:RSS Delta=1.24932e+08, Malloc Delta=         0
>>> >>
>>> >> So we did see the RSS increase, in multiples of the 4 KB page size,
>>> >> after KSPSolve. As long as there are no memory leaks, why do you care
>>> >> about it? Is it because you run out of memory?
>>> >>
>>> >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>> wrote:
>>> >>
>>> >>     Thanks for the update. So the current conclusions are that using
>>> >> the Waitall in your code
>>> >>
>>> >> 1) solves the memory issue with OpenMPI in your code
>>> >>
>>> >> 2) does not solve the memory issue with PETSc KSPSolve
>>> >>
>>> >> 3) MPICH has memory issues both for your code and for PETSc KSPSolve,
>>> >> despite the Waitall fix?
>>> >>
>>> >> If you literally just comment out the call to KSPSolve() with OpenMPI,
>>> >> is there no growth in memory usage?
>>> >>
>>> >>
>>> >> Both 2 and 3 are concerning; they indicate possible memory leak bugs
>>> >> in MPICH, and that KSPSolve() may not be freeing all MPI resources.
>>> >>
>>> >> Junchao, can you please investigate 2 and 3 with, for example, a TS
>>> example that uses the linear solver (like with -ts_type beuler)? Thanks
>>> >>
>>> >>
>>> >>    Barry
>>> >>
>>> >>
>>> >>
>>> >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s_g at berkeley.edu>
>>> wrote:
>>> >>>
>>> >>> Lawrence,
>>> >>> Thanks for taking a look!  This is what I had been wondering about
>>> >>> -- my knowledge of MPI is pretty minimal, and the routine originated
>>> >>> with a programmer we hired a decade-plus back from NERSC.  I'll have
>>> >>> to look into VecScatter.  It will be great to dispense with our
>>> >>> roll-your-own routines (we even have our own reduceALL scattered
>>> >>> around the code).
>>> >>>
>>> >>> Interestingly, the MPI_Waitall has solved the problem when using
>>> >>> OpenMPI, but it still persists with MPICH.  Graphs attached.
>>> >>> I'm going to run with OpenMPI for now (but I guess I really still
>>> >>> need to figure out what is wrong with MPICH and Waitall;
>>> >>> I'll try Barry's suggestion of
>>> >>> --download-mpich-configure-arguments="--enable-error-messages=all
>>> >>> --enable-g" later today and report back).
>>> >>>
>>> >>> Regarding MPI_Barrier, it was put in due to a problem where some
>>> >>> processes were finishing their sends and receives and exiting the
>>> >>> subroutine before the receiving processes had completed (which
>>> >>> resulted in data loss, as the buffers are freed after the call to
>>> >>> the routine). MPI_Barrier was the solution proposed to us.  I don't
>>> >>> think I can dispense with it, but I will think about it some more.
>>> >>>
>>> >>> I'm not so sure about using MPI_Irecv, as it will require a bit of
>>> >>> rewriting: right now I process the received data sequentially after
>>> >>> each blocking MPI_Recv -- clearly slower, but easier to code.
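>>> >>>
>>> >>> For what it's worth, nonblocking receives can keep that sequential,
>>> >>> process-as-you-go structure by using MPI_Waitany to handle whichever
>>> >>> message completes next.  A rough C sketch (purely illustrative names,
>>> >>> not our actual code):
>>> >>>
>>> >>> MPI_Request recv_reqs[NRECV];
>>> >>> for (int j = 0; j < NRECV; j++)   /* post all the receives up front */
>>> >>>   MPI_Irecv(recvbuf[j], counts[j], MPI_DOUBLE, sources[j], tag, comm,
>>> >>>             &recv_reqs[j]);
>>> >>> for (int j = 0; j < NRECV; j++) { /* handle messages in completion order */
>>> >>>   int idx;
>>> >>>   MPI_Waitany(NRECV, recv_reqs, &idx, MPI_STATUS_IGNORE);
>>> >>>   process_received_data(idx, recvbuf[idx]);  /* hypothetical handler */
>>> >>> }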
>>> >>>
>>> >>> Thanks again for the help.
>>> >>>
>>> >>> -sanjay
>>> >>>
>>> >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>>> >>>> Hi Sanjay,
>>> >>>>
>>> >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>> >>>>>
>>> >>>>> The problem seems to persist but with a different signature.
>>> Graphs attached as before.
>>> >>>>>
>>> >>>>> Totals with MPICH (NB: single run)
>>> >>>>>
>>> >>>>> For the CG/Jacobi          data_exchange_total = 41,385,984;
>>> kspsolve_total = 38,289,408
>>> >>>>> For the GMRES/BJACOBI      data_exchange_total = 41,324,544;
>>> kspsolve_total = 41,324,544
>>> >>>>>
>>> >>>>> Just reading the MPI docs, I am wondering if I need some sort of
>>> >>>>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange
>>> >>>>> routine?  I would have thought that, with the blocking receives and
>>> >>>>> the MPI_Barrier, everything would have fully completed and been
>>> >>>>> cleaned up before all processes exited the routine, but perhaps I
>>> >>>>> am wrong on that.
>>> >>>> Skimming the Fortran code you sent, you do:
>>> >>>>
>>> >>>> for i in ...:
>>> >>>>     call MPI_Isend(..., req, ierr)
>>> >>>>
>>> >>>> for i in ...:
>>> >>>>     call MPI_Recv(..., ierr)
>>> >>>>
>>> >>>> But you never call MPI_Wait on the request you got back from the
>>> Isend. So the MPI library will never free the data structures it created.
>>> >>>>
>>> >>>> The usual pattern for these non-blocking communications is to
>>> allocate an array for the requests of length nsend+nrecv and then do:
>>> >>>>
>>> >>>> for i in nsend:
>>> >>>>     call MPI_Isend(..., req[i], ierr)
>>> >>>> for j in nrecv:
>>> >>>>     call MPI_Irecv(..., req[nsend+j], ierr)
>>> >>>>
>>> >>>> call MPI_Waitall(req, ..., ierr)
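>>> >>>>
>>> >>>> Written out a bit more concretely (here in C, with illustrative
>>> >>>> buffer and count names), the shape is roughly:
>>> >>>>
>>> >>>> MPI_Request reqs[NSEND + NRECV];
>>> >>>> for (int i = 0; i < NSEND; i++)
>>> >>>>   MPI_Isend(sendbuf[i], scount[i], MPI_DOUBLE, dest[i], tag, comm,
>>> >>>>             &reqs[i]);
>>> >>>> for (int j = 0; j < NRECV; j++)
>>> >>>>   MPI_Irecv(recvbuf[j], rcount[j], MPI_DOUBLE, src[j], tag, comm,
>>> >>>>             &reqs[NSEND + j]);
>>> >>>> /* completes (and lets MPI free) every send and receive request */
>>> >>>> MPI_Waitall(NSEND + NRECV, reqs, MPI_STATUSES_IGNORE);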
>>> >>>>
>>> >>>> I note also that there's no need for the Barrier at the end of the
>>> >>>> routine: this kind of communication does neighbourwise
>>> >>>> synchronisation, so there is no need to add (unnecessary) global
>>> >>>> synchronisation as well.
>>> >>>>
>>> >>>> As an aside, is there a reason you don't use PETSc's VecScatter to
>>> manage this global to local exchange?
>>> >>>>
>>> >>>> Cheers,
>>> >>>>
>>> >>>> Lawrence
>>> >>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
>>>
>>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/