[petsc-users] PETSc/SLEPc: Memory consumption, particularly during solver initialization/solve

Jed Brown jed at jedbrown.org
Thu Oct 4 15:59:32 CDT 2018


Matthew Knepley <knepley at gmail.com> writes:

> On Thu, Oct 4, 2018 at 1:54 PM Ale Foggia <amfoggia at gmail.com> wrote:
>
>> Thank you both for your answers :)
>>
>> Matt:
>> - Yes, sorry, I forgot to tell you that, but I've also called
>> PetscMemorySetGetMaximumUsage() right after initializing SLEPc. I've also
>> seen some strange behaviour: if I run the same code on my computer and on
>> the cluster *without* the command line option -malloc_dump, on the cluster
>> the output of PetscMallocGetCurrentUsage and PetscMallocGetMaximumUsage is
>> always zero, but that doesn't happen on my computer.
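For reference, a minimal sketch of that setup (illustrative only; error
handling trimmed):

  #include <slepceps.h>

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = SlepcInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    /* Enable tracking so PetscMemoryGetMaximumUsage() can report a meaningful value */
    ierr = PetscMemorySetGetMaximumUsage(); CHKERRQ(ierr);
    /* The PetscMalloc*Usage() numbers are typically only nonzero when PETSc's
       tracing malloc is active, e.g. when running with -malloc or -malloc_dump */

    /* ... build the matrix, set up the EPS solver, EPSSolve(), memory queries ... */

    ierr = SlepcFinalize();
    return ierr;
  }
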
>>
>> - This is the output of the code for the solving part (after EPSCreate and
>> after EPSSolve), and I've compared it with the output of *top* during those
>> moments of peak memory consumption. *top* shows the resident set size (RES)
>> in one of its columns, and those numbers are around 1 GB per process, while
>> the closest of the numbers reported by the PETSc functions is the one given
>> by MemoryGetCurrentUsage, which is only 800 MB in the solving stage. Can we
>> consider those numbers to be the same, plus/minus something? Is it safe to
>> say that MemoryGetCurrentUsage is measuring the "ru_maxrss" member of
>> "rusage" (or something similar)? If that's the case, what do the other
>> functions report?
>>
>
> This is a perennial problem, since RSS is no guarantee of what is actually
> being used, only of what was allocated at some point.

No, allocation alone does not make memory resident on most operating
systems.  If you run top, you see a VIRT column (memory that has been
allocated/mmap'd) and a RES column (memory actually resident in physical
memory).

PetscMemoryGetCurrentUsage tries to get resident memory usage via
/proc/{PID}/statm or getrusage() ru_maxrss.

PetscMallocGetMaximumUsage just says how much memory has been allocated
using PETSc's tracing malloc (off by default in an optimized build, but you
can turn it on by running with -malloc or related diagnostic options).
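
A minimal sketch of how the two families of calls differ (names are
illustrative; every rank prints its own values, error checking omitted):

  PetscLogDouble malloc_cur, malloc_max, mem_cur, mem_max;

  PetscMallocGetCurrentUsage(&malloc_cur); /* bytes currently allocated through PETSc's malloc */
  PetscMallocGetMaximumUsage(&malloc_max); /* high-water mark of PETSc's malloc */
  PetscMemoryGetCurrentUsage(&mem_cur);    /* resident set size of this process, from the OS */
  PetscMemoryGetMaximumUsage(&mem_max);    /* maximum RSS, needs PetscMemorySetGetMaximumUsage() */

  PetscPrintf(PETSC_COMM_SELF, "[malloc] cur %g max %g   [memory] cur %g max %g (bytes)\n",
              malloc_cur, malloc_max, mem_cur, mem_max);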

> The best tool I have seen for this is Massif. I really recommend it:
>
>   http://valgrind.org/docs/manual/ms-manual.html
>
>   Thanks,
>
>      Matt
>
>
>> ==================== SOLVER INIT ====================
>> MallocGetCurrent (init): 396096192.0 B
>> MallocGetMaximum (init): 415178624.0 B
>> MemoryGetCurrent (init): 624050176.0 B
>> MemoryGetMaximum (init): 623775744.0 B
>> ==================== SOLVER ====================
>> MallocGetCurrent (solver): 560320256.0 B
>> MallocGetMaximum (solver): 560333440.0 B
>> MemoryGetCurrent (solver): 820961280.0 B
>> MemoryGetMaximum (solver): 623775744.0 B
>>
>> Jose:
>> - By each step I mean each of the steps the program goes through in order
>> to diagonalize the matrix. For me, those are: creation of the basis,
>> preallocation of the matrix, setting the values of the matrix, initializing
>> the solver, solving/diagonalizing, and cleaning up. I'm only diagonalizing
>> once.
>>
>> - Regarding the information provided by -log_view, it's confusing to me:
>> for example, it reports the creation of Vecs scattered across the various
>> stages that I've set up (with PetscLogStageRegister and
>> PetscLogStagePush/Pop), but almost all of the destructions are reported in
>> the "Main Stage". What does that "Main Stage" cover? Why are there more
>> destructions than creations in it? It's not completely clear to me how
>> things are presented there.
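For what it's worth, a minimal sketch of how stages are usually set up (the
stage names here are made up): anything created or destroyed while no
user-defined stage is pushed, e.g. during cleanup after the last
PetscLogStagePop() or inside SlepcFinalize(), is attributed to the default
"Main Stage", which is typically why destructions pile up there.

  PetscLogStage stage_fill, stage_solve;

  PetscLogStageRegister("Matrix fill", &stage_fill);
  PetscLogStageRegister("EPS solve",   &stage_solve);

  PetscLogStagePush(stage_fill);
  /* ... MatSetValues(), MatAssemblyBegin/End() ... */
  PetscLogStagePop();

  PetscLogStagePush(stage_solve);
  /* ... EPSSolve() ... */
  PetscLogStagePop();

  /* Vec/Mat/EPS destructions that happen from here on are logged in the "Main Stage" */
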
>>
>> - Thanks for the suggestion about the solver. Does "faster convergence"
>> for Krylov-Schur mean less memory and less computation, or just less
>> computation?
>>
>> Ale
>>
>>
>> On Thu, Oct 4, 2018 at 1:12 PM, Jose E. Roman (<jroman at dsic.upv.es>)
>> wrote:
>>
>>> Regarding the SLEPc part:
>>> - What do you mean by "each step"? Are you calling EPSSolve() several
>>> times?
>>> - Yes, the BV object is generally what takes most of the memory. It is
>>> allocated at the beginning of EPSSolve(). Depending on the solver/options,
>>> other memory may be allocated as well.
>>> - You can also see the memory reported at the end of -log_view
>>> - I would suggest using the default solver Krylov-Schur - it will do
>>> Lanczos with implicit restart, which will give faster convergence than the
>>> EPSLANCZOS solver (see the sketch below).
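A sketch of what that change looks like (illustrative; Krylov-Schur is the
default, so simply omitting the EPSSetType() call, or passing
-eps_type krylovschur on the command line, has the same effect):

  EPSCreate(PETSC_COMM_WORLD, &solver);
  EPSSetOperators(solver, matrix, NULL);
  EPSSetProblemType(solver, EPS_HEP);
  EPSSetType(solver, EPSKRYLOVSCHUR);               /* instead of EPSLANCZOS */
  EPSSetWhichEigenpairs(solver, EPS_SMALLEST_REAL);
  EPSSetFromOptions(solver);                        /* lets -eps_type override the choice */
  EPSSolve(solver);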
>>>
>>> Jose
>>>
>>>
>>> > On Oct 4, 2018, at 12:49, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>> >
>>> > On Thu, Oct 4, 2018 at 4:43 AM Ale Foggia <amfoggia at gmail.com> wrote:
>>> > Hello all,
>>> >
>>> > I'm using SLEPc 3.9.2 (and PETSc 3.9.3) to get the EPS_SMALLEST_REAL of
>>> a matrix with the following characteristics:
>>> >
>>> > * type: real, Hermitian, sparse
>>> > * linear size: 2333606220
>>> > * distributed in 2048 processes (64 nodes, 32 procs per node)
>>> >
>>> > My code first preallocates the necessary memory with
>>> *MatMPIAIJSetPreallocation*, then fills it with the values and finally it
>>> calls the following functions to create the solver and diagonalize the
>>> matrix:
>>> >
>>> > EPSCreate(PETSC_COMM_WORLD, &solver);
>>> > EPSSetOperators(solver,matrix,NULL);
>>> > EPSSetProblemType(solver, EPS_HEP);
>>> > EPSSetType(solver, EPSLANCZOS);
>>> > EPSSetWhichEigenpairs(solver, EPS_SMALLEST_REAL);
>>> > EPSSetFromOptions(solver);
>>> > EPSSolve(solver);
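For context, the preallocation/fill part mentioned above would typically look
something like the sketch below (N, d_nnz and o_nnz are placeholders for the
actual sizes and per-row nonzero counts, not the original code; a size of
2333606220 also requires a PETSc build with 64-bit indices):

  MatCreate(PETSC_COMM_WORLD, &matrix);
  MatSetSizes(matrix, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetType(matrix, MATMPIAIJ);
  /* d_nnz/o_nnz: nonzeros per local row in the diagonal/off-diagonal blocks */
  MatMPIAIJSetPreallocation(matrix, 0, d_nnz, 0, o_nnz);
  /* ... loop over locally owned rows calling MatSetValues() ... */
  MatAssemblyBegin(matrix, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(matrix, MAT_FINAL_ASSEMBLY);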
>>> >
>>> > I want to estimate the memory used by the program (at every step) for
>>> larger problem sizes, because I would like to keep it under 16 GB per node.
>>> I've used the "memory usage" functions provided by PETSc, but something
>>> happens during the solver stage that I can't explain. This brings up two
>>> questions.
>>> >
>>> > 1) In each step I put a call to four memory functions and between them
>>> I print the value of mem:
>>> >
>>> > Did you call PetscMemorySetGetMaximumUsage() first?
>>> >
>>> > We are computing https://en.wikipedia.org/wiki/Resident_set_size
>>> however we can. Usually with getrusage().
>>> > From this (
>>> https://www.binarytides.com/linux-command-check-memory-usage/), it looks
>>> like top also reports
>>> > paged out memory.
>>> >
>>> >    Matt
>>> >
>>> > PetscLogDouble mem = 0;            /* filled in by each call below */
>>> > PetscMallocGetCurrentUsage(&mem);  /* value of mem printed after each call */
>>> > PetscMallocGetMaximumUsage(&mem);
>>> > PetscMemoryGetCurrentUsage(&mem);
>>> > PetscMemoryGetMaximumUsage(&mem);
>>> >
>>> > I've read some other questions in the mailing list regarding the same
>>> issue but I can't fully understand this. What is the difference between all
>>> of these functions? What information are they actually giving me? (I know
>>> this is only a "per process" output.) I copy the output of two steps of the
>>> program as an example:
>>> >
>>> > ==================== step N ====================
>>> > MallocGetCurrent: 314513664.0 B
>>> > MallocGetMaximum: 332723328.0 B
>>> > MemoryGetCurrent: 539996160.0 B
>>> > MemoryGetMaximum: 0.0 B
>>> > ==================== step N+1 ====================
>>> > MallocGetCurrent: 395902912.0 B
>>> > MallocGetMaximum: 415178624.0 B
>>> > MemoryGetCurrent: 623783936.0 B
>>> > MemoryGetMaximum: 623775744.0 B
>>> >
>>> > 2) I was using this information to calculate the memory required per node
>>> to run my problem. Also, I'm able to log in to the computing node while the
>>> program is running and check the memory consumption (with *top*). The memory
>>> usage that I see with top is more or less the same as the one reported by
>>> the PETSc functions at the beginning. But during the initialization of the
>>> solver and during the solve, *top* reports a consumption two times bigger
>>> than what the functions report. Is it possible to know where this extra
>>> memory consumption comes from? What does SLEPc allocate that needs that much
>>> memory? I've been trying to do the math but I think there are things I'm
>>> missing. I thought that part of it comes from the "BV" that the option
>>> -eps_view reports:
>>> >
>>> > BV Object: 2048 MPI processes
>>> >   type: svec
>>> >   17 columns of global length 2333606220
>>> >   vector orthogonalization method: modified Gram-Schmidt
>>> >   orthogonalization refinement: if needed (eta: 0.7071)
>>> >   block orthogonalization method: GS
>>> >   doing matmult as a single matrix-matrix product
>>> >
>>> > But "17 * 2333606220 * 8 Bytes / #nodes" only explains on third or less
>>> of the "extra" memory.
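For reference, the arithmetic behind that estimate, using the 64 nodes /
2048 processes mentioned above:

  17 columns * 2333606220 rows * 8 bytes ≈ 317 GB in total
                                         ≈ 5 GB per node (64 nodes)
                                         ≈ 155 MB per process (2048 processes)

which is consistent with the BV accounting for only a fraction of the extra
consumption seen per process.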
>>> >
>>> > Ale
>>> >
>>> >
>>> >
>>> > --
>>> > What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> > -- Norbert Wiener
>>> >
>>> > https://www.cse.buffalo.edu/~knepley/
>>>
>>>
>
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/

