[petsc-users] memory use of a DMDA

Matthew Knepley knepley at gmail.com
Tue Oct 22 05:47:04 CDT 2013


On Tue, Oct 22, 2013 at 5:12 AM, Juha Jäykkä <juhaj at iki.fi> wrote:

> On Tuesday 22 October 2013 05:06:02 Matthew Knepley wrote:
> > On Tue, Oct 22, 2013 at 3:57 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
> > > Barry,
> > >
> > > I seem to have touched a topic which goes way past my knowledge of PETSc
> > > internals, but it's very nice to see a thorough response nevertheless.
> > > Thank you. And Matthew, too.
> > >
> > > After reading your suspicions about the number of ranks, I tried with 1,
> > > 2 and 4, and the memory use indeed seems to go down from 1:
> >
> > I am now convinced that /proc is showing the total memory ever allocated,
> > since the OS is not recovering any freed memory. If you want to see memory
> > allocated but not freed, just do not destroy the DA and run with
> > -malloc_test.
>
> I'm not sure what you mean here: I'm interested in the maximum amount of
> memory used at any point during the program execution. /proc is supposed to
> know that. And for a longer run, ps, top and /proc do indeed agree, so I
> think I have the right numbers.
>
> Why the peak? Because I'm running on several machines where I get killed for
> exceeding a memory limit. Sometimes the limit is on VSZ, which is not a
> problem, but sometimes the limit is on RSS, which does present a problem,
> especially on some machines where there is no swap, so I need to stay below
> the physical main memory limit all the time. If some of the memory gets
> freed later, it is of no use to me because by then I'm dead.
>
> If I misunderstood something, please point it out. ;)
>

We sometimes allocate temporary memory that we free afterwards (for example,
two vectors used to set up the scatters). It is not allocated at the same time
as the other vectors, but if the OS does not reclaim the freed memory, it
still shows up in RSS.
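For what it's worth, both the current RSS and its peak can be read straight
from /proc on Linux; a minimal sketch of a helper along the lines of the
_ProcessMemoryInfoProc routine used in this thread (assuming Linux, where
/proc/self/status exposes VmRSS and the high-water mark VmHWM) would be:

import re

def memory_kb():
    # Read current RSS (VmRSS) and its high-water mark (VmHWM) in kB.
    # VmHWM is the peak RSS, which is what batch-system memory limits
    # typically act on, even if some of that memory has since been freed.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            m = re.match(r"(VmRSS|VmHWM):\s+(\d+)\s+kB", line)
            if m:
                fields[m.group(1)] = int(m.group(2))
    return fields

mem = memory_kb()
print("VmRSS = %d kB, VmHWM = %d kB" % (mem.get("VmRSS", 0), mem.get("VmHWM", 0)))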

  Matt


> Cheers,
> Juha
>
> >
> >    Matt
> >
> > > juhaj at dhcp071> CMD='import helpers;
> > > procdata=helpers._ProcessMemoryInfoProc();
> > > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> > > from petsc4py import PETSc;
> > > procdata=helpers._ProcessMemoryInfoProc();
> > > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> > > da = PETSc.DA().create(sizes=[100,100,100],
> > >     proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> > >     boundary_type=[3,0,0],
> > >     stencil_type=PETSc.DA.StencilType.BOX, dof=7, stencil_width=1,
> > >     comm=PETSc.COMM_WORLD);
> > > procdata=helpers._ProcessMemoryInfoProc();
> > > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'
> > > juhaj at dhcp071> mpirun -np 1 python -c "$CMD"
> > > 21 MiB / 22280 kB
> > > 21 MiB / 22304 kB
> > > 354 MiB / 419176 kB
> > > juhaj at dhcp071> mpirun -np 2 python -c "$CMD"
> > > 22 MiB / 23276 kB
> > > 22 MiB / 23020 kB
> > > 22 MiB / 23300 kB
> > > 22 MiB / 23044 kB
> > > 141 MiB / 145324 kB
> > > 141 MiB / 145068 kB
> > > juhaj at dhcp071> mpirun -np 4 python -c "$CMD"
> > > 22 MiB / 23292 kB
> > > 22 MiB / 23036 kB
> > > 22 MiB / 23316 kB
> > > 22 MiB / 23060 kB
> > > 22 MiB / 23316 kB
> > > 22 MiB / 23340 kB
> > > 22 MiB / 23044 kB
> > > 22 MiB / 23068 kB
> > > 81 MiB / 83716 kB
> > > 82 MiB / 83976 kB
> > > 81 MiB / 83964 kB
> > > 81 MiB / 83724 kB
> > >
> > > As one would expect, 4 ranks need more memory than 2 ranks, but quite
> > > unexpectedly, 1 rank needs more than 2! I guess you are right: the
> > > 1-rank case is not optimised and, quite frankly, I don't mind: I only
> > > ever run small tests with one rank. Unfortunately, in trying to create
> > > the simplest possible scenario to illustrate my point, I used a small DA
> > > and just one rank, precisely to avoid the case where the excess memory
> > > would be due to MPI buffers or such. Looks like my plan backfired. ;)
> > >
> > > But even so, my 53 MiB lattice, without any vectors created, takes 280
> > > or 320 MiB of memory – down to a ratio of <6 from the original 6.6.
> > >
> > > I will test with 3.3 later today if I have the time, but I'm pretty sure
> > > things were "better" there.
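The 53 MiB figure for the raw field data follows directly from the DA
parameters in the script above; a quick sanity check with plain arithmetic
(no PETSc involved):

# 100 x 100 x 100 grid points, 7 degrees of freedom per point, 8-byte doubles
nbytes = 100 * 100 * 100 * 7 * 8
print("%d bytes = %.1f MiB" % (nbytes, nbytes / 2.0**20))  # 56000000 bytes = 53.4 MiB

The same 56,000,000 bytes is the "about 56 megabytes" (decimal MB) mentioned
in the original message quoted further down.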
> > >
> > > On Monday 21 October 2013 15:23:01 Barry Smith wrote:
> > > >    Matt,
> > > >
> > > >      I think you are running on 1 process, where the DMDA doesn't have
> > > > an optimized path; when I run on 2 processes, the numbers indicate
> > > > nothing proportional to dof * number of local points.
> > > >
> > > > dof = 12
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > > [0] 7 21344 VecScatterCreate()
> > > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > > [0] 39 182480 VecScatterCreate_PtoS()
> > > >
> > > > dof = 8
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > > [0] 7 21344 VecScatterCreate()
> > > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > > [0] 39 176080 VecScatterCreate_PtoS()
> > > >
> > > > dof = 4
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > > [0] 7 21344 VecScatterCreate()
> > > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > > [0] 39 169680 VecScatterCreate_PtoS()
> > > >
> > > > dof = 2
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > > [0] 7 21344 VecScatterCreate()
> > > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > > [0] 39 166480 VecScatterCreate_PtoS()
> > > >
> > > > dof = 2, grid is 50 by 50 instead of 100 by 100
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > > [0] 7 6352 VecScatterCreate()
> > > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > > [0] 39 43952 VecScatterCreate_PtoS()
> > > >
> > > > The IS creation in the DMDA is far more troubling.
> > > >
> > > > dof = 2
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > > [0] 1 20400 ISBlockSetIndices_Block()
> > > > [0] 15 3760 ISCreate()
> > > > [0] 4 128 ISCreate_Block()
> > > > [0] 1 16 ISCreate_Stride()
> > > > [0] 2 81600 ISGetIndices_Block()
> > > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > > [0] 7 42016 ISLocalToGlobalMappingCreate()
> > > >
> > > > dof = 4
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > > [0] 1 20400 ISBlockSetIndices_Block()
> > > > [0] 15 3760 ISCreate()
> > > > [0] 4 128 ISCreate_Block()
> > > > [0] 1 16 ISCreate_Stride()
> > > > [0] 2 163200 ISGetIndices_Block()
> > > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > > [0] 7 82816 ISLocalToGlobalMappingCreate()
> > > >
> > > > dof = 8
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > > [0] 1 20400 ISBlockSetIndices_Block()
> > > > [0] 15 3760 ISCreate()
> > > > [0] 4 128 ISCreate_Block()
> > > > [0] 1 16 ISCreate_Stride()
> > > > [0] 2 326400 ISGetIndices_Block()
> > > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > > [0] 7 164416 ISLocalToGlobalMappingCreate()
> > > >
> > > > dof = 12
> > > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > > [0] 1 20400 ISBlockSetIndices_Block()
> > > > [0] 15 3760 ISCreate()
> > > > [0] 4 128 ISCreate_Block()
> > > > [0] 1 16 ISCreate_Stride()
> > > > [0] 2 489600 ISGetIndices_Block()
> > > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > > [0] 7 246016 ISLocalToGlobalMappingCreate()
> > > >
> > > > Here the accessing of indices is at the point level (as well as block)
> > > > and hence memory usage is proportional to dof * local number of grid
> > > > points. Of course it is still only proportional to the vector size.
> > > > There is some improvement we could make: with a lot of refactoring we
> > > > can remove the dof * factor completely; with a little refactoring we
> > > > can bring it down to a single dof * local number of grid points.
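As a rough check of that proportionality against the numbers above: if the
ghosted local patch on this 2-process run is taken to be 100 x 51 = 5100
points (an assumption, not stated in the thread, but consistent with the
20400-byte ISBlockSetIndices_Block entries at 4 bytes per index), then two
integer index arrays of length dof * local points predict the reported
ISGetIndices_Block sizes exactly:

# Hypothetical check: 2 index arrays of 4-byte integers over dof * ghosted
# local points, with the local patch assumed to be 100 x 51 = 5100 points.
local_points = 100 * 51
for dof, reported in [(2, 81600), (4, 163200), (8, 326400), (12, 489600)]:
    predicted = 2 * 4 * dof * local_points
    print("dof=%2d: predicted %6d bytes, reported %6d bytes" % (dof, predicted, reported))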
> > > >
> > > >    I cannot understand why you are seeing memory usage 7 times more
> > > > than a vector. That seems like a lot.
> > > >
> > > >    Barry
> > > >
> > > > On Oct 21, 2013, at 11:32 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > > > >   The PETSc DMDA object greedily allocates several arrays of data
> > > > >   used to set up the communication and other things like local to
> > > > >   global mappings even before you create any vectors. This is why
> > > > >   you see this big bump in memory usage.
> > > > >
> > > > >   BUT I don't think it should be any worse in 3.4 than in 3.3 or
> > > > >   earlier; at least we did not intend to make it worse. Are you sure
> > > > >   it is using more memory than in 3.3?
> > > > >
> > > > >   In order for us to decrease the memory usage of the DMDA setup it
> > > > >   would be helpful if we knew which objects created within it used
> > > > >   the most memory. There is some sloppiness in that routine of not
> > > > >   reusing memory as well as it could; I am not sure how much
> > > > >   difference that would make.
> > > > >
> > > > >
> > > > >   Barry
> > > > >
> > > > > On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
> > > > >> Dear list members,
> > > > >>
> > > > >> I have noticed strange memory consumption after upgrading to the
> > > > >> 3.4 series. I never had time to properly investigate, but here is
> > > > >> what happens [yes, this might be a petsc4py issue, but I doubt it]:
> > > > >>
> > > > >> # helpers contains a _ProcessMemoryInfoProc routine which just digs
> > > > >> # the memory usage data from /proc
> > > > >> import helpers
> > > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > > >> from petsc4py import PETSc
> > > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > > >> da = PETSc.DA().create(sizes=[100,100,100],
> > > > >>                        proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> > > > >>                        boundary_type=[3,0,0],
> > > > >>                        stencil_type=PETSc.DA.StencilType.BOX,
> > > > >>                        dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
> > > > >>
> > > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > > >> vec=da.createGlobalVec()
> > > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > > >>
> > > > >> outputs
> > > > >>
> > > > >> 48 MiB / 49348 kB
> > > > >> 48 MiB / 49360 kB
> > > > >> 381 MiB / 446228 kB
> > > > >> 435 MiB / 446228 kB
> > > > >>
> > > > >> Which is odd: the size of the actual data to be stored in the da is
> > > > >> just about 56 megabytes, so why does creating the da consume 7 times
> > > > >> that? And why does the DA reserve the memory in the first place? I
> > > > >> thought memory only gets allocated once an associated vector is
> > > > >> created, and indeed the createGlobalVec call does seem to allocate
> > > > >> the right amount of data. But what is that 330 MiB that
> > > > >> DA().create() consumes? [It's actually the .setUp() method that does
> > > > >> the consuming, but that's not of much use, as it needs to be called
> > > > >> before a vector can be created.]
> > > > >>
> > > > >> Cheers,
> > > > >> Juha
>
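For anyone who wants to reproduce these measurements without the helpers
module (which is not part of the thread), a rough petsc4py sketch along the
same lines could look like the following; the script name, the -dof option,
and the use of ru_maxrss (peak RSS, reported in kB on Linux) are illustrative
choices rather than anything from the original script:

# dmda_memory.py -- run e.g.: mpirun -np 2 python dmda_memory.py -dof 7
import sys
import resource
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

def peak_rss_kb():
    # Peak resident set size of this process; on Linux ru_maxrss is in kB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

dof = PETSc.Options().getInt('dof', 7)
PETSc.Sys.Print("after import:        %d kB" % peak_rss_kb())

da = PETSc.DA().create(sizes=[100, 100, 100],
                       proc_sizes=[PETSc.DECIDE, PETSc.DECIDE, PETSc.DECIDE],
                       boundary_type=[3, 0, 0],
                       stencil_type=PETSc.DA.StencilType.BOX,
                       dof=dof, stencil_width=1,
                       comm=PETSc.COMM_WORLD)
PETSc.Sys.Print("after DA create:     %d kB" % peak_rss_kb())

vec = da.createGlobalVec()
PETSc.Sys.Print("after global vector: %d kB" % peak_rss_kb())

Note that PETSc.Sys.Print only reports rank 0's numbers; to get per-rank
figures like the ones earlier in the thread, use an ordinary print on every
rank instead.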



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener