[petsc-users] memory use of a DMDA

Juha Jäykkä juhaj at iki.fi
Tue Oct 22 05:12:31 CDT 2013


On Tuesday 22 October 2013 05:06:02 Matthew Knepley wrote:
> On Tue, Oct 22, 2013 at 3:57 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
> > Barry,
> > 
> > I seem to have touched a topic which goes way past my knowledge of PETSc
> > internals, but it's very nice to see a thorough response nevertheless.
> > Thank you. And Matthew, too.
> > 
> > After reading your suspicions about number of ranks, I tried with 1, 2
> > and 4 and the memory use indeed seems to go down from 1:
> I am now convinced that /proc is showing total memory ever allocated
> since the OS is not recovering any freed memory. If you want to see
> memory allocated, but not freed, just do not destroy the DA and run with
> -malloc_test.

I'm not sure what you mean here: I'm interested in the maximum amount of 
memory used at any point during the program execution. /proc is supposed to 
know that. And for a longer run, ps, top and /proc do indeed agree, so I think 
I have the right numbers.

Why the peak? Because I'm running on several machines where I get killed for 
exceeding a memory limit. Sometimes the limit is on VSZ, which is not a 
problem, but sometimes the limit is on RSS, which does present a problem, 
especially on some machines where there is no swap, so I need to stay below 
the physical main memory limit all the time. If some of the memory gets freed 
later, it is of no use to me because by then I'm dead.
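
For reference, a minimal sketch of how the peak can be read directly, assuming
a Linux /proc filesystem. The helpers module used in this thread is not shown,
so the function names below are hypothetical stand-ins, not its actual code.
VmRSS is the current resident set size and VmHWM its high-water mark.

```python
import re


def parse_proc_status(text):
    """Extract VmRSS and VmHWM (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for name in ("VmRSS", "VmHWM"):
        match = re.search(r"^%s:\s+(\d+)\s+kB" % name, text, re.MULTILINE)
        if match:
            fields[name] = int(match.group(1))
    return fields


def current_and_peak_rss_kb(path="/proc/self/status"):
    """Return current and peak RSS of this process in kB (Linux only)."""
    with open(path) as handle:
        return parse_proc_status(handle.read())
```

Calling current_and_peak_rss_kb() on each rank right after the DA is created
gives both figures at once. The standard library's
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss reports the same kind of
peak, in kB on Linux.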

If I misunderstood something, please point it out. ;)

Cheers,
Juha

> 
>    Matt
> 
> > juhaj at dhcp071> CMD='import helpers;
> > procdata=helpers._ProcessMemoryInfoProc();
> > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> > from petsc4py import PETSc;
> > procdata=helpers._ProcessMemoryInfoProc();
> > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> > da = PETSc.DA().create(sizes=[100,100,100],
> > proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> > boundary_type=[3,0,0],
> > stencil_type=PETSc.DA.StencilType.BOX, dof=7, stencil_width=1,
> > comm=PETSc.COMM_WORLD);
> > procdata=helpers._ProcessMemoryInfoProc();
> > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'
> > juhaj at dhcp071> mpirun -np 1 python -c "$CMD"
> > 21 MiB / 22280 kB
> > 21 MiB / 22304 kB
> > 354 MiB / 419176 kB
> > juhaj at dhcp071> mpirun -np 2 python -c "$CMD"
> > 22 MiB / 23276 kB
> > 22 MiB / 23020 kB
> > 22 MiB / 23300 kB
> > 22 MiB / 23044 kB
> > 141 MiB / 145324 kB
> > 141 MiB / 145068 kB
> > juhaj at dhcp071> mpirun -np 4 python -c "$CMD"
> > 22 MiB / 23292 kB
> > 22 MiB / 23036 kB
> > 22 MiB / 23316 kB
> > 22 MiB / 23060 kB
> > 22 MiB / 23316 kB
> > 22 MiB / 23340 kB
> > 22 MiB / 23044 kB
> > 22 MiB / 23068 kB
> > 81 MiB / 83716 kB
> > 82 MiB / 83976 kB
> > 81 MiB / 83964 kB
> > 81 MiB / 83724 kB
> > 
> > As one would expect, 4 ranks need more memory than 2 ranks, but quite
> > unexpectedly, 1 rank needs more than 2! I guess you are right: the
> > 1-rank-case is not optimised and quite frankly, I don't mind: I only ever
> > run small tests with one rank. Unfortunately, trying to create the
> > simplest possible scenario to illustrate my point, I used a small DA and
> > just one rank, precisely to avoid the case where the excess memory would
> > be due to MPI buffers or such. Looks like my plan backfired. ;)
> > 
> > But even still, my 53 MiB lattice, without any vectors created, takes 280
> > or 320 MiB of memory, a factor of <6 of the data size, down from the
> > original 6.6.
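
The arithmetic behind those factors, as a rough sketch. It assumes 8-byte real
scalars, which the thread does not state, and uses the rounded totals quoted
in the text (354, 280 and 320 MiB for 1, 2 and 4 ranks).

```python
# Field data for a 100x100x100 grid with dof=7, assuming 8-byte scalars
# (an assumption; the PETSc build configuration is not given in the thread).
points = 100 * 100 * 100
dof = 7
bytes_per_scalar = 8
data_mib = points * dof * bytes_per_scalar / 2.0**20

# Total RSS after DA creation summed over ranks, as quoted in the text (MiB).
total_rss_mib = {1: 354, 2: 280, 4: 320}

ratios = {}
for ranks in sorted(total_rss_mib):
    ratios[ranks] = total_rss_mib[ranks] / data_mib
    print("%d rank(s): %d MiB is %.1f times the %.1f MiB of field data"
          % (ranks, total_rss_mib[ranks], ratios[ranks], data_mib))
```

Note that these totals still include the roughly 22 MiB per rank that the
interpreter uses before the DA exists, so the share due to the DA itself is
somewhat smaller.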
> > 
> > I will test with 3.3 later today if I have the time, but I'm pretty sure
> > things were "better" there.
> > 
> > On Monday 21 October 2013 15:23:01 Barry Smith wrote:
> > >    Matt,
> > >    
> > >      I think you are running on 1 process where the DMDA doesn't have an
> > > optimized path; when I run on 2 processes the numbers indicate nothing
> > > proportional to dof * number of local points.
> > > 
> > > dof = 12
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > [0] 7 21344 VecScatterCreate()
> > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > [0] 39 182480 VecScatterCreate_PtoS()
> > > 
> > > dof = 8
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > [0] 7 21344 VecScatterCreate()
> > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > [0] 39 176080 VecScatterCreate_PtoS()
> > > 
> > > dof = 4
> > > 
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > [0] 7 21344 VecScatterCreate()
> > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > [0] 39 169680 VecScatterCreate_PtoS()
> > > 
> > > dof = 2
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > [0] 7 21344 VecScatterCreate()
> > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > [0] 39 166480 VecScatterCreate_PtoS()
> > > 
> > > dof =2 grid is 50 by 50 instead of 100 by 100
> > > 
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > > [0] 7 6352 VecScatterCreate()
> > > [0] 2 32 VecScatterCreateCommon_PtoS()
> > > [0] 39 43952 VecScatterCreate_PtoS()
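
A quick check of how the VecScatterCreate_PtoS() figures above scale, using
only the numbers from the log. This is a sketch of the comparison, not a
statement about what the routine allocates internally.

```python
# Bytes attributed to VecScatterCreate_PtoS() on rank 0, copied from the
# -malloc_log output above; keys are (grid points per side, dof).
ptos_bytes = {
    (100, 12): 182480,
    (100, 8): 176080,
    (100, 4): 169680,
    (100, 2): 166480,
    (50, 2): 43952,
}

# Same grid, 6 times the dof.
dof_growth = ptos_bytes[(100, 12)] / float(ptos_bytes[(100, 2)])
# Same dof, 4 times the number of grid points.
grid_growth = ptos_bytes[(100, 2)] / float(ptos_bytes[(50, 2)])

print("6x the dof:         %.2fx the memory" % dof_growth)
print("4x the grid points: %.2fx the memory" % grid_growth)
```

The allocation follows the number of grid points closely and the dof only
weakly, which is what "nothing proportional to dof * number of local points"
refers to.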
> > > 
> > > The IS creation in the DMDA is far more troubling
> > > 
> > > /Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > 
> > > dof = 2
> > > 
> > > [0] 1 20400 ISBlockSetIndices_Block()
> > > [0] 15 3760 ISCreate()
> > > [0] 4 128 ISCreate_Block()
> > > [0] 1 16 ISCreate_Stride()
> > > [0] 2 81600 ISGetIndices_Block()
> > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > [0] 7 42016 ISLocalToGlobalMappingCreate()
> > > 
> > > dof = 4
> > > 
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > [0] 1 20400 ISBlockSetIndices_Block()
> > > [0] 15 3760 ISCreate()
> > > [0] 4 128 ISCreate_Block()
> > > [0] 1 16 ISCreate_Stride()
> > > [0] 2 163200 ISGetIndices_Block()
> > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > [0] 7 82816 ISLocalToGlobalMappingCreate()
> > > 
> > > dof = 8
> > > 
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > [0] 1 20400 ISBlockSetIndices_Block()
> > > [0] 15 3760 ISCreate()
> > > [0] 4 128 ISCreate_Block()
> > > [0] 1 16 ISCreate_Stride()
> > > [0] 2 326400 ISGetIndices_Block()
> > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > [0] 7 164416 ISLocalToGlobalMappingCreate()
> > > 
> > > dof = 12
> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > > [0] 1 20400 ISBlockSetIndices_Block()
> > > [0] 15 3760 ISCreate()
> > > [0] 4 128 ISCreate_Block()
> > > [0] 1 16 ISCreate_Stride()
> > > [0] 2 489600 ISGetIndices_Block()
> > > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > > [0] 7 246016 ISLocalToGlobalMappingCreate()
> > > 
> > > Here the accessing of indices is at the point level (as well as block)
> > > and hence memory usage is proportional to dof * local number of grid
> > > points. Of course it is still only proportional to the vector size.
> > > There is some improvement we could make; with a lot of refactoring we
> > > can remove the dof* completely, with a little refactoring we can bring
> > > it down to a single dof * local number of grid points.
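
The dof dependence can be read straight off the logged numbers. The sketch
below fits bytes = fixed + slope * dof to the IS entries quoted above; the
helper name linear_fit is made up for this illustration.

```python
# Bytes logged by -malloc_log on rank 0 for each dof, copied from above.
logged = {
    "ISGetIndices_Block()": {2: 81600, 4: 163200, 8: 326400, 12: 489600},
    "ISLocalToGlobalMappingCreate()": {2: 42016, 4: 82816, 8: 164416, 12: 246016},
    "ISBlockSetIndices_Block()": {2: 20400, 4: 20400, 8: 20400, 12: 20400},
}


def linear_fit(samples):
    """Return (bytes per dof, fixed bytes) from the smallest and largest dof."""
    lo, hi = min(samples), max(samples)
    slope = (samples[hi] - samples[lo]) // (hi - lo)
    return slope, samples[lo] - slope * lo


fits = {}
exact = {}
for name in sorted(logged):
    slope, fixed = linear_fit(logged[name])
    fits[name] = (slope, fixed)
    # Check that every logged value lies exactly on the fitted line.
    exact[name] = all(fixed + slope * d == b for d, b in logged[name].items())
    print("%-32s %6d bytes/dof + %5d fixed (exact: %s)"
          % (name, slope, fixed, exact[name]))
```

The block-level entry stays constant while the point-level entries grow
linearly with dof. If the indices are 4-byte integers (an assumption), 20400
bytes would be 5100 entries, which would match a 51 by 100 local patch with
one ghost column; that reading is a guess, not something the log confirms.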
> > > 
> > >    I cannot understand why you are seeing memory usage 7 times more than
> > > a vector. That seems like a lot.
> > > 
> > >    Barry
> > > 
> > > On Oct 21, 2013, at 11:32 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > > >   The PETSc DMDA object greedily allocates several arrays of data used
> > 
> > to
> > 
> > > >   set up the communication and other things like local to global
> > 
> > mappings
> > 
> > > >   even before you create any vectors. This is why you see this big
> > > >   bump
> > > >   in memory usage.
> > > >   
> > > >   BUT I don't think it should be any worse in 3.4 than in 3.3 or
> > 
> > earlier;
> > 
> > > >   at least we did not intend to make it worse. Are you sure it is
> > > >   using
> > > >   more memory than in 3.3
> > > >   
> > > >   In order for use to decrease the memory usage of the DMDA setup it
> > 
> > would
> > 
> > > >   be helpful if we knew which objects created within it used the most
> > > >   memory.  There is some sloppiness in that routine of not reusing
> > 
> > memory
> > 
> > > >   as well as could be, not sure how much difference that would make.
> > > >   
> > > >   
> > > >   Barry
> > > > 
> > > > On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
> > > >> Dear list members,
> > > >> 
> > > > >> I have noticed strange memory consumption after upgrading to the 3.4
> > > > >> series. I never had time to properly investigate, but here is what
> > > > >> happens [yes, this might be a petsc4py issue, but I doubt it]:
> > > >> 
> > > > >> # helpers contains _ProcessMemoryInfoProc routine which just digs the
> > > > >> # memory usage data from /proc
> > > >> import helpers
> > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > >> from petsc4py import PETSc
> > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > > >> da = PETSc.DA().create(sizes=[100,100,100],
> > > > >>                      proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> > > > >>                      boundary_type=[3,0,0],
> > > > >>                      stencil_type=PETSc.DA.StencilType.BOX,
> > > > >>                      dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
> > > >> 
> > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > >> vec=da.createGlobalVec()
> > > >> procdata=helpers._ProcessMemoryInfoProc()
> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > > >> 
> > > >> outputs
> > > >> 
> > > >> 48 MiB / 49348 kB
> > > >> 48 MiB / 49360 kB
> > > >> 381 MiB / 446228 kB
> > > >> 435 MiB / 446228 kB
> > > >> 
> > > > >> Which is odd: size of the actual data to be stored in the da is just
> > > > >> about 56 megabytes, so why does creating the da consume 7 times that?
> > > > >> And why does the DA reserve the memory in the first place? I thought
> > > > >> memory only gets allocated once an associated vector is created, and
> > > > >> it does look like the createGlobalVec call allocates the right amount
> > > > >> of data. But what is that 330 MiB that DA().create() consumes? [It's
> > > > >> actually the .setUp() method that does the consuming, but that's not
> > > > >> of much use as it needs to be called before a vector can be created.]
> > > >> 
> > > >> Cheers,
> > > >> Juha

