[petsc-users] memory use of a DMDA

Matthew Knepley knepley at gmail.com
Tue Oct 22 05:06:02 CDT 2013


On Tue, Oct 22, 2013 at 3:57 AM, Juha Jäykkä <juhaj at iki.fi> wrote:

> Barry,
>
> I seem to have touched a topic which goes way past my knowledge of PETSc
> internals, but it's very nice to see a thorough response nevertheless.
> Thank you. And Matthew, too.
>
> After reading your suspicions about the number of ranks, I tried with 1, 2
> and 4 and the memory use indeed seems to go down from 1:
>

I am now convinced that /proc is showing the total memory ever allocated,
since the OS is not recovering any freed memory. If you want to see memory
allocated, but not freed, just do not destroy the DA and run with
-malloc_test.

   Matt
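
For anyone wanting to check this on Linux, here is a minimal sketch, assuming
the standard field names in /proc/<pid>/status: VmRSS is the current resident
set, while VmHWM (peak resident set) and VmPeak (peak virtual size) are
high-water marks that never decrease. The parse_status helper and the sample
numbers are hypothetical and purely illustrative; they are not part of PETSc
or of the helpers module used in the quoted test.

```python
# Sketch: separate "current" from "peak" figures in /proc/<pid>/status text.
# parse_status/read_self_status are hypothetical helpers for illustration.

def parse_status(text):
    """Return {field: kB} for the Vm* lines of a /proc/<pid>/status dump."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("Vm"):
            fields[name] = int(rest.split()[0])  # the kernel reports kB
    return fields


def read_self_status(path="/proc/self/status"):
    """Read the calling process's own status file (Linux only)."""
    with open(path) as handle:
        return parse_status(handle.read())


# Made-up sample values, NOT measurements from this thread.
SAMPLE = "Name:\tpython\nVmPeak:\t    4000 kB\nVmHWM:\t    3000 kB\nVmRSS:\t    1000 kB\n"

if __name__ == "__main__":
    mem = parse_status(SAMPLE)
    print("current RSS: %d kB, peak RSS: %d kB" % (mem["VmRSS"], mem["VmHWM"]))
```

If VmRSS stays close to VmHWM after the DA has been destroyed, the freed
pages have not been handed back to the OS, which is a different thing from
the memory still being in use by PETSc.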


> juhaj at dhcp071> CMD='import helpers;
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> from petsc4py import PETSc;
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> da = PETSc.DA().create(sizes=[100,100,100],
> proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE], boundary_type=[3,0,0],
> stencil_type=PETSc.DA.StencilType.BOX, dof=7, stencil_width=1,
> comm=PETSc.COMM_WORLD);
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'
> juhaj at dhcp071> mpirun -np 1 python -c "$CMD"
> 21 MiB / 22280 kB
> 21 MiB / 22304 kB
> 354 MiB / 419176 kB
> juhaj at dhcp071> mpirun -np 2 python -c "$CMD"
> 22 MiB / 23276 kB
> 22 MiB / 23020 kB
> 22 MiB / 23300 kB
> 22 MiB / 23044 kB
> 141 MiB / 145324 kB
> 141 MiB / 145068 kB
> juhaj at dhcp071> mpirun -np 4 python -c "$CMD"
> 22 MiB / 23292 kB
> 22 MiB / 23036 kB
> 22 MiB / 23316 kB
> 22 MiB / 23060 kB
> 22 MiB / 23316 kB
> 22 MiB / 23340 kB
> 22 MiB / 23044 kB
> 22 MiB / 23068 kB
> 81 MiB / 83716 kB
> 82 MiB / 83976 kB
> 81 MiB / 83964 kB
> 81 MiB / 83724 kB
>
> As one would expect, 4 ranks need more memory than 2 ranks, but quite
> unexpectedly, 1 rank needs more than 2! I guess you are right: the
> 1-rank case is not optimised and quite frankly, I don't mind: I only ever
> run small tests with one rank. Unfortunately, trying to create the simplest
> possible scenario to illustrate my point, I used a small DA and just one
> rank, precisely to avoid the case where the excess memory would be due to
> MPI buffers or such. Looks like my plan backfired. ;)
>
> But even still, my 53 MiB lattice, without any vectors created, takes 280 or
> 320 MiB of memory: a factor of <6, down from the original 6.6.
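
The figures in this paragraph can be cross-checked with a short calculation.
This is only a sketch: it assumes 8-byte (real, double precision) scalars,
which the thread does not state, and it reads the 280 and 320 MiB figures as
the summed per-rank RSS (2 x 141 and 4 x 81). Subtracting the roughly 22 MiB
each rank already used before the DA was created gives the growth instead.

```python
# Sketch: raw field size of the DA in the quoted test versus RSS growth.
sizes = (100, 100, 100)
dof = 7
bytes_per_scalar = 8  # ASSUMPTION: default real double-precision build

n_points = sizes[0] * sizes[1] * sizes[2]
data_bytes = n_points * dof * bytes_per_scalar
data_mib = data_bytes / 2.0**20

# RSS before/after DA creation, per rank, copied from the runs quoted above.
before_after_mib = {
    1: [(21, 354)],
    2: [(22, 141), (22, 141)],
    4: [(22, 81), (22, 82), (22, 81), (22, 81)],
}

# Total growth over all ranks for each run.
growth_mib = {}
for ranks, pairs in before_after_mib.items():
    growth_mib[ranks] = sum(after - before for before, after in pairs)

print("field data: %d bytes = %.1f MiB" % (data_bytes, data_mib))
for ranks in sorted(growth_mib):
    print("%d rank(s): +%d MiB in total, %.1f x the field data"
          % (ranks, growth_mib[ranks], growth_mib[ranks] / data_mib))
```

Measured this way the 2- and 4-rank runs grow by almost the same total
amount, and the 1-rank run stands out.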
>
> I will test with 3.3 later today if I have the time, but I'm pretty sure
> things were "better" there.
>
> On Monday 21 October 2013 15:23:01 Barry Smith wrote:
> >    Matt,
> >
> >      I think you are running on 1 process, where the DMDA doesn't have an
> > optimized path. When I run on 2 processes the numbers indicate nothing
> > proportional to dof * number of local points:
> >
> > dof = 12
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > [0] 7 21344 VecScatterCreate()
> > [0] 2 32 VecScatterCreateCommon_PtoS()
> > [0] 39 182480 VecScatterCreate_PtoS()
> >
> > dof = 8
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > [0] 7 21344 VecScatterCreate()
> > [0] 2 32 VecScatterCreateCommon_PtoS()
> > [0] 39 176080 VecScatterCreate_PtoS()
> >
> > dof = 4
> >
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > [0] 7 21344 VecScatterCreate()
> > [0] 2 32 VecScatterCreateCommon_PtoS()
> > [0] 39 169680 VecScatterCreate_PtoS()
> >
> > dof = 2
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > [0] 7 21344 VecScatterCreate()
> > [0] 2 32 VecScatterCreateCommon_PtoS()
> > [0] 39 166480 VecScatterCreate_PtoS()
> >
> > dof = 2, grid is 50 by 50 instead of 100 by 100
> >
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> > [0] 7 6352 VecScatterCreate()
> > [0] 2 32 VecScatterCreateCommon_PtoS()
> > [0] 39 43952 VecScatterCreate_PtoS()
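
The dof dependence in the VecScatterCreate_PtoS() lines above can be read off
with a few lines of arithmetic. A sketch only: the byte counts are copied
from the log output, and bytes_per_dof is a hypothetical helper name.

```python
# Bytes attributed to VecScatterCreate_PtoS() on rank 0 in the quoted
# -malloc_log output (100 x 100 grid, 2 processes), keyed by dof.
ptos_bytes = {2: 166480, 4: 169680, 8: 176080, 12: 182480}
ptos_bytes_50x50_dof2 = 43952  # same routine, 50 x 50 grid, dof = 2


def bytes_per_dof(log):
    """Slope in bytes per dof between consecutive dof values."""
    dofs = sorted(log)
    return [(log[b] - log[a]) // (b - a) for a, b in zip(dofs, dofs[1:])]


slopes = bytes_per_dof(ptos_bytes)
# Extrapolate to dof = 0 to estimate the part that does not depend on dof.
fixed_part = ptos_bytes[2] - 2 * slopes[0]

print("bytes per extra dof: %s" % slopes)
print("dof-independent part: %d bytes" % fixed_part)
print("100x100 vs 50x50 at dof=2: %.2f x"
      % (ptos_bytes[2] / float(ptos_bytes_50x50_dof2)))
```

The slope is the same between every pair of runs and is small next to the
dof-independent part, so the scatter setup grows with the grid size rather
than with dof.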
> >
> > The IS creation in the DMDA is far more troubling
> >
> > /Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> >
> > dof = 2
> >
> > [0] 1 20400 ISBlockSetIndices_Block()
> > [0] 15 3760 ISCreate()
> > [0] 4 128 ISCreate_Block()
> > [0] 1 16 ISCreate_Stride()
> > [0] 2 81600 ISGetIndices_Block()
> > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > [0] 7 42016 ISLocalToGlobalMappingCreate()
> >
> > dof = 4
> >
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > [0] 1 20400 ISBlockSetIndices_Block()
> > [0] 15 3760 ISCreate()
> > [0] 4 128 ISCreate_Block()
> > [0] 1 16 ISCreate_Stride()
> > [0] 2 163200 ISGetIndices_Block()
> > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > [0] 7 82816 ISLocalToGlobalMappingCreate()
> >
> > dof = 8
> >
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > [0] 1 20400 ISBlockSetIndices_Block()
> > [0] 15 3760 ISCreate()
> > [0] 4 128 ISCreate_Block()
> > [0] 1 16 ISCreate_Stride()
> > [0] 2 326400 ISGetIndices_Block()
> > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > [0] 7 164416 ISLocalToGlobalMappingCreate()
> >
> > dof = 12
> > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> > [0] 1 20400 ISBlockSetIndices_Block()
> > [0] 15 3760 ISCreate()
> > [0] 4 128 ISCreate_Block()
> > [0] 1 16 ISCreate_Stride()
> > [0] 2 489600 ISGetIndices_Block()
> > [0] 1 20400 ISLocalToGlobalMappingBlock()
> > [0] 7 246016 ISLocalToGlobalMappingCreate()
> >
> > Here the accessing of indices is at the point level (as well as block) and
> > hence memory usage is proportional to dof * local number of grid points. Of
> > course it is still only proportional to the vector size. There is some
> > improvement we could make: with a lot of refactoring we can remove the
> > dof* completely; with a little refactoring we can bring it down to a single
> > dof * local number of grid points.
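
The proportionality can be checked directly against the IS numbers quoted
above. This is a sketch using only the logged byte counts; decompose is a
hypothetical helper, and the 20400-byte ISBlockSetIndices_Block() entry is
used as the unit because it is the same for every dof.

```python
# Rank-0 byte counts copied from the quoted -malloc_log output,
# keyed by dof (100 x 100 grid, 2 processes).
is_get_indices = {2: 81600, 4: 163200, 8: 326400, 12: 489600}
l2g_create = {2: 42016, 4: 82816, 8: 164416, 12: 246016}
block_bytes = 20400  # ISBlockSetIndices_Block(), identical for every dof


def decompose(dof):
    """Express both dof-dependent entries in units of dof * block_bytes."""
    copies = is_get_indices[dof] / float(dof * block_bytes)
    remainder = l2g_create[dof] - dof * block_bytes
    return copies, remainder


for dof in sorted(is_get_indices):
    copies, remainder = decompose(dof)
    print("dof=%2d: ISGetIndices_Block = %.1f x dof x %d bytes, "
          "ISLocalToGlobalMappingCreate = dof x %d + %d bytes"
          % (dof, copies, block_bytes, block_bytes, remainder))
```

For every dof this gives exactly two dof-sized arrays in ISGetIndices_Block()
and one more, plus a constant remainder, in ISLocalToGlobalMappingCreate():
three arrays of dof * 20400 bytes in total, where the refactoring mentioned
above would leave a single one.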
> >
> >    I cannot understand why you are seeing memory usage 7 times more than a
> > vector. That seems like a lot.
> >
> >    Barry
> >
> > On Oct 21, 2013, at 11:32 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >   The PETSc DMDA object greedily allocates several arrays of data used to
> > >   set up the communication and other things like local to global mappings
> > >   even before you create any vectors. This is why you see this big bump
> > >   in memory usage.
> > >
> > >   BUT I don't think it should be any worse in 3.4 than in 3.3 or earlier;
> > >   at least we did not intend to make it worse. Are you sure it is using
> > >   more memory than in 3.3?
> > >
> > >   In order for us to decrease the memory usage of the DMDA setup it would
> > >   be helpful if we knew which objects created within it used the most
> > >   memory.  There is some sloppiness in that routine of not reusing memory
> > >   as well as could be; not sure how much difference that would make.
> > >
> > >
> > >   Barry
> > >
> > > On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
> > >> Dear list members,
> > >>
> > >> I have noticed strange memory consumption after upgrading to the 3.4
> > >> series. I never had time to properly investigate, but here is what
> > >> happens [yes, this might be a petsc4py issue, but I doubt it]:
> > >>
> > >> # helpers contains _ProcessMemoryInfoProc routine which just digs the
> > >> # memory usage data from /proc
> > >> import helpers
> > >> procdata=helpers._ProcessMemoryInfoProc()
> > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > >> from petsc4py import PETSc
> > >> procdata=helpers._ProcessMemoryInfoProc()
> > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > >> da = PETSc.DA().create(sizes=[100,100,100],
> > >>                      proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> > >>                      boundary_type=[3,0,0],
> > >>                      stencil_type=PETSc.DA.StencilType.BOX,
> > >>                      dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
> > >>
> > >> procdata=helpers._ProcessMemoryInfoProc()
> > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > >> vec=da.createGlobalVec()
> > >> procdata=helpers._ProcessMemoryInfoProc()
> > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> > >>
> > >> outputs
> > >>
> > >> 48 MiB / 49348 kB
> > >> 48 MiB / 49360 kB
> > >> 381 MiB / 446228 kB
> > >> 435 MiB / 446228 kB
> > >>
> > >> Which is odd: size of the actual data to be stored in the da is just
> > >> about 56 megabytes, so why does creating the da consume 7 times that?
> > >> And why does the DA reserve the memory in the first place? I thought
> > >> memory only gets allocated once an associated vector is created, and it
> > >> does look like the createGlobalVec call allocates the right amount of
> > >> data. But what is that 330 MiB that DA().create() consumes? [It's
> > >> actually the .setUp() method that does the consuming, but that's not of
> > >> much use as it needs to be called before a vector can be created.]
> > >>
> > >> Cheers,
> > >> Juha
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener