[petsc-users] memory use of a DMDA

Barry Smith bsmith at mcs.anl.gov
Tue Oct 22 13:28:06 CDT 2013


   I have pushed a new branch to the PETSc repository, called barry/reduce-dmsetup-da-memoryusage, which cuts to one third the amount of memory that is of order dof * (number of local grid points) in DMSetUp(). With this change you should see a pretty good improvement in "wasted" memory.

  See https://bitbucket.org/petsc/petsc/wiki/Home for accessing the branch.
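
   (If you already have a clone of the PETSc repository, something along the lines of "git fetch" followed by "git checkout barry/reduce-dmsetup-da-memoryusage" should get you the branch; the wiki page above has the full instructions.)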

   Barry



On Oct 22, 2013, at 3:57 AM, Juha Jäykkä <juhaj at iki.fi> wrote:

> Barry,
> 
> I seem to have touched a topic which goes way past my knowledge of PETSc 
> internals, but it's very nice to see a thorough response nevertheless. Thank 
> you. And Matthew, too.
> 
> After reading your suspicions about the number of ranks, I tried with 1, 2 and 
> 4 ranks, and the memory use indeed seems to go down once there is more than one:
> 
> juhaj at dhcp071> CMD='import helpers; procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> from petsc4py import PETSc;
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1];
> da = PETSc.DA().create(sizes=[100,100,100],
>                        proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
>                        boundary_type=[3,0,0],
>                        stencil_type=PETSc.DA.StencilType.BOX,
>                        dof=7, stencil_width=1, comm=PETSc.COMM_WORLD);
> procdata=helpers._ProcessMemoryInfoProc();
> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'
> juhaj at dhcp071> mpirun -np 1 python -c "$CMD"
> 21 MiB / 22280 kB
> 21 MiB / 22304 kB
> 354 MiB / 419176 kB
> juhaj at dhcp071> mpirun -np 2 python -c "$CMD"
> 22 MiB / 23276 kB
> 22 MiB / 23020 kB
> 22 MiB / 23300 kB
> 22 MiB / 23044 kB
> 141 MiB / 145324 kB
> 141 MiB / 145068 kB
> juhaj at dhcp071> mpirun -np 4 python -c "$CMD"
> 22 MiB / 23292 kB
> 22 MiB / 23036 kB
> 22 MiB / 23316 kB
> 22 MiB / 23060 kB
> 22 MiB / 23316 kB
> 22 MiB / 23340 kB
> 22 MiB / 23044 kB
> 22 MiB / 23068 kB
> 81 MiB / 83716 kB
> 82 MiB / 83976 kB
> 81 MiB / 83964 kB
> 81 MiB / 83724 kB
> 
> As one would expect, 4 ranks need more total memory than 2 ranks (roughly 325 
> vs 282 MiB here), but quite unexpectedly, 1 rank needs more than 2! I guess you 
> are right: the 1-rank case is not optimised, and quite frankly, I don't mind: I 
> only ever run small tests with one rank. Unfortunately, in trying to create the 
> simplest possible scenario to illustrate my point, I used a small DA and just 
> one rank, precisely to avoid the case where the excess memory would be due to 
> MPI buffers or such. Looks like my plan backfired. ;)
> 
> But even still, my 53 MiB lattice, without any vectors created, takes 280 or 
> 320 MiB of memory in total: an overhead ratio of less than 6, down from the 
> original 6.6.
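> 
> (For scale: 100^3 points * 7 dof * 8 bytes per value is about 53.4 MiB, which 
> is where the 53 MiB figure comes from.)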
> 
> I will test with 3.3 later today if I have the time, but I'm pretty sure 
> things were "better" there.
> 
> On Monday 21 October 2013 15:23:01 Barry Smith wrote:
>>   Matt,
>> 
>>     I think you are running on 1 process, where the DMDA doesn't have an
>> optimized path; when I run on 2 processes the numbers indicate nothing
>> proportional to dof * number of local points:
>> 
>> dof = 12
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 182480 VecScatterCreate_PtoS()
>> 
>> dof = 8
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 176080 VecScatterCreate_PtoS()
>> 
>> dof = 4
>> 
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 169680 VecScatterCreate_PtoS()
>> 
>> dof = 2
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 21344 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 166480 VecScatterCreate_PtoS()
>> 
>> dof = 2, grid 50 by 50 instead of 100 by 100
>> 
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
>> [0] 7 6352 VecScatterCreate()
>> [0] 2 32 VecScatterCreateCommon_PtoS()
>> [0] 39 43952 VecScatterCreate_PtoS()
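>> 
>> (Back-of-the-envelope: VecScatterCreate_PtoS goes from 166480 bytes at dof=2 
>> to 182480 at dof=12, i.e. about 1600 bytes per unit of dof; that is consistent 
>> with a few 4-byte indices per point of the roughly 100-point interprocess 
>> interface, not with anything times the 5000 local grid points.)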
>> 
>> The IS creation in the DMDA is far more troubling
>> 
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> 
>> dof = 2
>> 
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 81600 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 42016 ISLocalToGlobalMappingCreate()
>> 
>> dof = 4
>> 
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 163200 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 82816 ISLocalToGlobalMappingCreate()
>> 
>> dof = 8
>> 
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 326400 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 164416 ISLocalToGlobalMappingCreate()
>> 
>> dof = 12
>> ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>> [0] 1 20400 ISBlockSetIndices_Block()
>> [0] 15 3760 ISCreate()
>> [0] 4 128 ISCreate_Block()
>> [0] 1 16 ISCreate_Stride()
>> [0] 2 489600 ISGetIndices_Block()
>> [0] 1 20400 ISLocalToGlobalMappingBlock()
>> [0] 7 246016 ISLocalToGlobalMappingCreate()
>> 
>> Here the indices are accessed at the point level (as well as the block level), 
>> and hence memory usage is proportional to dof * local number of grid points. Of 
>> course it is still only proportional to the vector size. There is some 
>> improvement we could make: with a lot of refactoring we can remove the dof* 
>> completely; with a little refactoring we can bring it down to a single 
>> dof * local number of grid points.
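>> 
>> (Concretely: the two ISGetIndices_Block allocations above are each 
>> 5100 * dof * 4 bytes, assuming each rank owns a 50 by 100 piece of the grid 
>> plus one 100-point ghost row, i.e. 5100 ghosted points, times dof components, 
>> times a 4-byte PetscInt; likewise the 20400-byte block-level lists are 
>> 5100 * 4 bytes.)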
>> 
>>   I cannot understand why you are seeing memory usage 7 times the size of a
>> vector. That seems like a lot.
>> 
>>   Barry
>> 
>> On Oct 21, 2013, at 11:32 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>  The PETSc DMDA object greedily allocates several arrays of data used to
>>>  set up the communication and other things like local to global mappings
>>>  even before you create any vectors. This is why you see this big bump
>>>  in memory usage.
>>> 
>>>  BUT I don't think it should be any worse in 3.4 than in 3.3 or earlier;
>>>  at least we did not intend to make it worse. Are you sure it is using
>>>  more memory than 3.3 did?
>>> 
>>>  In order for us to decrease the memory usage of the DMDA setup it would
>>>  be helpful if we knew which objects created within it use the most
>>>  memory.  There is also some sloppiness in that routine about not reusing
>>>  memory as well as it could; I am not sure how much difference that would make.
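>>> 
>>>  (One way to get that breakdown: run with the -malloc_log option, which
>>>  prints, for each rank, the number of mallocs and the total bytes attributed
>>>  to each PETSc creation routine.)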
>>> 
>>> 
>>>  Barry
>>> 
>>> On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <juhaj at iki.fi> wrote:
>>>> Dear list members,
>>>> 
>>>> I have noticed strange memory consumption after upgrading to the 3.4 series. 
>>>> I never had time to investigate properly, but here is what happens [yes, this 
>>>> might be a petsc4py issue, but I doubt it]:
>>>> 
>>>> # helpers contains the _ProcessMemoryInfoProc routine, which just digs the
>>>> # memory usage data out of /proc
>>>> import helpers
>>>> procdata = helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> from petsc4py import PETSc
>>>> procdata = helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> da = PETSc.DA().create(sizes=[100,100,100],
>>>>                        proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
>>>>                        boundary_type=[3,0,0],
>>>>                        stencil_type=PETSc.DA.StencilType.BOX,
>>>>                        dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
>>>> procdata = helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> vec = da.createGlobalVec()
>>>> procdata = helpers._ProcessMemoryInfoProc()
>>>> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
>>>> 
>>>> outputs
>>>> 
>>>> 48 MiB / 49348 kB
>>>> 48 MiB / 49360 kB
>>>> 381 MiB / 446228 kB
>>>> 435 MiB / 446228 kB
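>>>> 
>>>> (helpers is just my own little module; for anyone wanting to reproduce this
>>>> without it, a minimal Linux-only stand-in for _ProcessMemoryInfoProc would
>>>> be something like
>>>> 
>>>> def rss_kb():
>>>>     # resident set size of this process, in kB, from /proc/self/status
>>>>     with open("/proc/self/status") as f:
>>>>         for line in f:
>>>>             if line.startswith("VmRSS:"):
>>>>                 return int(line.split()[1])
>>>> 
>>>> which gives the same kind of kB figure as printed above.)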
>>>> 
>>>> Which is odd: the size of the actual data to be stored in the da is just
>>>> about 56 megabytes, so why does creating the da consume roughly 7 times that?
>>>> And why does the DA reserve the memory in the first place? I thought memory
>>>> only gets allocated once an associated vector is created, and indeed the
>>>> createGlobalVec call looks like it allocates the right amount of data. But
>>>> what is the 330 MiB that DA().create() consumes? [It's actually the .setUp()
>>>> method that does the consuming, but that's not of much help, as it has to be
>>>> called before a vector can be created.]
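>>>> 
>>>> (The last step at least checks out: 435 - 381 = 54 MiB, which is just the
>>>> one global vector's worth of field data.)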
>>>> 
>>>> Cheers,
>>>> Juha


