<div dir="ltr">On Tue, Oct 22, 2013 at 5:12 AM, Juha Jäykkä <span dir="ltr"><<a href="mailto:juhaj@iki.fi" target="_blank">juhaj@iki.fi</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Tuesday 22 October 2013 05:06:02 Matthew Knepley wrote:<br>

> On Tue, Oct 22, 2013 at 3:57 AM, Juha Jäykkä <<a href="mailto:juhaj@iki.fi">juhaj@iki.fi</a>> wrote:<br>

> > Barry,<br>

> ><br>

> > I seem to have touched a topic which goes way past my knowledge of PETSc<br>

> > internals, but it's very nice to see a thorough response nevertheless.<br>

> > Thank you. And Matthew, too.<br>

> ><br>

> > After reading your suspicions about the number of ranks, I tried with 1, 2 and 4,<br>

> > and the memory use indeed seems to go down from 1:<br>

> I am now convinced that /proc is showing total memory ever allocated, since the OS is not<br>

> recovering any freed memory. If you want to see memory allocated but not freed, just<br>

> do not destroy the DA and run with -malloc_test.<br>

<br>

</div>I'm not sure what you mean here: I'm interested in the maximum amount of<br>

memory used at any point during the program execution. /proc is supposed to<br>

know that. And for a longer run, ps, top and /proc do indeed agree, so I think<br>

I have the right numbers.<br>
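A note on measuring the peak, as a minimal sketch and not anything taken from the helpers module used in this thread: on Linux, /proc/self/status reports the current resident set size as VmRSS and its high-water mark as VmHWM, so the peak can be read directly rather than inferred. The function names below are hypothetical, and which field procdata.os_specific[3][1] corresponds to is not stated in the thread.

```python
import re


def parse_status_kib(text, fields=("VmRSS", "VmHWM")):
    """Return {field: KiB} for the requested fields of a /proc status dump."""
    found = {}
    for line in text.splitlines():
        # Memory lines look like "VmHWM:    419176 kB".
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match and match.group(1) in fields:
            found[match.group(1)] = int(match.group(2))
    return found


def read_self_status(path="/proc/self/status"):
    """Read current (VmRSS) and peak (VmHWM) resident memory of this process."""
    with open(path) as handle:
        return parse_status_kib(handle.read())
```

Calling read_self_status() before and after the DA is created should show whether VmHWM ends up above VmRSS, that is, whether some of the setup memory was temporary and has since been handed back to the OS.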

<br>

Why the peak? Because I'm running on several machines where I get killed for<br>

exceeding a memory limit. Sometimes the limit is on VSZ, which is not a<br>

problem, but sometimes the limit is on RSS, which does present a problem,<br>

especially on some machines where there is no swap, so I need to stay below<br>

the physical main memory limit all the time. If some of the memory gets freed<br>

later, it is of no use to me because by then I'm dead.<br>

<br>

If I misunderstood something, please point it out. ;)<br></blockquote><div><br></div><div>We sometimes allocate temporary memory that we free afterwards (like two vectors which</div><div>we use to set up the scatters). It will not be allocated at the same time as other vectors,</div>

<div>but if the OS does not reclaim the memory, it will show up on RSS.</div><div><br></div><div>  Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Cheers,<br>

Juha<br>

<div class="HOEnZb"><div class="h5"><br>

><br>

>    Matt<br>

><br>

> > juhaj@dhcp071> CMD='import helpers;<br>

> > procdata=helpers._ProcessMemoryInfoProc();<br>

> > print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]; from<br>

> > petsc4py<br>

> > import PETSc; procdata=helpers._ProcessMemoryInfoProc(); print<br>

> > procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]; da =<br>

> > PETSc.DA().create(sizes=[100,100,100],<br>

> > proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],<br>

> > boundary_type=[3,0,0],<br>

> > stencil_type=PETSc.DA.StencilType.BOX, dof=7, stencil_width=1,<br>

> > comm=PETSc.COMM_WORLD); procdata=helpers._ProcessMemoryInfoProc(); print<br>

> > procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]'<br>

> > juhaj@dhcp071> mpirun -np 1 python -c "$CMD"<br>

> > 21 MiB / 22280 kB<br>

> > 21 MiB / 22304 kB<br>

> > 354 MiB / 419176 kB<br>

> > juhaj@dhcp071> mpirun -np 2 python -c "$CMD"<br>

> > 22 MiB / 23276 kB<br>

> > 22 MiB / 23020 kB<br>

> > 22 MiB / 23300 kB<br>

> > 22 MiB / 23044 kB<br>

> > 141 MiB / 145324 kB<br>

> > 141 MiB / 145068 kB<br>

> > juhaj@dhcp071> mpirun -np 4 python -c "$CMD"<br>

> > 22 MiB / 23292 kB<br>

> > 22 MiB / 23036 kB<br>

> > 22 MiB / 23316 kB<br>

> > 22 MiB / 23060 kB<br>

> > 22 MiB / 23316 kB<br>

> > 22 MiB / 23340 kB<br>

> > 22 MiB / 23044 kB<br>

> > 22 MiB / 23068 kB<br>

> > 81 MiB / 83716 kB<br>

> > 82 MiB / 83976 kB<br>

> > 81 MiB / 83964 kB<br>

> > 81 MiB / 83724 kB<br>

> ><br>

> > As one would expect, 4 ranks need more memory than 2 ranks, but quite<br>

> > unexpectedly, 1 rank needs more than 2! I guess you are right: the 1-rank case<br>

> > is not optimised and quite frankly, I don't mind: I only ever run small tests<br>

> > with one rank. Unfortunately, trying to create the simplest possible scenario<br>

> > to illustrate my point, I used a small DA and just one rank, precisely to<br>

> > avoid the case where the excess memory would be due to MPI buffers or such.<br>

> > Looks like my plan backfired. ;)<br>

> ><br>

> > But even so, my 53 MiB lattice, without any vectors created, takes 280 or<br>

> > 320 MiB of memory – down to a factor of <6 from the original 6.6.<br>

> ><br>

> > I will test with 3.3 later today if I have the time, but I'm pretty sure<br>

> > things were "better" there.<br>

> ><br>

> > On Monday 21 October 2013 15:23:01 Barry Smith wrote:<br>

> > >    Matt,<br>

> > ><br>

> > >      I think you are running on 1 process where the DMDA doesn't have an<br>

> > ><br>

> > > optimized path, when I run on 2 processes the numbers indicate nothing<br>

> > > proportional to dof* number of local points<br>

> > ><br>

> > > dof = 12<br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter<br>

> > > [0] 7 21344 VecScatterCreate()<br>

> > > [0] 2 32 VecScatterCreateCommon_PtoS()<br>

> > > [0] 39 182480 VecScatterCreate_PtoS()<br>

> > ><br>

> > > dof = 8<br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter<br>

> > > [0] 7 21344 VecScatterCreate()<br>

> > > [0] 2 32 VecScatterCreateCommon_PtoS()<br>

> > > [0] 39 176080 VecScatterCreate_PtoS()<br>

> > ><br>

> > > dof = 4<br>

> > ><br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter<br>

> > > [0] 7 21344 VecScatterCreate()<br>

> > > [0] 2 32 VecScatterCreateCommon_PtoS()<br>

> > > [0] 39 169680 VecScatterCreate_PtoS()<br>

> > ><br>

> > > dof = 2<br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter<br>

> > > [0] 7 21344 VecScatterCreate()<br>

> > > [0] 2 32 VecScatterCreateCommon_PtoS()<br>

> > > [0] 39 166480 VecScatterCreate_PtoS()<br>

> > ><br>

> > > dof =2 grid is 50 by 50 instead of 100 by 100<br>

> > ><br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter<br>

> > > [0] 7 6352 VecScatterCreate()<br>

> > > [0] 2 32 VecScatterCreateCommon_PtoS()<br>

> > > [0] 39 43952 VecScatterCreate_PtoS()<br>

> > ><br>

> > > The IS creation in the DMDA is far more troubling<br>

> > ><br>

> > > /Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS<br>

> > ><br>

> > > dof = 2<br>

> > ><br>

> > > [0] 1 20400 ISBlockSetIndices_Block()<br>

> > > [0] 15 3760 ISCreate()<br>

> > > [0] 4 128 ISCreate_Block()<br>

> > > [0] 1 16 ISCreate_Stride()<br>

> > > [0] 2 81600 ISGetIndices_Block()<br>

> > > [0] 1 20400 ISLocalToGlobalMappingBlock()<br>

> > > [0] 7 42016 ISLocalToGlobalMappingCreate()<br>

> > ><br>

> > > dof = 4<br>

> > ><br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS<br>

> > > [0] 1 20400 ISBlockSetIndices_Block()<br>

> > > [0] 15 3760 ISCreate()<br>

> > > [0] 4 128 ISCreate_Block()<br>

> > > [0] 1 16 ISCreate_Stride()<br>

> > > [0] 2 163200 ISGetIndices_Block()<br>

> > > [0] 1 20400 ISLocalToGlobalMappingBlock()<br>

> > > [0] 7 82816 ISLocalToGlobalMappingCreate()<br>

> > ><br>

> > > dof = 8<br>

> > ><br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS<br>

> > > [0] 1 20400 ISBlockSetIndices_Block()<br>

> > > [0] 15 3760 ISCreate()<br>

> > > [0] 4 128 ISCreate_Block()<br>

> > > [0] 1 16 ISCreate_Stride()<br>

> > > [0] 2 326400 ISGetIndices_Block()<br>

> > > [0] 1 20400 ISLocalToGlobalMappingBlock()<br>

> > > [0] 7 164416 ISLocalToGlobalMappingCreate()<br>

> > ><br>

> > > dof = 12<br>

> > > ~/Src/petsc/test  master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS<br>

> > > [0] 1 20400 ISBlockSetIndices_Block()<br>

> > > [0] 15 3760 ISCreate()<br>

> > > [0] 4 128 ISCreate_Block()<br>

> > > [0] 1 16 ISCreate_Stride()<br>

> > > [0] 2 489600 ISGetIndices_Block()<br>

> > > [0] 1 20400 ISLocalToGlobalMappingBlock()<br>

> > > [0] 7 246016 ISLocalToGlobalMappingCreate()<br>

> > ><br>

> > > Here the accessing of indices is at the point level (as well as block) and<br>

> > > hence memory usage is proportional to dof * local number of grid points. Of<br>

> > > course it is still only proportional to the vector size. There is some<br>

> > > improvement we could make; with a lot of refactoring we can remove the<br>

> > > dof* completely, with a little refactoring we can bring it down to a single<br>

> > > dof * local number of grid points.<br>
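The scaling described above can be checked against the -malloc_log byte counts quoted in this message. The sketch below is only arithmetic on those logged numbers; fit_line is a hypothetical helper, not a PETSc routine. It draws a line through the smallest and largest dof and checks that the remaining samples fall on it.

```python
# Bytes reported by -malloc_log on rank 0 in the quoted runs, keyed by dof.
IS_GET_INDICES = {2: 81600, 4: 163200, 8: 326400, 12: 489600}
L2G_CREATE = {2: 42016, 4: 82816, 8: 164416, 12: 246016}
PTOS_CREATE = {2: 166480, 4: 169680, 8: 176080, 12: 182480}


def fit_line(samples):
    """Return (bytes per dof, constant bytes, exact) for a {dof: bytes} table.

    The line goes through the smallest and largest dof; `exact` is True
    only if every sample in the table lies on that line.
    """
    dofs = sorted(samples)
    lo, hi = dofs[0], dofs[-1]
    slope = (samples[hi] - samples[lo]) / (hi - lo)
    intercept = samples[lo] - slope * lo
    exact = all(samples[d] == slope * d + intercept for d in dofs)
    return slope, intercept, exact


for name, table in [
    ("ISGetIndices_Block", IS_GET_INDICES),
    ("ISLocalToGlobalMappingCreate", L2G_CREATE),
    ("VecScatterCreate_PtoS", PTOS_CREATE),
]:
    slope, intercept, exact = fit_line(table)
    print(f"{name}: {slope:.0f} bytes/dof + {intercept:.0f} bytes (exact: {exact})")
```

On the quoted numbers, ISGetIndices_Block() grows by 40800 bytes per dof with no constant part and ISLocalToGlobalMappingCreate() by 20400 bytes per dof plus 1216 bytes, while VecScatterCreate_PtoS() grows by only 1600 bytes per dof on top of 163280 bytes. That is consistent with the IS routines, not the scatter, carrying the dof-proportional cost.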

> > ><br>

> > >    I cannot understand why you are seeing memory usage 7 times more than a<br>

> > > vector. That seems like a lot.<br>

> > ><br>

> > >    Barry<br>

> > ><br>

> > > On Oct 21, 2013, at 11:32 AM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>

> > > >   The PETSc DMDA object greedily allocates several arrays of data used to<br>

> > > >   set up the communication and other things like local to global mappings<br>

> > > >   even before you create any vectors. This is why you see this big bump<br>

> > > >   in memory usage.<br>

> > > ><br>

> > > >   BUT I don't think it should be any worse in 3.4 than in 3.3 or earlier;<br>

> > > >   at least we did not intend to make it worse. Are you sure it is using<br>

> > > >   more memory than in 3.3?<br>

> > > ><br>

> > > >   In order for us to decrease the memory usage of the DMDA setup it would<br>

> > > >   be helpful if we knew which objects created within it used the most<br>

> > > >   memory.  There is some sloppiness in that routine of not reusing memory<br>

> > > >   as well as could be, not sure how much difference that would make.<br>

> > > ><br>

> > > ><br>

> > > >   Barry<br>

> > > ><br>

> > > > On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <<a href="mailto:juhaj@iki.fi">juhaj@iki.fi</a>> wrote:<br>

> > > >> Dear list members,<br>

> > > >><br>

> > > >> I have noticed strange memory consumption after upgrading to the 3.4 series.<br>

> > > >> I never had time to properly investigate, but here is what happens<br>

> > > >> [yes, this might be a petsc4py issue, but I doubt it]:<br>

> > > >><br>

> > > >> # helpers contains _ProcessMemoryInfoProc routine which just digs the<br>

> > > >> # memory usage data from /proc<br>

> > > >> import helpers<br>

> > > >> procdata=helpers._ProcessMemoryInfoProc()<br>

> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]<br>

> > > >> from petsc4py import PETSc<br>

> > > >> procdata=helpers._ProcessMemoryInfoProc()<br>

> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]<br>

> > > >> da = PETSc.DA().create(sizes=[100,100,100],<br>

> > > >>                      proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],<br>

> > > >>                      boundary_type=[3,0,0],<br>

> > > >>                      stencil_type=PETSc.DA.StencilType.BOX,<br>

> > > >>                      dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)<br>

> > > >><br>

> > > >> procdata=helpers._ProcessMemoryInfoProc()<br>

> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]<br>

> > > >> vec=da.createGlobalVec()<br>

> > > >> procdata=helpers._ProcessMemoryInfoProc()<br>

> > > >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]<br>

> > > >><br>

> > > >> outputs<br>

> > > >><br>

> > > >> 48 MiB / 49348 kB<br>

> > > >> 48 MiB / 49360 kB<br>

> > > >> 381 MiB / 446228 kB<br>

> > > >> 435 MiB / 446228 kB<br>

> > > >><br>

> > > >> Which is odd: size of the actual data to be stored in the da is just<br>

> > > >> about 56 megabytes, so why does creating the da consume 7 times that?<br>

> > > >> And why does the DA reserve the memory in the first place? I thought<br>

> > > >> memory only gets allocated once an associated vector is created, and it<br>

> > > >> indeed looks like the createGlobalVec call does allocate the right amount<br>

> > > >> of data. But what is that 330 MiB that DA().create() consumes? [It's actually<br>

> > > >> the .setUp() method that does the consuming, but that's not of much use as<br>

> > > >> it needs to be called before a vector can be created.]<br>
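The figures in this paragraph can be reproduced with a few lines of arithmetic. This is a sketch assuming 8-byte real scalars (a double-precision PETSc build, which the thread does not state explicitly but which matches the quoted 56 megabytes); vector_bytes is a hypothetical helper, not a petsc4py call.

```python
MIB = 2**20


def vector_bytes(sizes, dof, scalar_bytes=8):
    """Bytes needed to store dof scalars at every point of a grid with the given sizes."""
    points = 1
    for n in sizes:
        points *= n
    return points * dof * scalar_bytes


vec_mib = vector_bytes((100, 100, 100), dof=7) / MIB

# RSS readings in MiB from the quoted output: before DA creation,
# after DA creation, and after createGlobalVec.
rss_before, rss_after_da, rss_after_vec = 48, 381, 435
da_overhead_mib = rss_after_da - rss_before

print(f"one global vector: {vec_mib:.1f} MiB")
print(f"DA creation: {da_overhead_mib} MiB, about {da_overhead_mib / vec_mib:.1f} vectors")
print(f"createGlobalVec: {rss_after_vec - rss_after_da} MiB")
```

Under that assumption one global vector is 56,000,000 bytes, or about 53.4 MiB. The 333 MiB jump at DA creation is then a little over six vectors' worth, and the 54 MiB jump at createGlobalVec matches a single vector.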

> > > >><br>

> > > >> Cheers,<br>

> > > >> Juha<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener

</div></div>