<div dir="ltr">On Mon, Oct 21, 2013 at 3:23 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Matt,<br>
<br>
I think you are running on 1 process where the DMDA doesn't have an optimized path, when I run on 2 processes the numbers indicate nothing proportional to dof* number of local points<br></blockquote><div><br></div>
<div>Yes, I figured if it was not doing the right thing on 1, why go to more? :)</div><div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> dof = 12
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> [0] 7 21344 VecScatterCreate()
> [0] 2 32 VecScatterCreateCommon_PtoS()
> [0] 39 182480 VecScatterCreate_PtoS()
>
> dof = 8
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> [0] 7 21344 VecScatterCreate()
> [0] 2 32 VecScatterCreateCommon_PtoS()
> [0] 39 176080 VecScatterCreate_PtoS()
>
> dof = 4
>
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> [0] 7 21344 VecScatterCreate()
> [0] 2 32 VecScatterCreateCommon_PtoS()
> [0] 39 169680 VecScatterCreate_PtoS()
>
> dof = 2
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> [0] 7 21344 VecScatterCreate()
> [0] 2 32 VecScatterCreateCommon_PtoS()
> [0] 39 166480 VecScatterCreate_PtoS()
>
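> As a rough cross-check that these sizes are not proportional to dof * local points (a sketch, assuming the 100 by 100 grid splits into two 50 by 100 strips, so on the order of 100 interface points per rank, and 4-byte PetscInt):
>
> # VecScatterCreate_PtoS() totals from the runs above, keyed by dof
> totals = {2: 166480, 4: 169680, 8: 176080, 12: 182480}
> for dof in (4, 8, 12):
>     print dof, (totals[dof] - totals[2]) / (dof - 2)   # -> 1600 bytes per extra component
>
> About 1600 bytes per added dof is consistent with the ~100 interface points times a handful of 4-byte index arrays, nowhere near the ~5100 owned-plus-ghost points per rank that point-level indexing would bring in.
>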
> dof = 2, grid is 50 by 50 instead of 100 by 100
>
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep VecScatter
> [0] 7 6352 VecScatterCreate()
> [0] 2 32 VecScatterCreateCommon_PtoS()
> [0] 39 43952 VecScatterCreate_PtoS()
>
> The IS creation in the DMDA is far more troubling.
>
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
>
> dof = 2
>
> [0] 1 20400 ISBlockSetIndices_Block()
> [0] 15 3760 ISCreate()
> [0] 4 128 ISCreate_Block()
> [0] 1 16 ISCreate_Stride()
> [0] 2 81600 ISGetIndices_Block()
> [0] 1 20400 ISLocalToGlobalMappingBlock()
> [0] 7 42016 ISLocalToGlobalMappingCreate()
>
> dof = 4
>
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> [0] 1 20400 ISBlockSetIndices_Block()
> [0] 15 3760 ISCreate()
> [0] 4 128 ISCreate_Block()
> [0] 1 16 ISCreate_Stride()
> [0] 2 163200 ISGetIndices_Block()
> [0] 1 20400 ISLocalToGlobalMappingBlock()
> [0] 7 82816 ISLocalToGlobalMappingCreate()
>
> dof = 8
>
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> [0] 1 20400 ISBlockSetIndices_Block()
> [0] 15 3760 ISCreate()
> [0] 4 128 ISCreate_Block()
> [0] 1 16 ISCreate_Stride()
> [0] 2 326400 ISGetIndices_Block()
> [0] 1 20400 ISLocalToGlobalMappingBlock()
> [0] 7 164416 ISLocalToGlobalMappingCreate()
>
> dof = 12
> ~/Src/petsc/test master $ petscmpiexec -n 2 ./ex1 -malloc_log | grep IS
> [0] 1 20400 ISBlockSetIndices_Block()
> [0] 15 3760 ISCreate()
> [0] 4 128 ISCreate_Block()
> [0] 1 16 ISCreate_Stride()
> [0] 2 489600 ISGetIndices_Block()
> [0] 1 20400 ISLocalToGlobalMappingBlock()
> [0] 7 246016 ISLocalToGlobalMappingCreate()
>
> Here the accessing of indices is at the point level (as well as the block level), and hence memory usage is proportional to dof * local number of grid points. Of course, it is still only proportional to the vector size. There is some improvement we could make here: with a lot of refactoring we could remove the dof* completely; with a little refactoring we could bring it down to a single dof * local number of grid points.
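>
> As a rough check of that proportionality against the numbers above (a sketch, assuming each rank holds a 50 by 100 strip plus one ghost row, i.e. about 5100 owned-plus-ghost points, 4-byte PetscInt, and the two allocations reported per ISGetIndices_Block() line):
>
> points = 51 * 100                      # assumed owned + ghost points on one rank
> for dof in (2, 4, 8, 12):
>     print dof, 2 * points * dof * 4    # -> 81600, 163200, 326400, 489600, matching ISGetIndices_Block()
> print points * 4                       # -> 20400, matching the block-level ISBlockSetIndices_Block()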
>
> I cannot understand why you are seeing memory usage 7 times that of a vector. That seems like a lot.
>
>    Barry
>
>
> On Oct 21, 2013, at 11:32 AM, Barry Smith <bsmith@mcs.anl.gov> wrote:
>
> >
> > The PETSc DMDA object greedily allocates several arrays of data used to set up the communication and other things, like local-to-global mappings, even before you create any vectors. This is why you see this big bump in memory usage.
> >
> > BUT I don't think it should be any worse in 3.4 than in 3.3 or earlier; at least we did not intend to make it worse. Are you sure it is using more memory than in 3.3?
> >
> > In order for us to decrease the memory usage of the DMDA setup, it would be helpful if we knew which objects created within it used the most memory. There is some sloppiness in that routine in not reusing memory as well as it could; I am not sure how much difference that would make.
> >
> >    Barry
> >
> >
> > On Oct 21, 2013, at 7:02 AM, Juha Jäykkä <juhaj@iki.fi> wrote:
> >
> >> Dear list members,
> >>
> >> I have noticed strange memory consumption after upgrading to the 3.4 series. I
> >> never had time to properly investigate, but here is what happens [yes, this
> >> might be a petsc4py issue, but I doubt it]:
> >>
> >> # helpers contains _ProcessMemoryInfoProc routine which just digs the memory
> >> # usage data from /proc
> >> import helpers
> >> procdata=helpers._ProcessMemoryInfoProc()
> >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> >> from petsc4py import PETSc
> >> procdata=helpers._ProcessMemoryInfoProc()
> >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> >> da = PETSc.DA().create(sizes=[100,100,100],
> >>                        proc_sizes=[PETSc.DECIDE,PETSc.DECIDE,PETSc.DECIDE],
> >>                        boundary_type=[3,0,0],
> >>                        stencil_type=PETSc.DA.StencilType.BOX,
> >>                        dof=7, stencil_width=1, comm=PETSc.COMM_WORLD)
> >> procdata=helpers._ProcessMemoryInfoProc()
> >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> >> vec=da.createGlobalVec()
> >> procdata=helpers._ProcessMemoryInfoProc()
> >> print procdata.rss/2**20, "MiB /", procdata.os_specific[3][1]
> >>
> >> outputs
> >>
> >> 48 MiB / 49348 kB
> >> 48 MiB / 49360 kB
> >> 381 MiB / 446228 kB
> >> 435 MiB / 446228 kB
> >>
> >> Which is odd: the size of the actual data to be stored in the da is just about 56
> >> megabytes, so why does creating the da consume 7 times that? And why does the
> >> DA reserve the memory in the first place? I thought memory only gets allocated
> >> once an associated vector is created, and it indeed looks like the
> >> createGlobalVec call allocates the right amount of data. But what is
> >> that 330 MiB that DA().create() consumes? [It's actually the .setUp()
> >> method that does the consuming, but that's not of much use, as it needs to be
> >> called before a vector can be created.]
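> >>
> >> As a rough sanity check of those figures (a sketch, assuming 8-byte scalars and using the RSS numbers printed above):
> >>
> >> print 100**3 * 7 * 8 / 2.0**20   # one dof=7 global vector: ~53.4 MiB, i.e. the "about 56 megabytes"
> >> print 435 - 381                  # RSS growth at createGlobalVec(): 54 MiB, about one vector
> >> print (381 - 48) / 53.4          # RSS growth at create()/setUp(): 333 MiB, roughly 6-7 vectors' worth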
> >>
> >> Cheers,
> >> Juha
> >>
> >
>

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener