[petsc-users] DMPlex memory problem in scaling test
Matthew Knepley
knepley at gmail.com
Thu Oct 10 18:28:10 CDT 2019
On Thu, Oct 10, 2019 at 4:26 PM Danyang Su <danyang.su at gmail.com> wrote:
> Hi All,
>
> Your guess is right. The memory problem occurs after
> DMPlexCreateFromCellList and DMPlexDistribute. The mesh related memory in
> the master processor is not released after that.
>
> The pseudo code I use is
>
> if (rank == 0) then  ! only the master processor reads the mesh file
>                      ! and creates the cell list
>
>   call DMPlexCreateFromCellList(Petsc_Comm_World, ndim, num_cells,  &
>                                 num_nodes, num_nodes_per_cell,      &
>                                 Petsc_False,                        &
>                                 dmplex_cells, ndim,                 &
>                                 dmplex_verts, dmda_flow%da, ierr)
>   ! Petsc_True would create intermediate mesh entities (faces,
>   ! edges), but it does not work for prisms in the current 3.8
>   ! version.
>   CHKERRQ(ierr)
>
> else  ! slave processors pass zero cells
>
>   call DMPlexCreateFromCellList(Petsc_Comm_World, ndim, 0, 0,       &
>                                 num_nodes_per_cell,                 &
>                                 Petsc_False,                        &
>                                 dmplex_cells, ndim,                 &
>                                 dmplex_verts, dmda_flow%da, ierr)
>   CHKERRQ(ierr)
>
> end if
>
> call DMPlexDistribute
>
> call DMDestroy(dmda_flow%da, ierr)
> CHKERRQ(ierr)
>
> ! set the global mesh to the distributed mesh
> dmda_flow%da = distributedMesh
>
>
> After calling the above functions, the memory usage for the test case (no.
> points 953,433, nprocs 160) is shown below:
> rank   0:  PETSc memory current 1610.39 MB, maximum 1690.42 MB
> rank 151:  PETSc memory current  105.00 MB, maximum  104.94 MB
> rank  98:  PETSc memory current  106.02 MB, maximum  105.95 MB
> rank  18:  PETSc memory current  106.17 MB, maximum  106.17 MB
>
> Is there any function available in the master version that can release
> this memory?
>
DMDestroy() releases this memory, UNLESS you are holding other objects that
refer to it, like a vector from that DM.
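As a small illustration of this reference-counting behavior for readers following along: the memory survives until the *last* holder lets go. The following is a Python analogy, not PETSc code; the `Mesh` class is invented for the sketch.

```python
import weakref

class Mesh:
    """Stand-in for a DM: a big object we want freed promptly."""
    def __init__(self, ncells):
        self.cells = list(range(ncells))

mesh = Mesh(100_000)
probe = weakref.ref(mesh)   # lets us check liveness without keeping it alive

vec = mesh                  # a second holder, like a Vec created from the DM
del mesh                    # analogous to destroying our own handle
assert probe() is not None  # still alive: 'vec' still refers to it

del vec                     # release the last reference
assert probe() is None      # only now is the memory actually reclaimed
```

This is Matt's point: destroying the DM handle drops one reference, but the underlying storage survives as long as any object created from the DM still holds it.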
Thanks,
Matt
> Thanks,
>
> Danyang
> On 2019-10-10 11:09 a.m., Mark Adams via petsc-users wrote:
>
> Now that I think about it, the partitioning and distribution can be done
> with the existing API, I would assume, as is done with matrices.
>
> I'm still wondering what the H5 format is. I assume that it is not built
> for a hardwired number of processes to read in parallel and that the
> parallel read is somewhat scalable.
>
> On Thu, Oct 10, 2019 at 12:13 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> A related question: what is the state of having something like a
>> distributed DMPlexCreateFromCellList method? Maybe your H5 efforts
>> would work. My bone modeling code is old and a pain, but the app's
>> specialized serial mesh generator could write an H5 file instead of the
>> current FEAP file. Then your reader, SNES, and a large-deformation
>> plasticity element in PetscFE could replace my code in the future.
>>
>> How does your H5 thing work? Is it basically a flat file (not
>> partitioned) that is read in parallel by slicing the cell lists, etc.,
>> using file seek or something equivalent, then reconstructing a local graph
>> on each processor to give to, say, ParMetis, and then completing the
>> distribution with this reasonable partitioning? (This is what our current
>> code does.)
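The slicing scheme Mark describes can be sketched in a few lines of plain Python. This simulates the flat, unpartitioned file with a temporary binary file and fakes the "ranks" in a loop; it is a schematic of the idea only, not the actual HDF5 reader.

```python
import os
import struct
import tempfile

# Write a flat, unpartitioned "mesh file": ncells cells, 3 vertex ids each.
ncells, verts_per_cell = 10, 3
cells = [(3 * c, 3 * c + 1, 3 * c + 2) for c in range(ncells)]
path = os.path.join(tempfile.mkdtemp(), "mesh.bin")
with open(path, "wb") as f:
    for cell in cells:
        f.write(struct.pack("3i", *cell))

def read_slice(rank, nprocs):
    """Each rank seeks to its own contiguous slice of the cell list."""
    lo = rank * ncells // nprocs
    hi = (rank + 1) * ncells // nprocs
    rec = verts_per_cell * 4                    # bytes per cell record
    with open(path, "rb") as f:
        f.seek(lo * rec)                        # the "file seek" step
        data = f.read((hi - lo) * rec)
    return [struct.unpack_from("3i", data, i * rec) for i in range(hi - lo)]

# Simulate 4 "ranks": together their slices cover the file exactly once.
slices = [read_slice(r, 4) for r in range(4)]
assert sum(slices, []) == cells
```

From each local slice one would then build the local adjacency graph to hand to a partitioner such as ParMetis, and redistribute accordingly, as described above.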
>>
>> Thanks,
>> Mark
>>
>> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>>
>>>
>>> On Thu 10. Oct 2019 at 15:15, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>>> On Thu, Oct 10, 2019 at 9:10 AM Dave May <dave.mayhem23 at gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley <knepley at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Oct 10, 2019 at 8:41 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu 10. Oct 2019 at 14:34, Matthew Knepley <knepley at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu, Oct 10, 2019 at 8:31 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear All,
>>>>>>>>>>>
>>>>>>>>>>> I have a question regarding the maximum memory usage for the
>>>>>>>>>>> scaling test. My code is written in Fortran with support for both
>>>>>>>>>>> structured grids (DMDA) and unstructured grids (DMPlex). It looks
>>>>>>>>>>> like memory consumption is much larger when DMPlex is used, and this
>>>>>>>>>>> finally causes an out-of-memory problem.
>>>>>>>>>>>
>>>>>>>>>>> Below are some tests using both structured and unstructured
>>>>>>>>>>> grids. The memory consumption by the code is estimated based on all
>>>>>>>>>>> allocated arrays, and the PETSc memory consumption is estimated using
>>>>>>>>>>> PetscMemoryGetMaximumUsage.
>>>>>>>>>>>
>>>>>>>>>>> I just wonder why the PETSc memory consumption does not decrease
>>>>>>>>>>> when the number of processors increases. For the structured-grid cases
>>>>>>>>>>> (scenarios 7-9), the memory consumption decreases as the number of
>>>>>>>>>>> processors increases. However, for the unstructured-grid cases
>>>>>>>>>>> (scenarios 14-16), the memory for the PETSc part remains unchanged.
>>>>>>>>>>> When I run a larger case, the code crashes because memory runs out.
>>>>>>>>>>> The same case works on another cluster with 480 GB of memory per
>>>>>>>>>>> node. Does this make sense?
>>>>>>>>>>>
>>>>>>>>>> We would need a finer breakdown of where memory is being used. I
>>>>>>>>>> did this for a paper:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>>>>>>>>>>
>>>>>>>>>> If the subdomains are small, the halo sizes can overwhelm the basic
>>>>>>>>>> storage. It looks like the subdomains are big here,
>>>>>>>>>> but things are not totally clear to me. It would be helpful to
>>>>>>>>>> send the output of -log_view for each case, since
>>>>>>>>>> PETSc tries to keep track of allocated memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt - I'd guess that there is a sequential (non-partitioned) mesh
>>>>>>>>> hanging around in memory.
>>>>>>>>> Is it possible that he creates the Plex object sequentially (stored
>>>>>>>>> and retained in memory and never released), and only afterwards
>>>>>>>>> distributes it?
>>>>>>>>> This can never happen with the DMDA, and the table verifies this.
>>>>>>>>> If his code paths using the DMDA and DMPlex are as identical as
>>>>>>>>> possible (apart from the DM used), then a sequential mesh held in
>>>>>>>>> memory seems the likely cause.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Dang it, Dave is always right.
>>>>>>>>
>>>>>>>> How to prevent this?
>>>>>>>>
>>>>>>>
>>>>>>> I thought you/Lawrence/Vaclav/others... had developed and provided
>>>>>>> support for a parallel DMPLEX load via a suitably defined plex specific H5
>>>>>>> mesh file.
>>>>>>>
>>>>>>
>>>>>> We have, but these tests looked like generated meshes.
>>>>>>
>>>>>
>>>>> Great.
>>>>>
>>>>> So would a solution to the problem be to have the user modify their
>>>>> code in the following way:
>>>>> * they move the mesh-generation stage into a separate executable, which
>>>>> they run offline (on a fat node with lots of memory), and dump the
>>>>> appropriate file;
>>>>> * they change their existing application to simply load that file in
>>>>> parallel.
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>
>>>>> If there were examples illustrating how to create a file that can
>>>>> be loaded in parallel, I think it would be very helpful for this user
>>>>> (and many others).
>>>>>
>>>>
>>>> I think Vaclav is going to add his examples as soon as we fix this
>>>> parallel interpolation bug. I am praying for time in the latter
>>>> part of October to do this.
>>>>
>>>
>>>
>>> Excellent news - thanks for the update and info.
>>>
>>> Cheers
>>> Dave
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Cheers
>>>>> Dave
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>>> Since it looks like you are okay with fairly regular meshes, I would
>>>>>>>> construct the
>>>>>>>> coarsest mesh you can, and then use
>>>>>>>>
>>>>>>>> -dm_refine <k>
>>>>>>>>
>>>>>>>> which is activated by DMSetFromOptions(). Make sure to call it
>>>>>>>> after DMPlexDistribute(). It will regularly
>>>>>>>> refine in parallel and should show good memory scaling as Dave says.
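Back-of-the-envelope arithmetic shows why this helps: after distribution, each rank refines only its own coarse cells, so the per-rank peak is roughly coarse/nprocs times 4^k (in 2D), instead of rank 0 having to hold the entire fine mesh during a serial read. A quick sketch with illustrative numbers only:

```python
def peak_cells_rank0_reading_fine_mesh(coarse_cells, k):
    """Serial read of an already-fine mesh: rank 0 holds all of it."""
    return coarse_cells * 4**k          # 2D: each refinement x4 cells

def peak_cells_per_rank_refining_in_parallel(coarse_cells, nprocs, k):
    """Distribute the coarse mesh first, then refine k times locally."""
    return (coarse_cells // nprocs) * 4**k

coarse, nprocs, k = 8000, 160, 3
serial = peak_cells_rank0_reading_fine_mesh(coarse, k)
parallel = peak_cells_per_rank_refining_in_parallel(coarse, nprocs, k)

assert parallel * nprocs == serial      # same global fine mesh either way
assert parallel == serial // nprocs     # but the peak is spread over ranks
```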
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>> (solver in all scenarios: GMRES with Hypre preconditioner)
>>>>>>>>>>>
>>>>>>>>>>> scenario  no. points  cell type  DMPlex  nprocs  nodes  mem/node GB  rank 0 mem MB  rank 0 PETSc mem MB  runtime s
>>>>>>>>>>> 1               2121  rectangle  no          40      1          200           0.21           41.6
>>>>>>>>>>> 2               8241  rectangle  no          40      1          200           0.59           51.84
>>>>>>>>>>> 3              32481  rectangle  no          40      1          200           1.95           59.1
>>>>>>>>>>> 4             128961  rectangle  no          40      1          200           7.05           89.71
>>>>>>>>>>> 5             513921  rectangle  no          40      1          200          26.76          110.58
>>>>>>>>>>> 6            2051841  rectangle  no          40      1          200         104.21          232.05
>>>>>>>>>>> 7            8199681  rectangle  no          40      1          200         411.26          703.27       140.29
>>>>>>>>>>> 8            8199681  rectangle  no          80      2          200         206.6           387.25        62.04
>>>>>>>>>>> 9            8199681  rectangle  no         160      4          200         104.28          245.3         32.76
>>>>>>>>>>>
>>>>>>>>>>> (solver in all scenarios: GMRES with Hypre preconditioner)
>>>>>>>>>>>
>>>>>>>>>>> scenario  no. points  cell type  DMPlex  nprocs  nodes  mem/node GB  rank 0 mem MB  rank 0 PETSc mem MB  runtime s
>>>>>>>>>>> 10              2121  triangle   yes         40      1          200           0.49           61.78
>>>>>>>>>>> 11             15090  triangle   yes         40      1          200           2.32           96.61
>>>>>>>>>>> 12             59847  triangle   yes         40      1          200           8.28          176.14
>>>>>>>>>>> 13            238568  triangle   yes         40      1          200          31.89          573.73
>>>>>>>>>>> 14            953433  triangle   yes         40      1          200         119.23         2102.54        44.11
>>>>>>>>>>> 15            953433  triangle   yes         80      2          200          72.99         2123.8         24.36
>>>>>>>>>>> 16            953433  triangle   yes        160      4          200          48.65         2076.25        14.87
>>>>>>>>>>>
>>>>>>>>>>> (solver in all scenarios: GMRES with Hypre preconditioner)
>>>>>>>>>>>
>>>>>>>>>>> scenario  no. points  cell type  DMPlex  nprocs     nodes    mem/node GB  rank 0 mem MB  rank 0 PETSc mem MB
>>>>>>>>>>> 17             55770  prism      yes         40         1            200          18.46          219.39
>>>>>>>>>>> 18            749814  prism      yes         40         1            200         149.86         2412.39
>>>>>>>>>>> 19           7000050  prism      yes    40 to 640  1 to 16          200         (out_of_memory)
>>>>>>>>>>> 20           7000050  prism      yes         64         2            480         890.92        17214.41
>>>>>>>>>>>
>>>>>>>>>>> The error information of scenario 19 is shown below:
>>>>>>>>>>>
>>>>>>>>>>> kernel messages produced during job executions:
>>>>>>>>>>> [Oct 9 10:41] mpiexec.hydra invoked oom-killer:
>>>>>>>>>>> gfp_mask=0x200da, order=0, oom_score_adj=0
>>>>>>>>>>> [ +0.010274] mpiexec.hydra cpuset=/ mems_allowed=0-1
>>>>>>>>>>> [ +0.006680] CPU: 2 PID: 144904 Comm: mpiexec.hydra Tainted:
>>>>>>>>>>> G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
>>>>>>>>>>> [ +0.013365] Hardware name: Lenovo ThinkSystem SD530
>>>>>>>>>>> -[7X21CTO1WW]-/-[7X21CTO1WW]-, BIOS -[TEE124N-1.40]- 06/12/2018
>>>>>>>>>>> [ +0.012866] Call Trace:
>>>>>>>>>>> [ +0.003945] [<ffffffffb3313754>] dump_stack+0x19/0x1b
>>>>>>>>>>> [ +0.006995] [<ffffffffb330e91f>] dump_header+0x90/0x229
>>>>>>>>>>> [ +0.007121] [<ffffffffb2cfa982>] ? ktime_get_ts64+0x52/0xf0
>>>>>>>>>>> [ +0.007451] [<ffffffffb2d5141f>] ? delayacct_end+0x8f/0xb0
>>>>>>>>>>> [ +0.007393] [<ffffffffb2d9ac94>] oom_kill_process+0x254/0x3d0
>>>>>>>>>>> [ +0.007592] [<ffffffffb2d9a73d>] ?
>>>>>>>>>>> oom_unkillable_task+0xcd/0x120
>>>>>>>>>>> [ +0.007978] [<ffffffffb2d9a7e6>] ? find_lock_task_mm+0x56/0xc0
>>>>>>>>>>> [ +0.007729] [<ffffffffb2d9b4d6>] out_of_memory+0x4b6/0x4f0
>>>>>>>>>>> [ +0.007358] [<ffffffffb330f423>]
>>>>>>>>>>> __alloc_pages_slowpath+0x5d6/0x724
>>>>>>>>>>> [ +0.008190] [<ffffffffb2da18b5>]
>>>>>>>>>>> __alloc_pages_nodemask+0x405/0x420
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Danyang
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>>> experiments lead.
>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>
>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>>>