[petsc-users] DMPlex memory problem in scaling test
Matthew Knepley
knepley at gmail.com
Thu Oct 10 19:44:00 CDT 2019
On Thu, Oct 10, 2019 at 7:53 PM Danyang Su <danyang.su at gmail.com> wrote:
> On 2019-10-10 4:28 p.m., Matthew Knepley wrote:
>
> On Thu, Oct 10, 2019 at 4:26 PM Danyang Su <danyang.su at gmail.com> wrote:
>
>> Hi All,
>>
>> Your guess is right. The memory problem occurs after
>> DMPlexCreateFromCellList and DMPlexDistribute. The mesh related memory in
>> the master processor is not released after that.
>>
>> The pseudo code I use is
>>
>> if (rank == 0) then
>>   !only the master processor reads the mesh file and creates the cell list
>>
>>   !use Petsc_True to create intermediate mesh entities (faces, edges),
>>   !which does not work for prisms in the current 3.8 version.
>>   call DMPlexCreateFromCellList(Petsc_Comm_World,ndim,num_cells, &
>>                                 num_nodes,num_nodes_per_cell, &
>>                                 Petsc_False,dmplex_cells,ndim, &
>>                                 dmplex_verts,dmda_flow%da,ierr)
>>   CHKERRQ(ierr)
>>
>> else
>>   !slave processors pass zero cells
>>
>>   !use Petsc_True to create intermediate mesh entities (faces, edges),
>>   !which does not work for prisms in the current 3.8 version.
>>   call DMPlexCreateFromCellList(Petsc_Comm_World,ndim,0,0, &
>>                                 num_nodes_per_cell, &
>>                                 Petsc_False,dmplex_cells,ndim, &
>>                                 dmplex_verts,dmda_flow%da,ierr)
>>   CHKERRQ(ierr)
>>
>> end if
>>
>> call DMPlexDistribute
>>
>> call DMDestroy(dmda_flow%da,ierr)
>> CHKERRQ(ierr)
>>
>> !c set the global mesh as distributed mesh
>> dmda_flow%da = distributedMesh
>>
>>
>> After calling the above functions, the memory usage for the test case
>> (no. points 953,433, nprocs 160) is shown below:
>> rank 0   PETSc memory current MB 1610.39 PETSc memory maximum MB 1690.42
>> rank 151 PETSc memory current MB  105.00 PETSc memory maximum MB  104.94
>> rank 98  PETSc memory current MB  106.02 PETSc memory maximum MB  105.95
>> rank 18  PETSc memory current MB  106.17 PETSc memory maximum MB  106.17
>>
>> Is there any function available in the master version that can release
>> this memory?
>>
> DMDestroy() releases this memory, UNLESS you are holding other objects
> that refer to it, like a vector from that DM.
>
> Well, I have some labels set before distribution. After distribution, the
> labels values are collected but not destroyed. I will try this to see if it
> makes big difference.
>
Labels should be destroyed with the DM. Just make a small code that does
nothing but distribute the mesh and end. If you run with -malloc_test you
should see if everything is destroyed properly.
Thanks,
Matt
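
As a rough check on the figures reported in this thread, the following Python sketch compares how rank 0 PETSc memory changes with process count for the structured case (scenarios 7-9) and the unstructured case (scenarios 14-16), and how rank 0 compares with the other ranks after distribution. It is only an illustration: the numbers are copied from the messages in this thread, while the function and variable names are made up here and are not part of PETSc or of the original code.

```python
# Editorial sketch: names below are illustrative, not PETSc API.
# All figures are copied from the messages in this thread (MB).


def relative_memory(mem_by_nprocs):
    """Memory at each process count, relative to the smallest process count."""
    base_procs = min(mem_by_nprocs)
    base_mem = mem_by_nprocs[base_procs]
    return {p: mem_by_nprocs[p] / base_mem for p in sorted(mem_by_nprocs)}


def ideal_relative_memory(mem_by_nprocs):
    """Relative memory if it shrank in proportion to 1/nprocs."""
    base_procs = min(mem_by_nprocs)
    return {p: base_procs / p for p in sorted(mem_by_nprocs)}


# Rank 0 PETSc memory, keyed by number of processes.
structured = {40: 703.27, 80: 387.25, 160: 245.3}       # scenarios 7-9
unstructured = {40: 2102.54, 80: 2123.8, 160: 2076.25}  # scenarios 14-16

# Current PETSc memory after DMPlexDistribute on 160 processes.
rank0_current = 1610.39
other_ranks_current = [105.00, 106.02, 106.17]  # ranks 151, 98, 18
imbalance = rank0_current / (sum(other_ranks_current) / len(other_ranks_current))

for label, data in (("structured", structured), ("unstructured", unstructured)):
    measured = relative_memory(data)
    ideal = ideal_relative_memory(data)
    for nprocs in measured:
        print(f"{label:12s} nprocs={nprocs:3d} "
              f"measured={measured[nprocs]:.2f} ideal={ideal[nprocs]:.2f}")

print(f"rank 0 / mean of sampled ranks = {imbalance:.1f}")
```

With these inputs the structured case drops to roughly 0.55 and 0.35 of its 40-process footprint, while the unstructured case stays within about 1-2% of 1.0, and rank 0 holds roughly 15 times the memory of the sampled ranks. That pattern is consistent with a serial mesh still being held on rank 0.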
> Thanks,
>
> danyang
>
>
> Thanks,
>
> Matt
>
>> Thanks,
>>
>> Danyang
>> On 2019-10-10 11:09 a.m., Mark Adams via petsc-users wrote:
>>
>> Now that I think about it, the partitioning and distribution can be done
>> with existing API, I would assume, like is done with matrices.
>>
>> I'm still wondering what the H5 format is. I assume that it is not built
>> for a hardwired number of processes to read in parallel and that the
>> parallel read is somewhat scalable.
>>
>> On Thu, Oct 10, 2019 at 12:13 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> A related question: what is the state of having something like a
>>> distributed DMPlexCreateFromCellList method? But maybe your H5 efforts
>>> would work. My bone modeling code is old and a pain, but the app's
>>> specialized serial mesh generator could write an H5 file instead of the
>>> current FEAP file. Then your reader, SNES and a large deformation plasticity
>>> element in PetscFE could replace my code, in the future.
>>>
>>> How does your H5 thing work? Is it basically a flat file (not
>>> partitioned) that is read in in parallel by slicing the cell lists, etc,
>>> using file seek or something equivalent, then reconstructing a local graph
>>> on each processor to give to say Parmetis, then completes the distribution
>>> with this reasonable partitioning? (this is what our current code does)
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Thu 10. Oct 2019 at 15:15, Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu, Oct 10, 2019 at 9:10 AM Dave May <dave.mayhem23 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley <knepley at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Oct 10, 2019 at 8:41 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu 10. Oct 2019 at 14:34, Matthew Knepley <knepley at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Thu, Oct 10, 2019 at 8:31 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear All,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a question regarding the maximum memory usage for the
>>>>>>>>>>>> scaling test. My code is written in Fortran with support for both
>>>>>>>>>>>> structured grid (DM) and unstructured grid (DMPlex). It looks like memory
>>>>>>>>>>>> consumption is much larger when DMPlex is used, and this eventually causes
>>>>>>>>>>>> an out_of_memory problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Below are some tests using both structured and unstructured
>>>>>>>>>>>> grids. The memory consumption of the code is estimated from all
>>>>>>>>>>>> allocated arrays, and the PETSc memory consumption is estimated with
>>>>>>>>>>>> PetscMemoryGetMaximumUsage.
>>>>>>>>>>>>
>>>>>>>>>>>> I just wonder why the PETSc memory consumption does not
>>>>>>>>>>>> decrease when the number of processors increases. For the structured grid
>>>>>>>>>>>> (scenarios 7-9), the memory consumption decreases as the number of processors
>>>>>>>>>>>> increases. However, for the unstructured grid (scenarios 14-16), the memory
>>>>>>>>>>>> for the PETSc part remains unchanged. When I run a larger case, the code
>>>>>>>>>>>> crashes because it runs out of memory. The same case works on another cluster
>>>>>>>>>>>> with 480 GB of memory per node. Does this make sense?
>>>>>>>>>>>>
>>>>>>>>>>> We would need a finer breakdown of where memory is being used. I
>>>>>>>>>>> did this for a paper:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>>>>>>>>>>>
>>>>>>>>>>> If the subdomains are small, the halo sizes can overwhelm the basic
>>>>>>>>>>> storage. It looks like the subdomains are big here,
>>>>>>>>>>> but things are not totally clear to me. It would be helpful to
>>>>>>>>>>> send the output of -log_view for each case since
>>>>>>>>>>> PETSc tries to keep track of allocated memory.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Matt - I'd guess that there is a sequential (non-partitioned)
>>>>>>>>>> mesh hanging around in memory.
>>>>>>>>>> Is it possible that he's created the PLEX object which is loaded
>>>>>>>>>> sequentially (stored and retained in memory and never released), and then
>>>>>>>>>> afterwards distributed?
>>>>>>>>>> This can never happen with the DMDA and the table verifies this.
>>>>>>>>>> If his code using the DMDA and DMPLEX are as identical as
>>>>>>>>>> possible (albeit the DM used), then a sequential mesh held in memory seems
>>>>>>>>>> the likely cause.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dang it, Dave is always right.
>>>>>>>>>
>>>>>>>>> How to prevent this?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I thought you/Lawrence/Vaclav/others... had developed and provided
>>>>>>>> support for a parallel DMPLEX load via a suitably defined plex specific H5
>>>>>>>> mesh file.
>>>>>>>>
>>>>>>>
>>>>>>> We have, but these tests looked like generated meshes.
>>>>>>>
>>>>>>
>>>>>> Great.
>>>>>>
>>>>>> So would a solution to the problem be to have the user modify their
>>>>>> code in the following way:
>>>>>> * they move the mesh gen stage into a separate exec which they call
>>>>>> offline (on a fat node with lots of memory), and dump the appropriate file
>>>>>> * they change their existing application to simply load that file in
>>>>>> parallel.
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>> If there were examples illustrating how to create the file which can
>>>>>> be loaded in parallel I think it would be very helpful for the user (and
>>>>>> many others)
>>>>>>
>>>>>
>>>>> I think Vaclav is going to add his examples as soon as we fix this
>>>>> parallel interpolation bug. I am praying for time in the latter
>>>>> part of October to do this.
>>>>>
>>>>
>>>>
>>>> Excellent news - thanks for the update and info.
>>>>
>>>> Cheers
>>>> Dave
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>>> Cheers
>>>>>> Dave
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>
>>>>>>>>> Since it looks like you are okay with fairly regular meshes, I
>>>>>>>>> would construct the coarsest mesh you can, and then use
>>>>>>>>>
>>>>>>>>> -dm_refine <k>
>>>>>>>>>
>>>>>>>>> which is activated by DMSetFromOptions(). Make sure to call it
>>>>>>>>> after DMPlexDistribute(). It will regularly
>>>>>>>>> refine in parallel and should show good memory scaling as Dave
>>>>>>>>> says.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>> scenario  no. points  cell type  DMPlex  nprocs     no. nodes  mem per node GB  solver                      Rank 0 memory MB  Rank 0 petsc memory MB  Runtime (sec)
>>>>>>>>>>>> 1         2121        rectangle  no      40         1          200              GMRES,Hypre preconditioner  0.21              41.6
>>>>>>>>>>>> 2         8241        rectangle  no      40         1          200              GMRES,Hypre preconditioner  0.59              51.84
>>>>>>>>>>>> 3         32481       rectangle  no      40         1          200              GMRES,Hypre preconditioner  1.95              59.1
>>>>>>>>>>>> 4         128961      rectangle  no      40         1          200              GMRES,Hypre preconditioner  7.05              89.71
>>>>>>>>>>>> 5         513921      rectangle  no      40         1          200              GMRES,Hypre preconditioner  26.76             110.58
>>>>>>>>>>>> 6         2051841     rectangle  no      40         1          200              GMRES,Hypre preconditioner  104.21            232.05
>>>>>>>>>>>> 7         8199681     rectangle  no      40         1          200              GMRES,Hypre preconditioner  411.26            703.27                  140.29
>>>>>>>>>>>> 8         8199681     rectangle  no      80         2          200              GMRES,Hypre preconditioner  206.6             387.25                  62.04
>>>>>>>>>>>> 9         8199681     rectangle  no      160        4          200              GMRES,Hypre preconditioner  104.28            245.3                   32.76
>>>>>>>>>>>>
>>>>>>>>>>>> 10        2121        triangle   yes     40         1          200              GMRES,Hypre preconditioner  0.49              61.78
>>>>>>>>>>>> 11        15090       triangle   yes     40         1          200              GMRES,Hypre preconditioner  2.32              96.61
>>>>>>>>>>>> 12        59847       triangle   yes     40         1          200              GMRES,Hypre preconditioner  8.28              176.14
>>>>>>>>>>>> 13        238568      triangle   yes     40         1          200              GMRES,Hypre preconditioner  31.89             573.73
>>>>>>>>>>>> 14        953433      triangle   yes     40         1          200              GMRES,Hypre preconditioner  119.23            2102.54                 44.11
>>>>>>>>>>>> 15        953433      triangle   yes     80         2          200              GMRES,Hypre preconditioner  72.99             2123.8                  24.36
>>>>>>>>>>>> 16        953433      triangle   yes     160        4          200              GMRES,Hypre preconditioner  48.65             2076.25                 14.87
>>>>>>>>>>>>
>>>>>>>>>>>> 17        55770       prism      yes     40         1          200              GMRES,Hypre preconditioner  18.46             219.39
>>>>>>>>>>>> 18        749814      prism      yes     40         1          200              GMRES,Hypre preconditioner  149.86            2412.39
>>>>>>>>>>>> 19        7000050     prism      yes     40 to 640  1 to 16    200              GMRES,Hypre preconditioner  out_of_memory
>>>>>>>>>>>>
>>>>>>>>>>>> 20        7000050     prism      yes     64         2          480              GMRES,Hypre preconditioner  890.92            17214.41
>>>>>>>>>>>>
>>>>>>>>>>>> The error information of scenario 19 is shown below:
>>>>>>>>>>>>
>>>>>>>>>>>> kernel messages produced during job executions:
>>>>>>>>>>>> [Oct 9 10:41] mpiexec.hydra invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
>>>>>>>>>>>> [ +0.010274] mpiexec.hydra cpuset=/ mems_allowed=0-1
>>>>>>>>>>>> [ +0.006680] CPU: 2 PID: 144904 Comm: mpiexec.hydra Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
>>>>>>>>>>>> [ +0.013365] Hardware name: Lenovo ThinkSystem SD530 -[7X21CTO1WW]-/-[7X21CTO1WW]-, BIOS -[TEE124N-1.40]- 06/12/2018
>>>>>>>>>>>> [ +0.012866] Call Trace:
>>>>>>>>>>>> [ +0.003945] [<ffffffffb3313754>] dump_stack+0x19/0x1b
>>>>>>>>>>>> [ +0.006995] [<ffffffffb330e91f>] dump_header+0x90/0x229
>>>>>>>>>>>> [ +0.007121] [<ffffffffb2cfa982>] ? ktime_get_ts64+0x52/0xf0
>>>>>>>>>>>> [ +0.007451] [<ffffffffb2d5141f>] ? delayacct_end+0x8f/0xb0
>>>>>>>>>>>> [ +0.007393] [<ffffffffb2d9ac94>] oom_kill_process+0x254/0x3d0
>>>>>>>>>>>> [ +0.007592] [<ffffffffb2d9a73d>] ? oom_unkillable_task+0xcd/0x120
>>>>>>>>>>>> [ +0.007978] [<ffffffffb2d9a7e6>] ? find_lock_task_mm+0x56/0xc0
>>>>>>>>>>>> [ +0.007729] [<ffffffffb2d9b4d6>] out_of_memory+0x4b6/0x4f0
>>>>>>>>>>>> [ +0.007358] [<ffffffffb330f423>] __alloc_pages_slowpath+0x5d6/0x724
>>>>>>>>>>>> [ +0.008190] [<ffffffffb2da18b5>] __alloc_pages_nodemask+0x405/0x420
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Danyang
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>>>> experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>
>>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>>>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>