[petsc-users] DMPlex memory problem in scaling test

Mark Adams mfadams at lbl.gov
Thu Oct 10 13:09:32 CDT 2019


Now that I think about it, I would assume the partitioning and distribution
can be done with the existing API, the way it is done with matrices.
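To make that concrete, here is a minimal sketch of the kind of call sequence
I mean (the function name is made up, error handling is trimmed, and the
partitioner type is only an example picked up from the options database):

#include <petscdmplex.h>

/* Sketch: partition and distribute an existing Plex, then free the serial
 * mesh so that only the distributed pieces stay resident. */
static PetscErrorCode DistributeAndReplace(DM *dm)
{
  DM               dmDist = NULL;
  PetscPartitioner part;
  PetscErrorCode   ierr;

  PetscFunctionBeginUser;
  ierr = DMPlexGetPartitioner(*dm, &part);CHKERRQ(ierr);
  ierr = PetscPartitionerSetFromOptions(part);CHKERRQ(ierr);    /* e.g. -petscpartitioner_type parmetis */
  ierr = DMPlexDistribute(*dm, 0, NULL, &dmDist);CHKERRQ(ierr); /* overlap 0; migration SF not kept */
  if (dmDist) {                                                 /* non-NULL when running on >1 rank */
    ierr = DMDestroy(dm);CHKERRQ(ierr);                         /* drop the pre-distribution mesh */
    *dm  = dmDist;
  }
  PetscFunctionReturn(0);
}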

I'm still wondering what the H5 format is. I assume it is not built around a
hardwired number of processes for the parallel read, and that the parallel
read is reasonably scalable.

On Thu, Oct 10, 2019 at 12:13 PM Mark Adams <mfadams at lbl.gov> wrote:

> A related question: what is the state of having something like a
> distributed DMPlexCreateFromCellList method? Maybe your H5 efforts would
> cover this. My bone modeling code is old and a pain, but the app's
> specialized serial mesh generator could write an H5 file instead of the
> current FEAP file. Then your reader, SNES, and a large-deformation
> plasticity element in PetscFE could eventually replace my code.
>
> How does your H5 reader work? Is it basically a flat (unpartitioned) file
> that is read in parallel by slicing the cell lists, etc., using file seeks
> or something equivalent, then reconstructing a local graph on each process
> to hand to, say, ParMETIS, and finally completing the distribution with
> that partitioning? (This is what our current code does.)
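> The pattern I am describing looks roughly like the sketch below
> (illustrative only: the flat file layout, header size, and function name
> are made up, and it uses nothing beyond plain MPI-IO):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> /* Each rank reads a contiguous slab of a flat cell list.
>  * Assumed layout: a header of headerBytes, then numCells*numCorners ints. */
> void read_cell_slab(MPI_Comm comm, const char *fname, MPI_Offset headerBytes,
>                     long numCells, int numCorners, int **myCells,
>                     long *myStart, long *myCount)
> {
>   int      rank, size;
>   MPI_File fh;
>
>   MPI_Comm_rank(comm, &rank);
>   MPI_Comm_size(comm, &size);
>   /* Contiguous slab [myStart, myStart+myCount) of cells for this rank */
>   *myStart = (numCells * (long)rank) / size;
>   *myCount = (numCells * (long)(rank + 1)) / size - *myStart;
>   *myCells = (int *)malloc((size_t)(*myCount) * (size_t)numCorners * sizeof(int));
>
>   MPI_File_open(comm, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
>   MPI_File_read_at_all(fh, headerBytes + (MPI_Offset)(*myStart) * numCorners * sizeof(int),
>                        *myCells, (int)(*myCount * numCorners), MPI_INT,
>                        MPI_STATUS_IGNORE);
>   MPI_File_close(&fh);
>   /* ...then build the local (dual) graph from *myCells, hand it to ParMETIS,
>    * and migrate cells/vertices to match the computed partition. */
> }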
>
> Thanks,
> Mark
>
> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
>>
>>
>> On Thu 10. Oct 2019 at 15:15, Matthew Knepley <knepley at gmail.com> wrote:
>>
>>> On Thu, Oct 10, 2019 at 9:10 AM Dave May <dave.mayhem23 at gmail.com>
>>> wrote:
>>>
>>>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu, Oct 10, 2019 at 8:41 AM Dave May <dave.mayhem23 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu 10. Oct 2019 at 14:34, Matthew Knepley <knepley at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Oct 10, 2019 at 8:31 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I have a question regarding the maximum memory usage in a scaling
>>>>>>>>>> test. My code is written in Fortran with support for both
>>>>>>>>>> structured grids (DMDA) and unstructured grids (DMPlex). Memory
>>>>>>>>>> consumption is much larger when DMPlex is used, and this eventually
>>>>>>>>>> causes an out-of-memory failure.
>>>>>>>>>>
>>>>>>>>>> Below are some tests using both structured and unstructured grids.
>>>>>>>>>> The memory consumed by the code is estimated from all allocated
>>>>>>>>>> arrays, and the PETSc memory consumption is estimated with
>>>>>>>>>> PetscMemoryGetMaximumUsage.
>>>>>>>>>>
>>>>>>>>>> I just wonder why the PETSc memory consumption does not decrease as
>>>>>>>>>> the number of processors increases. For the structured grid
>>>>>>>>>> (scenarios 7-9), memory consumption decreases as the number of
>>>>>>>>>> processors increases. However, for the unstructured grid cases
>>>>>>>>>> (scenarios 14-16), the PETSc memory stays essentially unchanged.
>>>>>>>>>> When I run a larger case, the code crashes because it runs out of
>>>>>>>>>> memory. The same case works on another cluster with 480 GB of
>>>>>>>>>> memory per node. Does this make sense?
>>>>>>>>>>
>>>>>>>>> We would need a finer breakdown of where memory is being used. I
>>>>>>>>> did this for a paper:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>>>>>>>>>
>>>>>>>>> If the subdomains are small, the halo sizes can overwhelm the basic
>>>>>>>>> storage. The subdomains look big here, but things are not totally
>>>>>>>>> clear to me. It would be helpful to send the output of -log_view for
>>>>>>>>> each case, since PETSc tries to keep track of allocated memory.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Matt - I'd guess that there is a sequential (non-partitioned) mesh
>>>>>>>> hanging around in memory.
>>>>>>>> Is it possible that he created the Plex object by loading it
>>>>>>>> sequentially (stored and retained in memory, never released) and then
>>>>>>>> distributed it afterwards?
>>>>>>>> This can never happen with the DMDA, and the table is consistent with
>>>>>>>> that. If his DMDA and DMPlex code paths are as close to identical as
>>>>>>>> possible (apart from the DM used), then a sequential mesh held in
>>>>>>>> memory seems the likely cause.
>>>>>>>>
>>>>>>>
>>>>>>> Dang it, Dave is always right.
>>>>>>>
>>>>>>> How to prevent this?
>>>>>>>
>>>>>>
>>>>>> I thought you/Lawrence/Vaclav/others... had developed and provided
>>>>>> support for a parallel DMPlex load via a suitably defined,
>>>>>> Plex-specific H5 mesh file.
>>>>>>
>>>>>
>>>>> We have, but these tests looked like generated meshes.
>>>>>
>>>>
>>>> Great.
>>>>
>>>> So would a solution to the problem be to have the user modify their
>>>> code in the following way:
>>>> * they move the mesh generation stage into a separate executable which
>>>> they run offline (on a fat node with lots of memory) and dump the
>>>> appropriate file;
>>>> * they change their existing application to simply load that file in
>>>> parallel (roughly as sketched below).
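>>>> The split would have the generic shape sketched here (the helper names
>>>> and file name are placeholders, it assumes PETSc built with HDF5, and
>>>> the exact format/options needed for a truly parallel load are what the
>>>> forthcoming examples would pin down):
>>>>
>>>> #include <petscdmplex.h>
>>>> #include <petscviewerhdf5.h>
>>>>
>>>> /* Offline tool (fat node): generate the mesh, then dump it to HDF5. */
>>>> static PetscErrorCode DumpMesh(DM dm, const char fname[])
>>>> {
>>>>   PetscViewer    viewer;
>>>>   PetscErrorCode ierr;
>>>>
>>>>   PetscFunctionBeginUser;
>>>>   ierr = PetscViewerHDF5Open(PetscObjectComm((PetscObject)dm), fname, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
>>>>   ierr = DMView(dm, viewer);CHKERRQ(ierr);
>>>>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>>>>   PetscFunctionReturn(0);
>>>> }
>>>>
>>>> /* Application: create an empty Plex and load the dumped mesh from file. */
>>>> static PetscErrorCode LoadMesh(MPI_Comm comm, const char fname[], DM *dm)
>>>> {
>>>>   PetscViewer    viewer;
>>>>   PetscErrorCode ierr;
>>>>
>>>>   PetscFunctionBeginUser;
>>>>   ierr = DMCreate(comm, dm);CHKERRQ(ierr);
>>>>   ierr = DMSetType(*dm, DMPLEX);CHKERRQ(ierr);
>>>>   ierr = PetscViewerHDF5Open(comm, fname, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
>>>>   ierr = DMLoad(*dm, viewer);CHKERRQ(ierr);
>>>>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>>>>   PetscFunctionReturn(0);
>>>> }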
>>>>
>>>
>>> Yes.
>>>
>>>
>>>> If there were examples illustrating how to create a file which can be
>>>> loaded in parallel, I think it would be very helpful for the user (and
>>>> many others).
>>>>
>>>
>>> I think Vaclav is going to add his examples as soon as we fix this
>>> parallel interpolation bug. I am praying for time in the latter
>>> part of October to do this.
>>>
>>
>>
>> Excellent news - thanks for the update and info.
>>
>> Cheers
>> Dave
>>
>>
>>
>>>   Thanks,
>>>
>>>     Matt
>>>
>>>
>>>> Cheers
>>>> Dave
>>>>
>>>>
>>>>>   Thanks,
>>>>>
>>>>>     Matt
>>>>>
>>>>>
>>>>>>> Since it looks like you are okay with fairly regular meshes, I would
>>>>>>> construct the coarsest mesh you can, and then use
>>>>>>>
>>>>>>>   -dm_refine <k>
>>>>>>>
>>>>>>> which is activated by DMSetFromOptions(). Make sure to call it after
>>>>>>> DMPlexDistribute(). It will refine regularly in parallel and should
>>>>>>> show good memory scaling, as Dave says.
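>>>>>>> In sketch form (the helper name and coarse-mesh source are
>>>>>>> placeholders, and the creation call depends on how the coarse mesh is
>>>>>>> produced and on the PETSc version):
>>>>>>>
>>>>>>> #include <petscdmplex.h>
>>>>>>>
>>>>>>> /* Coarse mesh -> distribute -> options-driven refinement (-dm_refine <k>)
>>>>>>>  * applied to the distributed mesh, so refinement happens in parallel. */
>>>>>>> static PetscErrorCode CoarseDistributeRefine(MPI_Comm comm, const char coarsefile[], DM *dm)
>>>>>>> {
>>>>>>>   DM             dmDist = NULL;
>>>>>>>   PetscErrorCode ierr;
>>>>>>>
>>>>>>>   PetscFunctionBeginUser;
>>>>>>>   ierr = DMPlexCreateFromFile(comm, coarsefile, PETSC_TRUE, dm);CHKERRQ(ierr); /* coarsest mesh */
>>>>>>>   ierr = DMPlexDistribute(*dm, 0, NULL, &dmDist);CHKERRQ(ierr);
>>>>>>>   if (dmDist) {ierr = DMDestroy(dm);CHKERRQ(ierr); *dm = dmDist;}
>>>>>>>   ierr = DMSetFromOptions(*dm);CHKERRQ(ierr); /* picks up -dm_refine <k> after distribution */
>>>>>>>   PetscFunctionReturn(0);
>>>>>>> }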
>>>>>>>
>>>>>>>   Thanks,
>>>>>>>
>>>>>>>      Matt
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>   Thanks,
>>>>>>>>>
>>>>>>>>>      Matt
>>>>>>>>>
>>>>>>>>>> All runs use GMRES with the Hypre preconditioner; a dash means the
>>>>>>>>>> value was not recorded.
>>>>>>>>>>
>>>>>>>>>> scenario  no. points  cell type  DMPlex  nprocs  nodes  mem/node (GB)  rank 0 mem (MB)  rank 0 PETSc mem (MB)  runtime (s)
>>>>>>>>>>  1            2121    rectangle  no          40      1     200              0.21        41.6        -
>>>>>>>>>>  2            8241    rectangle  no          40      1     200              0.59        51.84       -
>>>>>>>>>>  3           32481    rectangle  no          40      1     200              1.95        59.1        -
>>>>>>>>>>  4          128961    rectangle  no          40      1     200              7.05        89.71       -
>>>>>>>>>>  5          513921    rectangle  no          40      1     200             26.76       110.58       -
>>>>>>>>>>  6         2051841    rectangle  no          40      1     200            104.21       232.05       -
>>>>>>>>>>  7         8199681    rectangle  no          40      1     200            411.26       703.27     140.29
>>>>>>>>>>  8         8199681    rectangle  no          80      2     200            206.6        387.25      62.04
>>>>>>>>>>  9         8199681    rectangle  no         160      4     200            104.28       245.3       32.76
>>>>>>>>>>
>>>>>>>>>> 10            2121    triangle   yes         40      1     200              0.49        61.78       -
>>>>>>>>>> 11           15090    triangle   yes         40      1     200              2.32        96.61       -
>>>>>>>>>> 12           59847    triangle   yes         40      1     200              8.28       176.14       -
>>>>>>>>>> 13          238568    triangle   yes         40      1     200             31.89       573.73       -
>>>>>>>>>> 14          953433    triangle   yes         40      1     200            119.23      2102.54      44.11
>>>>>>>>>> 15          953433    triangle   yes         80      2     200             72.99      2123.8       24.36
>>>>>>>>>> 16          953433    triangle   yes        160      4     200             48.65      2076.25      14.87
>>>>>>>>>>
>>>>>>>>>> 17           55770    prism      yes         40      1     200             18.46       219.39       -
>>>>>>>>>> 18          749814    prism      yes         40      1     200            149.86      2412.39       -
>>>>>>>>>> 19         7000050    prism      yes     40-640   1-16     200            out_of_memory
>>>>>>>>>> 20         7000050    prism      yes         64      2     480            890.92     17214.41       -
>>>>>>>>>>
>>>>>>>>>> The error information of scenario 19 is shown below:
>>>>>>>>>>
>>>>>>>>>> kernel messages produced during job executions:
>>>>>>>>>> [Oct 9 10:41] mpiexec.hydra invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
>>>>>>>>>> [  +0.010274] mpiexec.hydra cpuset=/ mems_allowed=0-1
>>>>>>>>>> [  +0.006680] CPU: 2 PID: 144904 Comm: mpiexec.hydra Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
>>>>>>>>>> [  +0.013365] Hardware name: Lenovo ThinkSystem SD530 -[7X21CTO1WW]-/-[7X21CTO1WW]-, BIOS -[TEE124N-1.40]- 06/12/2018
>>>>>>>>>> [  +0.012866] Call Trace:
>>>>>>>>>> [  +0.003945]  [<ffffffffb3313754>] dump_stack+0x19/0x1b
>>>>>>>>>> [  +0.006995]  [<ffffffffb330e91f>] dump_header+0x90/0x229
>>>>>>>>>> [  +0.007121]  [<ffffffffb2cfa982>] ? ktime_get_ts64+0x52/0xf0
>>>>>>>>>> [  +0.007451]  [<ffffffffb2d5141f>] ? delayacct_end+0x8f/0xb0
>>>>>>>>>> [  +0.007393]  [<ffffffffb2d9ac94>] oom_kill_process+0x254/0x3d0
>>>>>>>>>> [  +0.007592]  [<ffffffffb2d9a73d>] ? oom_unkillable_task+0xcd/0x120
>>>>>>>>>> [  +0.007978]  [<ffffffffb2d9a7e6>] ? find_lock_task_mm+0x56/0xc0
>>>>>>>>>> [  +0.007729]  [<ffffffffb2d9b4d6>] out_of_memory+0x4b6/0x4f0
>>>>>>>>>> [  +0.007358]  [<ffffffffb330f423>] __alloc_pages_slowpath+0x5d6/0x724
>>>>>>>>>> [  +0.008190]  [<ffffffffb2da18b5>] __alloc_pages_nodemask+0x405/0x420
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Danyang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>> experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>>
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>>