[petsc-users] DMPlex memory problem in scaling test
Danyang Su
danyang.su at gmail.com
Thu Oct 10 18:53:10 CDT 2019
On 2019-10-10 4:28 p.m., Matthew Knepley wrote:
> On Thu, Oct 10, 2019 at 4:26 PM Danyang Su <danyang.su at gmail.com> wrote:
>
> Hi All,
>
> Your guess is right. The memory problem occurs after
> DMPlexCreateFromCellList and DMPlexDistribute. The mesh related
> memory in the master processor is not released after that.
>
> The pseudo code I use is
>
>     if (rank == 0) then   ! only the master processor reads the mesh file
>                           ! and creates the cell list
>
>       ! use Petsc_True to create intermediate mesh entities (faces, edges);
>       ! this does not work for prisms in the current 3.8 version
>       call DMPlexCreateFromCellList(Petsc_Comm_World, ndim, num_cells,    &
>                                     num_nodes, num_nodes_per_cell,        &
>                                     Petsc_False, dmplex_cells, ndim,      &
>                                     dmplex_verts, dmda_flow%da, ierr)
>       CHKERRQ(ierr)
>
>     else   ! slave processors pass zero cells and zero vertices
>
>       call DMPlexCreateFromCellList(Petsc_Comm_World, ndim, 0, 0,         &
>                                     num_nodes_per_cell,                   &
>                                     Petsc_False, dmplex_cells, ndim,      &
>                                     dmplex_verts, dmda_flow%da, ierr)
>       CHKERRQ(ierr)
>
>     end if
>
>     call DMPlexDistribute
>
>     call DMDestroy(dmda_flow%da, ierr)
>     CHKERRQ(ierr)
>
>     ! set the distributed mesh as the global mesh
>     dmda_flow%da = distributedMesh
>
>
> After calling the above functions, the memory usage for the test case
> (no. points 953,433, nprocs 160) is shown below:
>
>     rank 0     PETSc memory current MB  1610.39    PETSc memory maximum MB  1690.42
>     rank 151   PETSc memory current MB   105.00    PETSc memory maximum MB   104.94
>     rank 98    PETSc memory current MB   106.02    PETSc memory maximum MB   105.95
>     rank 18    PETSc memory current MB   106.17    PETSc memory maximum MB   106.17
>
> Is there any function available in the master version that can
> release this memory?
>
> DMDestroy() releases this memory, UNLESS you are holding other objects
> that refer to it, like a vector from that DM.
Well, I have some labels set before distribution. After distribution,
the label values are carried over to the distributed mesh, but the original
labels are not destroyed. I will try this to see if it makes a big difference.
Thanks,
danyang
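
For reference, here is a minimal sketch of the pattern discussed above
(create on rank 0, distribute, then destroy the serial DM). The names
dmSerial, dmDist and migrationSF are placeholders, the mesh arrays are
assumed to come from the existing reading code (with zero counts on all
ranks except rank 0), and it assumes that labels attached to the serial DM
are migrated by DMPlexDistribute and that no other object still references
the serial DM:

#include <petsc/finclude/petscdmplex.h>
      use petscdmplex

      DM             :: dmSerial, dmDist
      PetscSF        :: migrationSF
      PetscInt       :: overlap
      PetscErrorCode :: ierr
      ! ndim, num_cells, num_nodes, num_nodes_per_cell, dmplex_cells,
      ! dmplex_verts and dmda_flow are taken from the existing code

      ! every rank calls the creation routine; only rank 0 passes real
      ! cells/vertices, the others pass zero counts (as in the pseudo code)
      call DMPlexCreateFromCellList(Petsc_Comm_World, ndim, num_cells,      &
                                    num_nodes, num_nodes_per_cell,          &
                                    Petsc_False, dmplex_cells, ndim,        &
                                    dmplex_verts, dmSerial, ierr)
      CHKERRQ(ierr)

      ! any labels set on dmSerial at this point should be carried over
      ! to dmDist by DMPlexDistribute

      overlap = 0
      call DMPlexDistribute(dmSerial, overlap, migrationSF, dmDist, ierr)
      CHKERRQ(ierr)

      ! drop everything that still refers to the serial mesh so that
      ! DMDestroy can actually free it on rank 0
      call PetscSFDestroy(migrationSF, ierr)
      CHKERRQ(ierr)
      call DMDestroy(dmSerial, ierr)
      CHKERRQ(ierr)

      dmda_flow%da = dmDist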
>
> Thanks,
>
> Matt
>
> Thanks,
>
> Danyang
>
> On 2019-10-10 11:09 a.m., Mark Adams via petsc-users wrote:
>> Now that I think about it, the partitioning and distribution can
>> be done with existing API, I would assume, like is done with
>> matrices.
>>
>> I'm still wondering what the H5 format is. I assume that it is
>> not built for a hardwired number of processes to read in parallel
>> and that the parallel read is somewhat scalable.
>>
>> On Thu, Oct 10, 2019 at 12:13 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>> A related question: what is the state of having something like a
>> distributed DMPlexCreateFromCellList method? Maybe your H5 efforts
>> would cover this. My bone modeling code is old and a pain, but the
>> app's specialized serial mesh generator could write an H5 file
>> instead of the current FEAP file. Then your reader, SNES, and a
>> large-deformation plasticity element in PetscFE could replace my
>> code in the future.
>>
>> How does your H5 thing work? Is it basically a flat file (not
>> partitioned) that is read in parallel by slicing the cell lists,
>> etc., using file seek or something equivalent, then reconstructing a
>> local graph on each processor to give to, say, ParMETIS, and finally
>> completing the distribution with this reasonable partitioning?
>> (This is what our current code does.)
>>
>> Thanks,
>> Mark
>>
>> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>
>> On Thu 10. Oct 2019 at 15:15, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Thu, Oct 10, 2019 at 9:10 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>>
>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Thu, Oct 10, 2019 at 8:41 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>>
>> On Thu 10. Oct 2019 at 14:34, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Thu, Oct 10, 2019 at 8:31 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>>
>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>
>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>
>> Dear All,
>>
>> I have a question regarding the maximum memory usage for the scaling
>> test. My code is written in Fortran with support for both structured
>> grids (DM) and unstructured grids (DMPlex). It looks like memory
>> consumption is much larger when DMPlex is used, and this finally
>> causes an out-of-memory problem.
>>
>> Below are some tests using both structured and unstructured grids.
>> The memory consumption of the code is estimated based on all
>> allocated arrays, and the PETSc memory consumption is estimated
>> based on PetscMemoryGetMaximumUsage.
>>
>> I just wonder why the PETSc memory consumption does not decrease
>> when the number of processors increases. For the structured-grid
>> cases (scenarios 7-9), the memory consumption decreases as the
>> number of processors increases. However, for the unstructured-grid
>> cases (scenarios 14-16), the memory for the PETSc part remains
>> unchanged. When I run a larger case, the code crashes because it
>> runs out of memory. The same case works on another cluster with
>> 480 GB of memory per node. Does this make sense?
>>
>> We would need a finer breakdown of where memory is being used. I did
>> this for a paper:
>>
>> https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>>
>> If the subdomains are small, the halo sizes can overwhelm the basic
>> storage. It looks like the subdomains are big here, but things are
>> not totally clear to me. It would be helpful to send the output of
>> -log_view for each case, since PETSc tries to keep track of
>> allocated memory.
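>>
>> For illustration, a small sketch of how per-rank numbers like those in
>> the table could be gathered alongside -log_view; this is only a guess
>> at the reporting used here, the variable names are made up, and
>> PetscMemoryGetMaximumUsage needs the maximum-usage tracking switched
>> on early (e.g. PetscMemorySetGetMaximumUsage() right after
>> PetscInitialize):
>>
>> #include <petsc/finclude/petscsys.h>
>>       use petscsys
>>
>>       PetscLogDouble :: mem_cur, mem_max, malloc_cur
>>       PetscMPIInt    :: rank
>>       PetscErrorCode :: ierr
>>
>>       call MPI_Comm_rank(PETSC_COMM_WORLD, rank, ierr)
>>
>>       ! total resident memory of this process right now
>>       call PetscMemoryGetCurrentUsage(mem_cur, ierr)
>>       CHKERRQ(ierr)
>>       ! high-water mark of the resident memory
>>       call PetscMemoryGetMaximumUsage(mem_max, ierr)
>>       CHKERRQ(ierr)
>>       ! memory obtained through PetscMalloc only, i.e. the
>>       ! allocations PETSc itself tracks
>>       call PetscMallocGetCurrentUsage(malloc_cur, ierr)
>>       CHKERRQ(ierr)
>>
>>       write(*,'(a,i6,3(a,f12.2))') 'rank', rank,                    &
>>            '  rss current MB ', mem_cur/1048576.d0,                 &
>>            '  rss maximum MB ', mem_max/1048576.d0,                 &
>>            '  PetscMalloc MB ', malloc_cur/1048576.d0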
>>
>>
>> Matt - I'd guess that there is a sequential (non-partitioned) mesh
>> hanging around in memory. Is it possible that he's created the PLEX
>> object, which is loaded sequentially (stored and retained in memory
>> and never released), and then afterwards distributed? This can never
>> happen with the DMDA, and the table verifies this. If his code using
>> the DMDA and DMPLEX is as identical as possible (apart from the DM
>> used), then a sequential mesh held in memory seems the likely cause.
>>
>>
>> Dang it, Dave is always right.
>>
>> How to prevent this?
>>
>>
>> I thought you/Lawrence/Vaclav/others...
>> had developed and provided support for a
>> parallel DMPLEX load via a suitably
>> defined plex specific H5 mesh file.
>>
>>
>> We have, but these tests looked like
>> generated meshes.
>>
>>
>> Great.
>>
>> So would a solution to the problem be to have the user modify their
>> code in the following way:
>> * they move the mesh generation stage into a separate executable
>>   which they call offline (on a fat node with lots of memory), and
>>   dump the appropriate file;
>> * they change their existing application to simply load that file in
>>   parallel.
>>
>>
>> Yes.
>>
>> If there were examples illustrating how to create the file which can
>> be loaded in parallel, I think it would be very helpful for the user
>> (and many others).
>>
>>
>> I think Vaclav is going to add his examples as soon
>> as we fix this parallel interpolation bug. I am
>> praying for time in the latter
>> part of October to do this.
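>>
>> In the meantime, a rough sketch of what the offline-dump /
>> parallel-load workflow could look like with the HDF5 viewer; this
>> assumes a PETSc build with HDF5, the file name mesh.h5 and the DM
>> names are made up, and whether the load itself is truly parallel
>> depends on the PETSc version (which is exactly what the upcoming
>> examples should clarify):
>>
>> #include <petsc/finclude/petscviewer.h>
>> #include <petsc/finclude/petscdmplex.h>
>>       use petscdmplex
>>
>>       DM             :: dmSerial, dm
>>       PetscViewer    :: viewer
>>       PetscErrorCode :: ierr
>>       ! error checking (CHKERRQ) omitted for brevity
>>
>>       ! offline tool (run once on a fat node): build the Plex as
>>       ! today, then dump it to HDF5
>>       call PetscViewerHDF5Open(PETSC_COMM_SELF, 'mesh.h5',          &
>>                                FILE_MODE_WRITE, viewer, ierr)
>>       call DMView(dmSerial, viewer, ierr)
>>       call PetscViewerDestroy(viewer, ierr)
>>
>>       ! application (run on many ranks): load the mesh from that file
>>       call DMCreate(PETSC_COMM_WORLD, dm, ierr)
>>       call DMSetType(dm, DMPLEX, ierr)
>>       call PetscViewerHDF5Open(PETSC_COMM_WORLD, 'mesh.h5',         &
>>                                FILE_MODE_READ, viewer, ierr)
>>       call DMLoad(dm, viewer, ierr)
>>       call PetscViewerDestroy(viewer, ierr)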
>>
>>
>>
>> Excellent news - thanks for the update and info.
>>
>> Cheers
>> Dave
>>
>>
>>
>> Thanks,
>>
>> Matt
>>
>> Cheers
>> Dave
>>
>>
>> Thanks,
>>
>> Matt
>>
>> Since it looks like you are okay with fairly regular meshes, I would
>> construct the coarsest mesh you can, and then use
>>
>>   -dm_refine <k>
>>
>> which is activated by DMSetFromOptions(). Make sure to call it after
>> DMPlexDistribute(). It will regularly refine in parallel and should
>> show good memory scaling, as Dave says.
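>>
>> A minimal sketch of that ordering, with placeholder names (dmCoarse,
>> dmDist) and error checking omitted:
>>
>>       DM             :: dmCoarse, dmDist
>>       PetscSF        :: sf
>>       PetscInt       :: overlap
>>       PetscErrorCode :: ierr
>>
>>       ! distribute the small coarse mesh first
>>       overlap = 0
>>       call DMPlexDistribute(dmCoarse, overlap, sf, dmDist, ierr)
>>       call PetscSFDestroy(sf, ierr)
>>       call DMDestroy(dmCoarse, ierr)
>>
>>       ! called after distribution, so -dm_refine <k> refines the
>>       ! already-distributed mesh in parallel
>>       call DMSetFromOptions(dmDist, ierr)
>>
>>       ! run with, e.g.:  mpiexec -n 160 ./app -dm_refine 3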
>>
>> Thanks,
>>
>> Matt
>>
>>
>> Thanks,
>>
>> Matt
>>
>> All cases use GMRES with the Hypre preconditioner. Memory columns are
>> in MB, memory per node in GB, runtime in seconds.
>>
>> scenario  no. points  cell type  DMPlex  nprocs  nodes  mem/node  rank 0 memory  rank 0 PETSc memory  runtime
>>    1          2121    rectangle    no       40      1      200        0.21           41.6
>>    2          8241    rectangle    no       40      1      200        0.59           51.84
>>    3         32481    rectangle    no       40      1      200        1.95           59.1
>>    4        128961    rectangle    no       40      1      200        7.05           89.71
>>    5        513921    rectangle    no       40      1      200       26.76          110.58
>>    6       2051841    rectangle    no       40      1      200      104.21          232.05
>>    7       8199681    rectangle    no       40      1      200      411.26          703.27            140.29
>>    8       8199681    rectangle    no       80      2      200      206.6           387.25             62.04
>>    9       8199681    rectangle    no      160      4      200      104.28          245.3              32.76
>>
>>   10          2121    triangle    yes       40      1      200        0.49           61.78
>>   11         15090    triangle    yes       40      1      200        2.32           96.61
>>   12         59847    triangle    yes       40      1      200        8.28          176.14
>>   13        238568    triangle    yes       40      1      200       31.89          573.73
>>   14        953433    triangle    yes       40      1      200      119.23         2102.54             44.11
>>   15        953433    triangle    yes       80      2      200       72.99         2123.8              24.36
>>   16        953433    triangle    yes      160      4      200       48.65         2076.25             14.87
>>
>>   17         55770    prism       yes       40      1      200       18.46          219.39
>>   18        749814    prism       yes       40      1      200      149.86         2412.39
>>   19       7000050    prism       yes   40-640   1-16      200      out_of_memory
>>   20       7000050    prism       yes       64      2      480      890.92        17214.41
>>
>> The error information of scenario 19 is shown below:
>>
>> kernel messages produced during job execution:
>> [Oct 9 10:41] mpiexec.hydra invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
>> [ +0.010274] mpiexec.hydra cpuset=/ mems_allowed=0-1
>> [ +0.006680] CPU: 2 PID: 144904 Comm: mpiexec.hydra Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
>> [ +0.013365] Hardware name: Lenovo ThinkSystem SD530 -[7X21CTO1WW]-/-[7X21CTO1WW]-, BIOS -[TEE124N-1.40]- 06/12/2018
>> [ +0.012866] Call Trace:
>> [ +0.003945] [<ffffffffb3313754>] dump_stack+0x19/0x1b
>> [ +0.006995] [<ffffffffb330e91f>] dump_header+0x90/0x229
>> [ +0.007121] [<ffffffffb2cfa982>] ? ktime_get_ts64+0x52/0xf0
>> [ +0.007451] [<ffffffffb2d5141f>] ? delayacct_end+0x8f/0xb0
>> [ +0.007393] [<ffffffffb2d9ac94>] oom_kill_process+0x254/0x3d0
>> [ +0.007592] [<ffffffffb2d9a73d>] ? oom_unkillable_task+0xcd/0x120
>> [ +0.007978] [<ffffffffb2d9a7e6>] ? find_lock_task_mm+0x56/0xc0
>> [ +0.007729] [<ffffffffb2d9b4d6>] out_of_memory+0x4b6/0x4f0
>> [ +0.007358] [<ffffffffb330f423>] __alloc_pages_slowpath+0x5d6/0x724
>> [ +0.008190] [<ffffffffb2da18b5>] __alloc_pages_nodemask+0x405/0x420
>>
>> Thanks,
>>
>> Danyang
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/