[petsc-dev] Scaling test with ex13 (snes)

Sat Oct 3 12:48:55 CDT 2020

>     There is a MATPARTITIONINGHIERARCH partitioner (see its man page) that
> Fande provided, which significantly helped scale up the problems he was
> working on.
>
>    Barry
>

The scaling issue with DMPlex is the one-to-all communication pattern
that occurs when distributing an originally sequential mesh.
MATPARTITIONINGHIERARCH won't fix that issue.
To get reasonable performance when distributing a sequential mesh over a
large number of processes, you need at least two stages of partitioning:
first partition the sequential mesh across one process per node and
migrate the Plex data, then partition within each node separately and
migrate the data again.
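
As a rough illustration of the two-stage idea (this is not PETSc code, and
not the partitioner from our private repo), here is a minimal Python sketch.
It assumes a "compact" rank layout where rank = node * procs_per_node +
local index, uses contiguous chunks as a stand-in for a real graph
partitioner, and only counts messages instead of moving mesh data. The
function name and its parameters are hypothetical.

```python
def two_stage_partition(num_cells, num_nodes, procs_per_node):
    """Assign cells to ranks in two stages.

    Returns (owner, root_sends, leader_sends), where owner maps each cell
    to its final rank. Hypothetical helper, for illustration only.
    """
    def split(items, parts):
        # Contiguous, near-equal chunks; stand-in for a real partitioner.
        base, extra = divmod(len(items), parts)
        chunks, start = [], 0
        for p in range(parts):
            size = base + (1 if p < extra else 0)
            chunks.append(items[start:start + size])
            start += size
        return chunks

    cells = list(range(num_cells))
    owner = {}

    # Stage 1: rank 0 splits the sequential mesh across one leader per node.
    node_chunks = split(cells, num_nodes)
    root_sends = num_nodes - 1  # rank 0 is itself the leader of node 0

    # Stage 2: each leader splits its chunk among the ranks on its own node.
    # Assumes compact placement: rank = node * procs_per_node + local.
    for node, chunk in enumerate(node_chunks):
        for local, sub in enumerate(split(chunk, procs_per_node)):
            rank = node * procs_per_node + local
            for cell in sub:
                owner[cell] = rank
    leader_sends = procs_per_node - 1  # per leader, all within one node

    return owner, root_sends, leader_sends
```

In this toy model, for 4 nodes with 4 processes each, rank 0 sends to 3
other leaders instead of 15 other ranks, and every remaining transfer stays
inside a node. Real savings depend on the mesh data volume and the network.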

>
> On Oct 3, 2020, at 10:04 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Sat, Oct 3, 2020 at 10:51 AM Stefano Zampini <stefano.zampini at gmail.com>
> wrote:
>
>>
>>
>>
>>> Secondly, I'd like to add a multilevel "simple" partitioning in DMPlex
>>> to optimize communication. I am thinking that I can create a mesh with
>>> 'nnodes' cells and distribute that to 'nnodes*procs_node' processes with a
>>> "spread" distribution (the default seems to be "compact"). Then refine
>>> that enough to get 'procs_node' times more cells and then use a simple
>>> partitioner again to put one cell on each process, in such a way that
>>> locality is preserved (not sure how that would work). Then refine from
>>> there on each proc for a scaling study.
>>>
>>>
>> Mark
>>
>> For multilevel partitioning, you need custom code, since what kills
>> performance with one-to-all patterns in DMPlex is the actual communication
>> of the mesh data.
>> However, you can always generate a mesh with one cell per process, and
>> then refine from there.
>>
>> I have coded a multilevel partitioner that works quite well for
>> general meshes; we have it in a private repo with Lisandro. In my
>> experience, the benefits of the multilevel scheme start to show at around
>> 4K processes. If you plan very large runs (say > 32K cores), then you
>> definitely want a multistage scheme.
>>
>> We never contributed the code since it requires some boilerplate code to
>> run through the stages of the partitioning and move the data.
>> If you are using hexahedra, you can always define your own "shell"
>> partitioner that produces box decompositions.
>>
>
> I could integrate it if you want to stop maintaining it there :) It sounds
> really useful.
>
>   Thanks,
>
>      Matt
>
>
>> Another option is to generate the meshes up front sequentially, and then
>> use the parallel HDF5 reader that Vaclav and Matt put together.
>>
>>
>>> The point here is to get communication patterns that look like those of
>>> an (idealized) well-partitioned application. (I suppose I could take an
>>> array of factors, the product of which is the number of processors, and
>>> generalize this in a loop for any number of memory levels, or make an
>>> octree.)
>>>
>>> Any thoughts?
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>
>> --
>> Stefano
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
>
>

-- 
Stefano