<div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><br><br><div><br></div><div>    There is a MATPARTITIONINGHIERARCH (man page) that Fande provided that helped scaling up problems he was working on significantly.</div><div><br></div><div>   Barry</div></div></blockquote><div><br></div><div>The scaling issue with DMPlex is the one-to-all pattern of communication that happens when distributing an original sequential mesh. MATPARTITIONINGHIERARCH won't fix the issue.</div><div>In order to get reasonable performances when distributing a sequential mesh on a large number of processes, you need at least two stages of partitioning: an initial one from the sequential mesh to a mesh with one process per node, migrate the PLEX data, then partition on each node separately, and migrate the data again.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><div><br><blockquote type="cite"><div>On Oct 3, 2020, at 10:04 AM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr">On Sat, Oct 3, 2020 at 10:51 AM Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>Secondly, I'd like to add a multilevel "simple" partitioning in DMPlex to optimize communication. I am thinking that I can create a mesh with 'nnodes' cells and distribute that to 'nnodes*procs_node' processes with a "spread" distribution. (the default seems to be "compact"). Then refine that enough to get 'procs_node' more cells and the use a simple partitioner again to put one cell on each process, in such a way that the locality is preserved (not sure how that would work). Then refine from there on each proc for a scaling study.</div><div><br></div></div></blockquote><div><br></div><div>Mark</div><div><br></div><div>for multilevel partitioning, you need custom code, since what kills performances with one-to-all patterns in DMPlex is the actual communication of the mesh data.</div><div>However, you can always generate a mesh to have one cell per process, and then refine from there.</div><div><br></div><div>I have coded a multilevel partitioner that works quite well for general meshes, we have it in a private repo with Lisandro. From my experience, the benefits of using the multilevel scheme start from 4K processes on. If you plan very large runs (say > 32K cores) then you definitely want a multistage scheme.</div><div><br></div><div>We never contributed the code since it requires some boilerplate code to run through the stages of the partitioning and move the data.</div><div>If you are using hexas, you can always define your own "shell" partitioner producing box decompositions.</div></div></div></blockquote><div><br></div><div>I could integrate it if you want to stop maintaining it there :) It sounds really useful.</div><div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>Another option is to generate the meshes upfront in sequential, and then use the parallel HDF5 reader that Vaclav and Matt put together.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>The point here is to get communication patterns that look like an (idealized) well partition application. (I suppose I could take an array of factors, the product of which is the number of processors, and generalize this in a loop for any number of memory levels, or make an oct-tree).</div><div><br></div><div>Any thoughts?</div><div>Thanks,</div><div>Mark</div><div><br></div><div><br></div></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr">Stefano</div></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>

</div></blockquote></div><br></div></div></blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Stefano</div></div>