<div dir="ltr">I observed poor scaling with mat/tests/ex13 on Fugaku recently. <div>I was running this test as is (eg, no threads and 4 MPI processes per node/chip, which seems recomended). I did not dig into this.<div><div>A test with about 10% of the machine took about 45 minutes to run.<div><div>Mark</div></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Mar 6, 2021 at 9:49 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Mar 6, 2021 at 12:27 PM Matthew Knepley <<a href="mailto:knepley@buffalo.edu" target="_blank">knepley@buffalo.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Fri, Mar 5, 2021 at 4:06 PM Alexei Colin <<a href="mailto:acolin@isi.edu" target="_blank">acolin@isi.edu</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:<br>

<br>

Is it expected for mesh distribution step to<br>

(A) take a share of 50-99% of total time-to-solution of an FEM problem, and<br></blockquote><div><br></div><div>No</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

(B) take an amount of time that increases with the number of ranks, and<br></blockquote><div><br></div><div>See below.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

(C) take an amount of memory on rank 0 that does not decrease with the<br>

number of ranks<br></blockquote><div><br></div><div>The problem here is that a serial mesh is being partitioned and sent to all processes. This is fundamentally</div><div>non-scalable, but it is easy and works well for modest clusters < 100 nodes or so. Above this, it will take</div><div>increasing amounts of time. There are a few techniques for mitigating this.</div></div></div></blockquote><div>Is this one-to-all communication only done once?  If yes, one MPI_Scatterv() is enough and should not cost much.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div>a) For simple domains, you can distribute a coarse grid, then regularly refine that in parallel with DMRefine() or -dm_refine <k>.</div><div>    These steps can be repeated easily, and redistribution in parallel is fast, as shown for example in [1].</div><div><br></div><div>b) For complex meshes, you can read them in parallel, and then repeat a). This is done in [1]. It is a little more involved,</div><div>    but not much.</div><div><br></div><div>c) You can do a multilevel partitioning, as they do in [2]. I cannot find the paper in which they describe this right now. It is feasible,</div><div>     but definitely the most expert approach.</div><div><br></div><div>Does this make sense?</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div><br></div><div>[1]  Fully Parallel Mesh I/O using PETSc DMPlex with an Application to Waveform Modeling, Hapla <a href="http://et.al" target="_blank">et.al</a>.</div><div>      <a href="https://arxiv.org/abs/2004.08729" target="_blank">https://arxiv.org/abs/2004.08729</a></div><div>[2] On the robustness and performance of entropy stable discontinuous collocation methods for the compressible Navier-Stokes equations, ROjas .<a href="http://et.al" target="_blank">et.al</a>.</div><div>      <a href="https://arxiv.org/abs/1911.10966" target="_blank">https://arxiv.org/abs/1911.10966</a></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

?<br>

<br>

The attached plots suggest (A), (B), and (C) is happening for<br>

Cahn-Hilliard problem (from firedrake-bench repo) on a 2D 8Kx8K<br>

unit-square mesh. The implementation is here [1]. Versions are<br>

Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.<br>

<br>

Two questions, one on (A) and the other on (B)+(C):<br>

<br>

1. Is (A) result expected? Given (A), any effort to improve the quality<br>

of the compiled assembly kernels (or anything else other than mesh<br>

distribution) appears futile since it takes 1% of end-to-end execution<br>

time, or am I missing something?<br>

<br>

1a. Is mesh distribution fundamentally necessary for any FEM framework,<br>

or is it only needed by Firedrake? If latter, then how do other<br>

frameworks partition the mesh and execute in parallel with MPI but avoid<br>

the non-scalable mesh destribution step?<br>

<br>

2. Results (B) and (C) suggest that the mesh distribution step does<br>

not scale. Is it a fundamental property of the mesh distribution problem<br>

that it has a central bottleneck in the master process, or is it<br>

a limitation of the current implementation in PETSc-DMPlex?<br>

<br>

2a. Our (B) result seems to agree with Figure 4(left) of [2]. Fig 6 of [2]<br>

suggests a way to reduce the time spent on sequential bottleneck by<br>

"parallel mesh refinment" that creates high-resolution meshes from an<br>

initial coarse mesh. Is this approach implemented in DMPLex?  If so, any<br>

pointers on how to try it out with Firedrake? If not, any other<br>

directions for reducing this bottleneck?<br>

<br>

2b. Fig 6 in [3] shows plots for Assembly and Solve steps that scale well up<br>

to 96 cores -- is mesh distribution included in those times?  Is anyone<br>

reading this aware of any other publications with evaluations of<br>

Firedrake that measure mesh distribution (or explain how to avoid or<br>

exclude it)?<br>

<br>

Thank you for your time and any info or tips.<br>

<br>

<br>

[1] <a href="https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py" rel="noreferrer" target="_blank">https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py</a><br>

<br>

[2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G.<br>

Knepley, Michael Lange, Gerard J. Gorman, 2015.<br>

<a href="https://arxiv.org/pdf/1506.06194.pdf" rel="noreferrer" target="_blank">https://arxiv.org/pdf/1506.06194.pdf</a><br>

<br>

[3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael<br>

Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman, SISC,<br>

38(5), S143-S155, 2016. <a href="http://arxiv.org/abs/1506.07749" rel="noreferrer" target="_blank">http://arxiv.org/abs/1506.07749</a><br>

</blockquote></div></div>

</blockquote></div></div>

</blockquote></div>