mat/tests/ex13.c creates a sequential AIJ matrix, converts it to the same
format, reorders it, and then prints the matrix and the reordering in ASCII.
Each of these steps is sequential and takes place on each rank. The prints are
ASCII stdout on the ranks.

  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,m*n,m*n,5,NULL,&C);CHKERRQ(ierr);
  /* create the matrix for the five point stencil, YET AGAIN */
  for (i=0; i<m; i++) {
    for (j=0; j<n; j++) {
      v = -1.0;  Ii = j + n*i;
      if (i>0)   {J = Ii - n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
      if (i<m-1) {J = Ii + n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
      if (j>0)   {J = Ii - 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
      if (j<n-1) {J = Ii + 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
      v = 4.0; ierr = MatSetValues(C,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatConvert(C,MATSAME,MAT_INITIAL_MATRIX,&A);CHKERRQ(ierr);

  ierr = MatGetOrdering(A,MATORDERINGND,&perm,&iperm);CHKERRQ(ierr);
  ierr = ISView(perm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
  ierr = ISView(iperm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
  ierr = MatView(A,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);

I think each rank would simply be running the same code and dumping everything
to its own stdout.

At some point within the system/MPI executor there is code that merges and
prints out the stdout of each rank. If the test truly takes 45 minutes, then
Fugaku has a classic bug of not being able to efficiently merge stdout from
the ranks. Nothing really to do with PETSc, just a failure by the Fugaku
developers to respect all aspects of developing an HPC system. Heck, they only
had a billion dollars, can't expect them to do what other scalable systems do :-).

One should be able to reproduce this with a simple MPI program that prints a
moderate amount of data to stdout on each rank.
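For example, a reproducer along these lines should show whether the launcher's
stdout aggregation is the problem (a minimal sketch; the 10,000-line count per
rank and the message text are arbitrary choices):

  #include <mpi.h>
  #include <stdio.h>

  /* Each rank prints a fixed, moderate amount of text to stdout. If the
     MPI launcher's stdout merging does not scale, the wall-clock time of
     this program grows with the number of ranks even though the per-rank
     work is constant. */
  int main(int argc, char **argv)
  {
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < 10000; i++) {
      printf("[%d/%d] line %d: moderately long output to stress stdout merging\n", rank, size, i);
    }
    fflush(stdout);
    MPI_Finalize();
    return 0;
  }

Timing this at a few different rank counts would separate the system's stdout
handling from anything PETSc does.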
  Barry


> On Mar 6, 2021, at 9:46 PM, Mark Adams <mfadams@lbl.gov> wrote:
>
> I observed poor scaling with mat/tests/ex13 on Fugaku recently.
> I was running this test as is (e.g., no threads and 4 MPI processes per
> node/chip, which seems recommended). I did not dig into this.
> A test with about 10% of the machine took about 45 minutes to run.
>
> Mark
>
> On Sat, Mar 6, 2021 at 9:49 PM Junchao Zhang <junchao.zhang@gmail.com> wrote:
>>
>> On Sat, Mar 6, 2021 at 12:27 PM Matthew Knepley <knepley@buffalo.edu> wrote:
>>>
>>> On Fri, Mar 5, 2021 at 4:06 PM Alexei Colin <acolin@isi.edu> wrote:
>>>> To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:
>>>>
>>>> Is it expected for the mesh distribution step to
>>>> (A) take a share of 50-99% of the total time-to-solution of an FEM problem, and

>>> No.
>>>> (B) take an amount of time that increases with the number of ranks, and

>>> See below.
>>>> (C) take an amount of memory on rank 0 that does not decrease with the
>>>> number of ranks

>>> The problem here is that a serial mesh is being partitioned and sent to all
>>> processes. This is fundamentally non-scalable, but it is easy and works well
>>> for modest clusters of < 100 nodes or so. Above this, it will take increasing
>>> amounts of time. There are a few techniques for mitigating this.

>> Is this one-to-all communication only done once? If yes, one MPI_Scatterv()
>> is enough and should not cost much.

>>> a) For simple domains, you can distribute a coarse grid, then regularly refine
>>>    that in parallel with DMRefine() or -dm_refine <k>. These steps can be
>>>    repeated easily, and redistribution in parallel is fast, as shown for
>>>    example in [1].
>>>
>>> b) For complex meshes, you can read them in parallel, and then repeat a).
>>>    This is done in [1]. It is a little more involved, but not much.
>>>
>>> c) You can do a multilevel partitioning, as they do in [2]. I cannot find the
>>>    paper in which they describe this right now. It is feasible, but definitely
>>>    the most expert approach.
>>>
>>> Does this make sense?
>>>
>>>   Thanks,
>>>
>>>     Matt
>>>
>>> [1] Fully Parallel Mesh I/O using PETSc DMPlex with an Application to
>>>     Waveform Modeling, Hapla et al. https://arxiv.org/abs/2004.08729
>>> [2] On the robustness and performance of entropy stable discontinuous
>>>     collocation methods for the compressible Navier-Stokes equations,
>>>     Rojas et al. https://arxiv.org/abs/1911.10966
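For concreteness, approach a) above might look roughly like the following (a
minimal sketch, assuming a simple 2D box domain and the PETSc 3.13-era
signatures of DMPlexCreateBoxMesh()/DMPlexDistribute()/DMRefine(); the coarse
size and refinement count are arbitrary):

  #include <petscdmplex.h>

  int main(int argc, char **argv)
  {
    DM             dm, dmDist, dmRef;
    PetscInt       faces[2] = {64, 64};  /* small coarse grid */
    PetscInt       r, nrefine = 4;       /* each regular refinement ~4x the cells in 2D */
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* Build a small coarse mesh; cheap to create and to partition */
    ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 2, PETSC_TRUE, faces, NULL, NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);
    /* Distribute while the mesh is still coarse, so the one-to-all step stays small */
    ierr = DMPlexDistribute(dm, 0, NULL, &dmDist);CHKERRQ(ierr);
    if (dmDist) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist;}
    /* Refine in parallel; with DMSetFromOptions() the -dm_refine <k> option does the same */
    for (r = 0; r < nrefine; r++) {
      ierr = DMRefine(dm, PETSC_COMM_WORLD, &dmRef);CHKERRQ(ierr);
      ierr = DMDestroy(&dm);CHKERRQ(ierr);
      dm   = dmRef;
    }
    ierr = DMViewFromOptions(dm, NULL, "-dm_view");CHKERRQ(ierr);
    ierr = DMDestroy(&dm);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

Because only the coarse mesh crosses the one-to-all distribution step, its cost
stays small and fixed, and the bulk of the cells are created in parallel by the
refinement.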
>>>> ?
>>>>
>>>> The attached plots suggest (A), (B), and (C) are happening for a
>>>> Cahn-Hilliard problem (from the firedrake-bench repo) on a 2D 8Kx8K
>>>> unit-square mesh. The implementation is here [1]. Versions are
>>>> Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
>>>>
>>>> Two questions, one on (A) and the other on (B)+(C):
>>>>
>>>> 1. Is result (A) expected? Given (A), any effort to improve the quality
>>>> of the compiled assembly kernels (or anything else other than mesh
>>>> distribution) appears futile, since it takes only 1% of end-to-end
>>>> execution time, or am I missing something?
>>>>
>>>> 1a. Is mesh distribution fundamentally necessary for any FEM framework,
>>>> or is it only needed by Firedrake? If the latter, how do other
>>>> frameworks partition the mesh and execute in parallel with MPI but avoid
>>>> the non-scalable mesh distribution step?
>>>>
>>>> 2. Results (B) and (C) suggest that the mesh distribution step does
>>>> not scale. Is it a fundamental property of the mesh distribution problem
>>>> that it has a central bottleneck in the master process, or is it
>>>> a limitation of the current implementation in PETSc-DMPlex?
>>>>
>>>> 2a. Our (B) result seems to agree with Figure 4 (left) of [2]. Fig 6 of [2]
>>>> suggests a way to reduce the time spent on the sequential bottleneck by
>>>> "parallel mesh refinement" that creates high-resolution meshes from an
>>>> initial coarse mesh. Is this approach implemented in DMPlex? If so, any
>>>> pointers on how to try it out with Firedrake? If not, any other
>>>> directions for reducing this bottleneck?
>>>>
>>>> 2b. Fig 6 in [3] shows plots for Assembly and Solve steps that scale well
>>>> up to 96 cores -- is mesh distribution included in those times? Is anyone
>>>> reading this aware of any other publications with evaluations of
>>>> Firedrake that measure mesh distribution (or explain how to avoid or
>>>> exclude it)?
>>>>
>>>> Thank you for your time and any info or tips.
>>>>
>>>>
>>>> [1] https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py
>>>>
>>>> [2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G.
>>>>     Knepley, Michael Lange, Gerard J. Gorman, 2015.
>>>>     https://arxiv.org/pdf/1506.06194.pdf
>>>>
>>>> [3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael
>>>>     Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman,
>>>>     SISC, 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749