Mark Adams mfadams at lbl.gov
Sun Mar 7 07:06:49 CST 2021

Whoops, I meant snes/tests/ex13.c.
This is what I used for the Summit runs that I presented a while ago.

On Sun, Mar 7, 2021 at 6:12 AM Barry Smith <bsmith at petsc.dev> wrote:

>   mat/tests/ex13.c creates a sequential AIJ matrix, converts it to the
> same format, reorders it and then prints it and the reordering in ASCII.
> Each of these steps is sequential and takes place on each rank. The prints
> are ASCII stdout on the ranks.
>   ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,m*n,m*n,5,NULL,&C);CHKERRQ(ierr);
>   /* create the matrix for the five point stencil, YET AGAIN*/
>   for (i=0; i<m; i++) {
>     for (j=0; j<n; j++) {
>       v = -1.0;  Ii = j + n*i;
>       if (i>0)   {J = Ii - n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (i<m-1) {J = Ii + n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (j>0)   {J = Ii - 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (j<n-1) {J = Ii + 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       v = 4.0; ierr = MatSetValues(C,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
>     }
>   }
>   ierr = MatAssemblyBegin(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatConvert(C,MATSAME,MAT_INITIAL_MATRIX,&A);CHKERRQ(ierr);
>   ierr = MatGetOrdering(A,MATORDERINGND,&perm,&iperm);CHKERRQ(ierr);
>   ierr = ISView(perm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
>   ierr = ISView(iperm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
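For readers who want to check the indexing in the quoted loop without building PETSc, here is a minimal plain-C sketch of the same five-point stencil row. It assumes the same row numbering Ii = j + n*i as the quoted code; the helper name stencil_row is hypothetical and is not a PETSc function.

```c
#include <assert.h>

/* Fill cols[] with the column indices of the nonzeros in row Ii = j + n*i
 * of the five-point stencil matrix on an m-by-n grid, and vals[] with the
 * matching entries (-1 for each existing neighbour, 4 on the diagonal).
 * Entries are produced in the same order as the quoted PETSc loop.
 * Returns the number of nonzeros written.
 * stencil_row is a hypothetical helper, not part of PETSc. */
int stencil_row(int m, int n, int i, int j, int cols[5], double vals[5])
{
  int Ii = j + n*i;
  int k  = 0;

  assert(m > 0 && n > 0 && i >= 0 && i < m && j >= 0 && j < n);
  if (i > 0)     { cols[k] = Ii - n; vals[k++] = -1.0; } /* neighbour in row i-1 */
  if (i < m - 1) { cols[k] = Ii + n; vals[k++] = -1.0; } /* neighbour in row i+1 */
  if (j > 0)     { cols[k] = Ii - 1; vals[k++] = -1.0; } /* neighbour in column j-1 */
  if (j < n - 1) { cols[k] = Ii + 1; vals[k++] = -1.0; } /* neighbour in column j+1 */
  cols[k] = Ii; vals[k++] = 4.0;                         /* diagonal entry */
  return k;
}
```

Calling stencil_row(3,3,1,1,cols,vals) describes the centre point of a 3x3 grid, which has all four neighbours; corner points have only two.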
> I think each rank would simply be running the same code and dumping
> everything to its own stdout.
> At some point within the system/MPI executor there is code that merges and
> prints out the stdout of each rank. If the test truly does take 45 minutes,
> then Fugaku has a classic bug of not being able to efficiently merge stdout
> from each of the ranks. Nothing really to do with PETSc, just neglect by the
> Fugaku developers to respect all aspects of developing an HPC system. Heck,
> they only had a billion dollars, can't expect them to do what other
> scalable systems do :-).
> One should be able to reproduce this with a simple MPI program that prints
> a moderate amount of data to stdout on each rank.
>  Barry
> On Mar 6, 2021, at 9:46 PM, Mark Adams <mfadams at lbl.gov> wrote:
> I observed poor scaling with mat/tests/ex13 on Fugaku recently.
> I was running this test as is (e.g., no threads and 4 MPI processes per
> node/chip, which seems to be recommended). I did not dig into this.
> A test with about 10% of the machine took about 45 minutes to run.
> Mark
> On Sat, Mar 6, 2021 at 9:49 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>> On Sat, Mar 6, 2021 at 12:27 PM Matthew Knepley <knepley at buffalo.edu>
>> wrote:
>>> On Fri, Mar 5, 2021 at 4:06 PM Alexei Colin <acolin at isi.edu> wrote:
>>>> To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:
>>>> Is it expected for the mesh distribution step to
>>>> (A) take a share of 50-99% of total time-to-solution of an FEM problem,
>>>> and
>>> No
>>>> (B) take an amount of time that increases with the number of ranks, and
>>> See below.
>>>> (C) take an amount of memory on rank 0 that does not decrease with the
>>>> number of ranks
>>> The problem here is that a serial mesh is being partitioned and sent to
>>> all processes. This is fundamentally
>>> non-scalable, but it is easy and works well for modest clusters < 100
>>> nodes or so. Above this, it will take
>>> increasing amounts of time. There are a few techniques for mitigating
>>> this.
>> Is this one-to-all communication only done once?  If yes, one
>> MPI_Scatterv() is enough and should not cost much.
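As a rough illustration of the single-scatter idea above: MPI_Scatterv takes per-rank send counts and displacements on the root. The sketch below only computes those two arrays for an even block split and makes no MPI calls. It assumes the items to send already sit contiguously in one flat buffer in rank order, which a partitioned unstructured mesh generally would not without a packing step first. The helper name block_partition is hypothetical.

```c
#include <assert.h>

/* Block-partition nitems items over nranks ranks: rank r receives
 * counts[r] consecutive items starting at offset displs[r].
 * The first (nitems % nranks) ranks receive one extra item.
 * The two arrays have the layout MPI_Scatterv expects for its
 * sendcounts and displs arguments.
 * block_partition is a hypothetical helper, not an MPI or PETSc call. */
void block_partition(int nitems, int nranks, int counts[], int displs[])
{
  int base, extra, r, offset = 0;

  assert(nitems >= 0 && nranks > 0);
  base  = nitems / nranks;
  extra = nitems % nranks;
  for (r = 0; r < nranks; r++) {
    counts[r] = base + (r < extra ? 1 : 0);
    displs[r] = offset;
    offset   += counts[r];
  }
}
```

With 10 items and 3 ranks this gives counts of 4, 3, 3 and displacements of 0, 4, 7. Note that the root still has to hold the whole buffer, so this does not address the rank-0 memory question (C).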
>>> a) For simple domains, you can distribute a coarse grid, then regularly
>>> refine that in parallel with DMRefine() or -dm_refine <k>.
>>>     These steps can be repeated easily, and redistribution in parallel
>>> is fast, as shown for example in [1].
>>> b) For complex meshes, you can read them in parallel, and then repeat
>>> a). This is done in [1]. It is a little more involved,
>>>     but not much.
>>> c) You can do a multilevel partitioning, as they do in [2]. I cannot
>>> find the paper in which they describe this right now. It is feasible,
>>>      but definitely the most expert approach.
>>> Does this make sense?
>>>   Thanks,
>>>     Matt
>>> [1] Fully Parallel Mesh I/O using PETSc DMPlex with an Application to
>>> Waveform Modeling, Hapla et al.
>>>       https://arxiv.org/abs/2004.08729
>>> [2] On the robustness and performance of entropy stable discontinuous
>>> collocation methods for the compressible Navier-Stokes equations, Rojas
>>> et al.
>>>       https://arxiv.org/abs/1911.10966
>>>> ?
>>>> The attached plots suggest (A), (B), and (C) are happening for the
>>>> Cahn-Hilliard problem (from firedrake-bench repo) on a 2D 8Kx8K
>>>> unit-square mesh. The implementation is here [1]. Versions are
>>>> Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
>>>> Two questions, one on (A) and the other on (B)+(C):
>>>> 1. Is result (A) expected? Given (A), any effort to improve the quality
>>>> of the compiled assembly kernels (or anything else other than mesh
>>>> distribution) appears futile since it takes 1% of end-to-end execution
>>>> time, or am I missing something?
>>>> 1a. Is mesh distribution fundamentally necessary for any FEM framework,
>>>> or is it only needed by Firedrake? If the latter, then how do other
>>>> frameworks partition the mesh and execute in parallel with MPI but avoid
>>>> the non-scalable mesh distribution step?
>>>> 2. Results (B) and (C) suggest that the mesh distribution step does
>>>> not scale. Is it a fundamental property of the mesh distribution problem
>>>> that it has a central bottleneck in the master process, or is it
>>>> a limitation of the current implementation in PETSc-DMPlex?
>>>> 2a. Our (B) result seems to agree with Figure 4 (left) of [2]. Fig 6 of
>>>> [2] suggests a way to reduce the time spent on the sequential bottleneck
>>>> by "parallel mesh refinement" that creates high-resolution meshes from an
>>>> initial coarse mesh. Is this approach implemented in DMPlex?  If so, any
>>>> pointers on how to try it out with Firedrake? If not, any other
>>>> directions for reducing this bottleneck?
>>>> 2b. Fig 6 in [3] shows plots for Assembly and Solve steps that scale
>>>> well up to 96 cores -- is mesh distribution included in those times?
>>>> Is anyone
>>>> reading this aware of any other publications with evaluations of
>>>> Firedrake that measure mesh distribution (or explain how to avoid or
>>>> exclude it)?
>>>> Thank you for your time and any info or tips.
>>>> [1]
>>>> https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py
>>>> [2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G.
>>>> Knepley, Michael Lange, Gerard J. Gorman, 2015.
>>>> https://arxiv.org/pdf/1506.06194.pdf
>>>> [3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael
>>>> Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman, SISC,
>>>> 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
