[petsc-dev] New implementation of PtAP based on all-at-once algorithm

Fande Kong fdkong.jd at gmail.com
Mon Apr 15 14:17:19 CDT 2019


On Mon, Apr 15, 2019 at 1:06 PM Mark Adams <mfadams at lbl.gov> wrote:

>
>
> On Mon, Apr 15, 2019 at 2:56 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>
>>
>>
>> On Mon, Apr 15, 2019 at 6:49 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Mon, Apr 15, 2019 at 12:41 AM Fande Kong via petsc-dev <
>>> petsc-dev at mcs.anl.gov> wrote:
>>>
>>>> On Fri, Apr 12, 2019 at 7:27 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
>>>>>> petsc-dev at mcs.anl.gov> wrote:
>>>>>> >
>>>>>> > Interesting, nice work.
>>>>>> >
>>>>>> > It would be interesting to get the flop counters working.
>>>>>> >
>>>>>> > This looks like GMG, I assume 3D.
>>>>>> >
>>>>>> > The degree of parallelism is not very realistic. You should
>>>>>> probably run a 10x smaller problem, at least, or use 10x more processes.
>>>>>>
>>>>>>    Why do you say that? He's got his machine with a certain amount of
>>>>>> physical memory per node, are you saying he should ignore/not use 90% of
>>>>>> that physical memory for his simulation?
>>>>>
>>>>>
>>>>> In my experience, 1.5M equations/process is about 50x more than
>>>>> applications typically run, but this is just anecdotal. Some apps are
>>>>> dominated by the linear solver in terms of memory, but some apps use a
>>>>> lot of memory in the physics parts of the code.
>>>>>
>>>>
>>>> The test case solves the multigroup neutron transport equations, where
>>>> each mesh vertex can be associated with a hundred to a thousand
>>>> variables. The mesh is actually small so that it can be handled
>>>> efficiently in the physics part of the code; 90% of the memory is
>>>> consumed by the solver (SNES, KSP, PC). This is the reason I was trying
>>>> to implement a memory-friendly PtAP.
>>>>
>>>>
>>>>> The one app that I can think of where the memory usage is dominated by
>>>>> the solver does something like 10 (pseudo) time steps with pretty hard
>>>>> nonlinear solves, so in the end they are not bound by turnaround time.
>>>>> But they are kind of an odd (academic) application and not very
>>>>> representative of what I see in the broader comp sci community. And
>>>>> these guys do have a scalable code, so instead of waiting a week on the
>>>>> queue to run a 10-hour job that uses 10% of the machine, they wait a day
>>>>> to run a 2-hour job that takes 50% of the machine, because the centers'
>>>>> scheduling policies work that way.
>>>>>
>>>>
>>>> Our code is scalable, but unfortunately we do not have a huge machine.
>>>>
>>>>
>>>>>
>>>>>> He should buy a machine 10x bigger just because it means having fewer
>>>>>> degrees of freedom per node (who's footing the bill for this
>>>>>> purchase?). At INL they run simulations for a purpose, not just for
>>>>>> scalability studies, and there are no dang GPUs or barely used
>>>>>> over-sized monstrosities sitting around to brag about twice a year at SC.
>>>>>>
>>>>>
>>>>> I guess these are the nuke guys. I've never worked with them or seen
>>>>> this kind of complexity analysis in their talks, but OK: if they fill
>>>>> up memory with the solver, then this is representative of a significant
>>>>> (DOE) app.
>>>>>
>>>>
>>>> You do not see the complexity analysis in the talks because most of the
>>>> people at INL live in a different community. I will convince more people
>>>> to give talks in our community in the future.
>>>>
>>>> We focus on nuclear energy simulations that involve multiphysics
>>>> (neutron transport, mechanical contact, computational materials,
>>>> compressible/incompressible flows, two-phase flows, etc.). We are
>>>> developing a flexible, open-source platform that allows different
>>>> physics groups to couple their codes together efficiently:
>>>> https://mooseframework.inl.gov/old
>>>>
>>>
>>> Fande, this is very interesting. Can you tell me:
>>>
>>>   1) A rough estimate of dofs/vertex (or cell or face) depending on
>>> where you put unknowns
>>>
>>
>> The big run (neutron transport equations) posted earlier has 576
>> variables on each mesh vertex. The physics guys think that, at the current
>> stage, 100-1000 variables on each mesh vertex (the number of energy groups
>> times the number of neutron flight directions) will give us an acceptable
>> simulation result; 1000 variables are preferred.
>>
>>
>>
>>>
>>>   2) Are all unknowns on the same vertex coupled together? If not, where
>>> do you specify block sparsity?
>>>
>>
>> Yes, they are physically coupled together through the scattering and
>> fission events. But we are using a matrix-free method, and the variable
>> coupling is ignored in the preconditioning matrix so that the system does
>> not take that much memory.
>>
>
> So the preconditioner looks like 576 independent Laplacian solves,
> mathematically.
>

We could roughly think about it this way, but we are solving the whole
system simultaneously to have good concurrency.
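
To make that concrete for readers who are not PETSc users, here is a minimal
sketch of the kind of setup being described: the fully coupled operator is
applied matrix-free through a shell matrix, while a much sparser assembled
matrix (with the per-vertex variable coupling dropped) is handed to the solver
only for building the preconditioner. The function names and application
context below are hypothetical placeholders, not the actual MOOSE code.

#include <petscksp.h>

/* Hypothetical matrix-free action of the fully coupled transport operator. */
extern PetscErrorCode ApplyTransportOperator(Mat A, Vec x, Vec y);

/* Use a shell (matrix-free) Amat for the Krylov method, and the assembled,
   sparser Pmat (inter-variable coupling dropped) only for the preconditioner. */
PetscErrorCode SetupMatrixFreeSolve(KSP ksp, Mat Pmat, PetscInt nlocal,
                                    PetscInt nglobal, void *appctx)
{
  PetscErrorCode ierr;
  Mat            Amat;

  PetscFunctionBeginUser;
  ierr = MatCreateShell(PETSC_COMM_WORLD, nlocal, nlocal, nglobal, nglobal,
                        appctx, &Amat);CHKERRQ(ierr);
  ierr = MatShellSetOperation(Amat, MATOP_MULT,
                              (void (*)(void))ApplyTransportOperator);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, Amat, Pmat);CHKERRQ(ierr); /* Amat applied, Pmat preconditions */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

At the nonlinear level, the -snes_mf_operator option gives the analogous
behavior (matrix-free action for the operator, assembled matrix for the
preconditioner).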

> So you could reorder your equations and see a block diagonal matrix with
> 576 blocks, right?
>

I am not sure I understand the question correctly. For each mesh vertex, we
have a 576x576 diagonal matrix. The unknowns are ordered in this way:
v0, v1, ..., v575 for vertex 1, then another 576 variables for mesh vertex 2,
and so on.
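
In other words, the unknowns are interlaced per vertex. As a hypothetical
illustration (the helper names below are made up, not from our code), the
resulting index mapping, and how the logical block size could be recorded on
the preconditioning matrix, look like this:

#include <petscmat.h>

#define NVARS_PER_VERTEX 576 /* variables per mesh vertex in the run above */

/* With the interlaced ordering (all 576 unknowns of vertex 0, then those of
   vertex 1, and so on), the global index of component c on vertex v is: */
static inline PetscInt GlobalDofIndex(PetscInt v, PetscInt c)
{
  return v * NVARS_PER_VERTEX + c;
}

/* Optionally record the logical block size on the (AIJ) preconditioning
   matrix. Note that each 576x576 vertex block is itself diagonal here, since
   the inter-variable coupling is dropped, so a dense-blocked (BAIJ) storage
   format would not be a good fit. */
PetscErrorCode RecordBlockSize(Mat Pmat)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatSetBlockSize(Pmat, NVARS_PER_VERTEX);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}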

Thanks,

Fande,


>
>>
>>
>>>
>>>   3) How are the coefficients from the equation discretized on the mesh?
>>>
>>
>> The coefficients (often referred to as cross sections by the neutron
>> guys) can be different for each variable, and they depend entirely on the
>> reactor configuration. My current simulation indeed uses heterogeneous
>> materials.
>>
>> I actually have a preprint that presents more details on the simulation.
>> https://arxiv.org/abs/1903.03659
>>
>> Thanks,
>>
>> Fande,
>>
>>
>>
>>
>>>
>>>   Thanks!
>>>
>>>      Matt
>>>
>>>
>>>> Thanks,
>>>>
>>>> Fande,
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>> > I guess it does not matter. This is basically like a one-node run
>>>>>> because the subdomains are so large.
>>>>>> >
>>>>>> > And are you sure the numerics are the same with and without hypre?
>>>>>> Hypre is 15x slower. Any ideas what is going on?
>>>>>> >
>>>>>> > It might be interesting to scale this test down to a node to see if
>>>>>> this is from communication.
>>>>>> >
>>>>>> > Again, nice work,
>>>>>> > Mark
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>> wrote:
>>>>>> > Hi Developers,
>>>>>> >
>>>>>> > I just want to share some good news. It is known that the PETSc
>>>>>> "ptap-scalable" algorithm takes too much memory for some applications
>>>>>> because it needs to build intermediate data structures. Following
>>>>>> Mark's suggestions, I implemented the all-at-once algorithm, which
>>>>>> does not cache any intermediate data.
>>>>>> >
>>>>>> > I did some comparisons: the new implementation is actually scalable
>>>>>> in terms of both memory usage and compute time, even though it is
>>>>>> still slower than "ptap-scalable". Some memory profiling results are
>>>>>> in the attachments. The new all-at-once implementation uses a similar
>>>>>> amount of memory as hypre, but it is way faster than hypre.
>>>>>> >
>>>>>> > For example, for a problem with 14,893,346,880 unknowns using
>>>>>> 10,000 processor cores, here are the timing results:
>>>>>> >
>>>>>> > Hypre algorithm:
>>>>>> >
>>>>>> > MatPtAP               50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>>>> > MatPtAPSymbolic       50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>> > MatPtAPNumeric        50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>>>> >
>>>>>> > PETSc scalable PtAP:
>>>>>> >
>>>>>> > MatPtAP               50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>>>>> > MatPtAPSymbolic       50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9     0
>>>>>> > MatPtAPNumeric        50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>>>> >
>>>>>> > New implementation of the all-at-once algorithm:
>>>>>> >
>>>>>> > MatPtAP               50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17     0
>>>>>> > MatPtAPSymbolic       50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6     0
>>>>>> > MatPtAPNumeric        50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11     0
>>>>>> >
>>>>>> >
>>>>>> > You can see here that all-at-once is a bit slower than
>>>>>> ptap-scalable, but it uses much less memory.
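
For anyone who wants to reproduce this comparison, a minimal sketch of the
call involved follows. MatPtAP is what the multigrid setup uses to form the
coarse operator C = P^T A P; the internal algorithm is normally selected with
a runtime option, and the option value "allatonce" mentioned in the comments
is an assumption based on this thread, not a confirmed option name.

#include <petscmat.h>

/* Form C = P^T * A * P explicitly, as an algebraic multigrid setup does when
   building a coarse-level operator. Which internal algorithm is used
   (scalable, nonscalable, hypre, or the new all-at-once variant) would be
   chosen at run time, e.g. with something like -matptap_via allatonce. */
PetscErrorCode FormCoarseOperator(Mat A, Mat P, Mat *C)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, C);CHKERRQ(ierr); /* fill estimate 2.0 */
  PetscFunctionReturn(0);
}

On subsequent rebuilds of the preconditioner, passing MAT_REUSE_MATRIX instead
of MAT_INITIAL_MATRIX reuses the existing C instead of allocating a new one.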
>>>>>> >
>>>>>> >
>>>>>> > Fande
>>>>>> >
>>>>>>
>>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
>>>
>>