[petsc-dev] New implementation of PtAP based on all-at-once algorithm

Mon Apr 15 15:58:55 CDT 2019

On Mon, Apr 15, 2019 at 2:56 PM Fande Kong <fdkong.jd at gmail.com> wrote:

> On Mon, Apr 15, 2019 at 6:49 AM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Mon, Apr 15, 2019 at 12:41 AM Fande Kong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>>
>>> On Fri, Apr 12, 2019 at 7:27 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
>>>>> petsc-dev at mcs.anl.gov> wrote:
>>>>> >
>>>>> > Interesting, nice work.
>>>>> >
>>>>> > It would be interesting to get the flop counters working.
>>>>> >
>>>>> > This looks like GMG, I assume 3D.
>>>>> >
>>>>> > The degree of parallelism is not very realistic. You should probably
>>>>> run a 10x smaller problem, at least, or use 10x more processes.
>>>>>
>>>>>    Why do you say that? He's got his machine with a certain amount of
>>>>> physical memory per node, are you saying he should ignore/not use 90% of
>>>>> that physical memory for his simulation?
>>>>
>>>>
>>>> In my experience 1.5M equations/process is about 50x more than
>>>> applications run, but this is just anecdotal. Some apps are dominated by
>>>> the linear solver in terms of memory but some apps use a lot of memory in
>>>> the physics parts of the code.
>>>>
>>>
>>> The test case is solving the multigroup neutron transport equations
>>> where each mesh vertex could be associated with a hundred or a thousand
>>> variables. The mesh is actually small so that it can be handled efficiently
>>> in the physics part of the code. 90% of the memory is consumed by the
>>> solver (SNES, KSP, PC). This is the reason I was trying to implement a
>>> memory friendly PtAP.
>>>
>>>
>>>> The one app that I can think of where the memory usage is dominated by
>>>> the solver does like 10 (pseudo) time steps with pretty hard nonlinear
>>>> solves, so in the end they are not bound by turnaround time. But they are
>>>> kind of an odd (academic) application and not very representative of what I
>>>> see in the broader comp sci community. And these guys do have a scalable
>>>> code so instead of waiting a week on the queue to run a 10 hour job that
>>>> uses 10% of the machine, they wait a day to run a 2 hour job that takes 50%
>>>> of the machine because centers' scheduling policies work that way.
>>>>
>>>
>>> Our code is scalable but we do not have a huge machine unfortunately.
>>>
>>>
>>>>
>>>>> He should buy a machine 10x bigger just because it means having fewer
>>>>> degrees of freedom per node (who's footing the bill for this purchase?). At
>>>>> INL they run simulations for a purpose, not just for scalability studies
>>>>> and there are no dang GPUs or barely used over-sized monstrosities sitting
>>>>> around to brag about twice a year at SC.
>>>>>
>>>>
>>>> I guess they are the nuke guys. I've never worked with them or seen this
>>>> kind of complexity analysis in their talks, but OK if they fill up memory
>>>> with the solver then this is representative of a significant (DOE) app.
>>>>
>>>
>>> You do not see the complexity analysis in the talks because most of the
>>> people at INL live in a different community.  I will convince more people
>>> to give talks in our community in the future.
>>>
>>> We focus on the nuclear energy simulations that involve multiphysics
>>> (neutron transport, mechanics contact, computational materials,
>>> compressible/incompressible flows, two-phase flows, etc.). We are
>>> developing a flexible platform (open source) that allows different physics
>>> guys to couple their code together efficiently.
>>> https://mooseframework.inl.gov/old
>>>
>>
>> Fande, this is very interesting. Can you tell me:
>>
>>   1) A rough estimate of dofs/vertex (or cell or face) depending on where
>> you put unknowns
>>
>
> The big run (Neutron transport equations) posted earlier has 576 variables
> on each mesh vertex. Physics guys think at the current stage 100-1000
> variables (the number of energy groups times the number of neutron flying
> directions) on each mesh vertex will give us an acceptable simulation
> result.  1000 variables are preferred.
>

Thanks so much for these numbers. This is very interesting.
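
A quick sanity check on these numbers, as a rough sketch rather than anything
taken from the actual runs. It assumes the 14,893,346,880-unknown problem
quoted further down is the same big run with 576 variables per vertex, and
that "10,000 processor cores" means one MPI process per core. The function
name is made up for illustration.

```python
# Sanity-check arithmetic for the problem sizes quoted in this thread.
TOTAL_UNKNOWNS = 14_893_346_880  # from the timing results quoted below
DOFS_PER_VERTEX = 576            # variables per mesh vertex in the big run
NUM_PROCESSES = 10_000           # assumption: one MPI process per core


def mesh_vertices(total_unknowns, dofs_per_vertex):
    """Number of mesh vertices, assuming every vertex carries the same dofs."""
    vertices, remainder = divmod(total_unknowns, dofs_per_vertex)
    if remainder != 0:
        raise ValueError("unknowns are not a multiple of dofs per vertex")
    return vertices


vertices = mesh_vertices(TOTAL_UNKNOWNS, DOFS_PER_VERTEX)
unknowns_per_process = TOTAL_UNKNOWNS / NUM_PROCESSES
vertices_per_process = vertices / NUM_PROCESSES

print(f"mesh vertices:        {vertices:,}")
print(f"unknowns per process: {unknowns_per_process:,.0f}")
print(f"vertices per process: {vertices_per_process:,.1f}")
```

The unknown count divides evenly by 576, giving about 25.9M vertices, so each
process holds roughly 1.5M unknowns on only about 2,600 vertices. That is
consistent with both the "1.5M equations/process" figure and the remark that
the mesh itself is small.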

>
>
>>
>>   2) Are all unknowns on the same vertex coupled together? If not, where
>> do you specify block sparsity?
>>
>
> Yes, they are physically coupled together through the scattering and the
> fission events. But we are using the matrix-free method, and the variable
> coupling is ignored in the preconditioning matrix so that the system won't
> take that much memory.
>

I am interested in the coupling purely for problem definition. I see the PC
is just Jacobi.

>
>>   3) How are the coefficients from the equation discretized on the mesh?
>>
>
> The coefficients (often referred to as cross sections for neutron guys)
> could be different for each variable, and they totally depend on the
> reactor configuration. My current simulation indeed uses heterogeneous
> materials.
>

What I mean is: are the coefficients constant on a cell (P0), defined by
vertex values (P1), etc.?
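
To make the distinction concrete, here is a minimal 1D sketch. The mesh, the
coefficient values, and the function names are all hypothetical and have
nothing to do with the actual code. A P0 coefficient stores one constant per
cell, while a P1 coefficient stores values at vertices and interpolates
linearly inside each cell.

```python
def eval_p0(cell_values, cell, xi):
    """P0: piecewise constant, one value per cell (xi is ignored)."""
    return cell_values[cell]


def eval_p1(vertex_values, cell, xi):
    """P1 on a 1D mesh: linear interpolation between the cell's two vertices.

    Cell i spans vertices i and i+1; xi in [0, 1] is the local coordinate.
    """
    left, right = vertex_values[cell], vertex_values[cell + 1]
    return (1.0 - xi) * left + xi * right


# Hypothetical coefficient data on a mesh with 2 cells and 3 vertices.
sigma_p0 = [1.0, 3.0]       # one value per cell
sigma_p1 = [1.0, 2.0, 4.0]  # one value per vertex

# Evaluate both representations a quarter of the way into the second cell.
print(eval_p0(sigma_p0, cell=1, xi=0.25))
print(eval_p1(sigma_p1, cell=1, xi=0.25))
```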

> I actually have a preprint that presents more details on the simulation.
> https://arxiv.org/abs/1903.03659
>

I will read it.

  Thanks,

    Matt

> Thanks,
>
> Fande,
>
>
>
>
>>
>>   Thanks!
>>
>>      Matt
>>
>>
>>> Thanks,
>>>
>>> Fande,
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>    Barry
>>>>>
>>>>>
>>>>>
>>>>> > I guess it does not matter. This is basically like a one node run
>>>>> because the subdomains are so large.
>>>>> >
>>>>> > And are you sure the numerics are the same with and without hypre?
>>>>> Hypre is 15x slower. Any ideas what is going on?
>>>>> >
>>>>> > It might be interesting to scale this test down to a node to see if
>>>>> this is from communication.
>>>>> >
>>>>> > Again, nice work,
>>>>> > Mark
>>>>> >
>>>>> >
>>>>> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com>
>>>>> wrote:
>>>>> > Hi Developers,
>>>>> >
>>>>> > I just want to share some good news.  It is known that PETSc-ptap-scalable
>>>>> takes too much memory for some applications because it needs to build
>>>>> intermediate data structures.  Following Mark's suggestions, I
>>>>> implemented the all-at-once algorithm that does not cache any intermediate
>>>>> data.
>>>>> >
>>>>> > I did some comparisons; the new implementation is actually scalable
>>>>> in terms of the memory usage and the compute time even though it is still
>>>>> slower than "ptap-scalable".   There are some memory profiling results (see
>>>>> the attachments). The new all-at-once implementation uses a similar amount
>>>>> of memory to hypre, but it is way faster than hypre.
>>>>> >
>>>>> > For example, for a problem with 14,893,346,880 unknowns using 10,000
>>>>> processor cores, here are the timing results:
>>>>> >
>>>>> > Hypre algorithm:
>>>>> >
>>>>> > MatPtAP               50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>>> > MatPtAPSymbolic       50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> > MatPtAPNumeric        50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>>>> >
>>>>> > PETSc scalable PtAP:
>>>>> >
>>>>> > MatPtAP               50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>>>> > MatPtAPSymbolic       50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9     0
>>>>> > MatPtAPNumeric        50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>>> >
>>>>> > New implementation of the all-at-once algorithm:
>>>>> >
>>>>> > MatPtAP               50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17     0
>>>>> > MatPtAPSymbolic       50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6     0
>>>>> > MatPtAPNumeric        50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11     0
>>>>> >
>>>>> >
>>>>> > You can see here the all-at-once is a bit slower than ptap-scalable,
>>>>> but it uses much less memory.
>>>>> >
>>>>> >
>>>>> > Fande
>>>>> >
>>>>>
>>>>>
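
For comparing the three timings above, a small sketch that pulls the max time
out of each MatPtAP line and forms the ratios. It assumes the usual -log_view
column order (event name, count, count ratio, max time, time ratio, ...) and
uses the three MatPtAP lines rejoined onto single lines. The helper name and
the dictionary are made up for illustration.

```python
import re

# The three MatPtAP summary lines from the results above, one per algorithm.
LOG_LINES = {
    "hypre": "MatPtAP 50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 "
             "3.3e+04 6.0e+02 33 0 1 0 17 33 0 1 0 17 0",
    "ptap-scalable": "MatPtAP 50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 "
                     "2.0e+05 7.5e+02 2 1 4 6 20 2 1 4 6 20 129418",
    "all-at-once": "MatPtAP 50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 "
                   "1.4e+05 6.0e+02 4 0 7 7 17 4 0 7 7 17 0",
}

# One number in the d.dddde+XX style used by the log. Matching with a regex
# (instead of float() on the whole token) also copes with a time that has run
# into the following ratio column, as in "2.3969e-0213.0".
SCI_NUMBER = re.compile(r"\d\.\d+e[+-]\d{2}")


def max_time_seconds(line):
    """Return the max time of an event line (4th whitespace-separated field)."""
    token = line.split()[3]
    match = SCI_NUMBER.match(token)
    if match is None:
        raise ValueError(f"unexpected time field: {token!r}")
    return float(match.group())


times = {name: max_time_seconds(line) for name, line in LOG_LINES.items()}

for name, seconds in times.items():
    print(f"{name:14s} {seconds:9.2f} s")

print(f"hypre / all-at-once:         {times['hypre'] / times['all-at-once']:.1f}x")
print(f"hypre / ptap-scalable:       {times['hypre'] / times['ptap-scalable']:.1f}x")
print(f"all-at-once / ptap-scalable: {times['all-at-once'] / times['ptap-scalable']:.1f}x")
```

With these numbers hypre comes out roughly 16x slower than all-at-once and
roughly 31x slower than ptap-scalable, while all-at-once is a bit under 2x
slower than ptap-scalable.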
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>
