[petsc-dev] New implementation of PtAP based on all-at-once algorithm
Fande Kong
fdkong.jd at gmail.com
Sun Apr 14 23:39:35 CDT 2019
On Fri, Apr 12, 2019 at 7:27 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>
> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>>
>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>> >
>> > Interesting, nice work.
>> >
>> > It would be interesting to get the flop counters working.
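[Editor's note: the remark about flop counters presumably refers to the zero flop columns in the logs quoted further down. A minimal sketch, assuming the standard PETSc logging API, of how a hand-written numeric kernel can report its work so that -log_view attributes nonzero flops to it; the helper name and the nflops estimate are hypothetical, not from this thread.]

#include <petscsys.h>

/* Hypothetical helper: report an estimated flop count for a hand-written
   numeric kernel so that -log_view shows nonzero flops for it. */
static PetscErrorCode LogKernelFlops(PetscLogDouble nflops)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogFlops(nflops);CHKERRQ(ierr); /* accumulated into the active log stage/event */
  PetscFunctionReturn(0);
}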
>> >
>> > This looks like GMG, I assume 3D.
>> >
>> > The degree of parallelism is not very realistic. You should probably
>> run a 10x smaller problem, at least, or use 10x more processes.
>>
>> Why do you say that? He's got his machine with a certain amount of
>> physical memory per node, are you saying he should ignore/not use 90% of
>> that physical memory for his simulation?
>
>
> In my experience, 1.5M equations/process is about 50x more than applications
> typically run, but this is just anecdotal. Some apps are dominated by the linear
> solver in terms of memory, but some apps use a lot of memory in the physics
> parts of the code.
>
The test case is solving the multigroup neutron transport equations, where
each mesh vertex can be associated with hundreds or thousands of
variables. The mesh is actually small so that it can be handled efficiently
in the physics part of the code. 90% of the memory is consumed by the
solver (SNES, KSP, PC). This is the reason I was trying to implement a
memory-friendly PtAP.
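[Editor's note: for context, the operation in question is the Galerkin triple product Ac = P^T A P used to build each coarse-level operator. Below is a minimal sketch of the public PETSc call that the competing algorithms sit behind; the helper name, the fill estimate of 2.0, and the assumption that A and P are already assembled MPIAIJ matrices are illustrative, not taken from the thread.]

#include <petscmat.h>

/* Hypothetical helper: form the Galerkin coarse operator Ac = P^T * A * P.
   A (fine-level operator) and P (prolongation) are assumed to be assembled
   MPIAIJ matrices; 2.0 is a rough fill estimate for the symbolic phase. */
PetscErrorCode FormCoarseOperator(Mat A, Mat P, Mat *Ac)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, Ac);CHKERRQ(ierr);
  /* On subsequent calls with an unchanged sparsity pattern, pass
     MAT_REUSE_MATRIX so the symbolic data is reused. */
  PetscFunctionReturn(0);
}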
> The one app that I can think of where the memory usage is dominated by the
> solver does about 10 (pseudo) time steps with pretty hard nonlinear solves,
> so in the end they are not bound by turnaround time. But they are kind of an
> odd (academic) application and not very representative of what I see in the
> broader comp sci community. And these guys do have a scalable code, so
> instead of waiting a week in the queue to run a 10-hour job that uses 10%
> of the machine, they wait a day to run a 2-hour job that takes 50% of the
> machine, because centers' scheduling policies work that way.
>
Our code is scalable, but unfortunately we do not have a huge machine.
>
>> He should buy a machine 10x bigger just because it means having fewer
>> degrees of freedom per node (who's footing the bill for this purchase?). At
>> INL they run simulations for a purpose, not just for scalability studies,
>> and there are no dang GPUs or barely used over-sized monstrosities sitting
>> around to brag about twice a year at SC.
>>
>
> I guess they are the nuke guys. I've never worked with them or seen this
> kind of complexity analysis in their talks, but OK, if they fill up memory
> with the solver then this is representative of a significant (DOE) app.
>
You do not see the complexity analysis in the talks because most of the
people at INL live in a different community. I will encourage more people
to give talks in our community in the future.
We focus on nuclear energy simulations that involve multiphysics
(neutron transport, mechanics contact, computational materials,
compressible/incompressible flows, two-phase flows, etc.). We are
developing a flexible, open-source platform that allows different physics
groups to couple their codes together efficiently:
https://mooseframework.inl.gov/old
Thanks,
Fande,
>
>
>>
>> Barry
>>
>>
>>
>> > I guess it does not matter. This is basically like a one-node run because
>> the subdomains are so large.
>> >
>> > And are you sure the numerics are the same with and without hypre?
>> Hypre is 15x slower. Any ideas what is going on?
>> >
>> > It might be interesting to scale this test down to a node to see if
>> this is from communication.
>> >
>> > Again, nice work,
>> > Mark
>> >
>> >
>> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>> > Hi Developers,
>> >
>> > I just want to share some good news. It is known that PETSc's scalable
>> PtAP takes too much memory for some applications because it needs to build
>> intermediate data structures. Following Mark's suggestions, I
>> implemented an all-at-once algorithm that does not cache any intermediate
>> data.
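[Editor's note: a sketch of how a user might switch between the implementations at runtime. The option name "-matptap_via" and the value strings ("scalable", "allatonce", "hypre") are an assumption about how the new code is exposed and may not match the exact names that land in the branch; check the installed PETSc version's documentation.]

#include <petscsys.h>

/* Hypothetical: pre-set the PtAP implementation in the global options
   database before the preconditioner is set up.  Intended to be equivalent
   to passing -matptap_via <algorithm> on the command line, assuming that
   option name exists in this PETSc version. */
static PetscErrorCode SelectPtAPAlgorithm(const char *algorithm)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscOptionsSetValue(NULL, "-matptap_via", algorithm);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* e.g. SelectPtAPAlgorithm("allatonce"); before SNES/KSP setup */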
>> >
>> > I did some comparisons: the new implementation is actually scalable in
>> terms of both memory usage and compute time, even though it is still
>> slower than "ptap-scalable". Some memory profiling results are attached.
>> The new all-at-once implementation uses a similar amount
>> of memory to hypre, but it is way faster than hypre.
>> >
>> > For example, for a problem with 14,893,346,880 unknowns on 10,000
>> processor cores, here are the timing results:
>> >
>> > Hypre algorithm:
>> >
>> > MatPtAP              50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>> > MatPtAPSymbolic      50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
>> > MatPtAPNumeric       50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>> >
>> > PETSc scalable PtAP:
>> >
>> > MatPtAP              50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>> > MatPtAPSymbolic      50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
>> > MatPtAPNumeric       50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>> >
>> > New implementation of the all-at-once algorithm:
>> >
>> > MatPtAP              50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
>> > MatPtAPSymbolic      50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
>> > MatPtAPNumeric       50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>> >
>> >
>> > You can see here that the all-at-once algorithm is a bit slower than
>> ptap-scalable, but it uses much less memory.
>> >
>> >
>> > Fande
>> >
>>
>>