[petsc-dev] New implementation of PtAP based on all-at-once algorithm
Fande Kong
fdkong.jd at gmail.com
Sun Apr 14 23:22:15 CDT 2019
Thanks for your reply, Mark,
On Thu, Apr 11, 2019 at 8:06 PM Mark Adams <mfadams at lbl.gov> wrote:
> Interesting, nice work.
>
> It would be interesting to get the flop counters working.
>
> This looks like GMG, I assume 3D.
>
> The degree of parallelism is not very realistic.
>
We are actually working on real physics simulations; this is not a benchmark at all.
> You should probably run a 10x smaller problem, at least,
>
This is the coarsest mesh that can resolve the geometry completely, so
I cannot make the problem smaller.
> or use 10x more processes.
>
I do not have such a machine :(.
> I guess it does not matter. This is basically like a one-node run because the
> subdomains are so large.
>
I guess you are interested in the performance of the new algorithms on
small problems. I will try to test a PETSc example such as
mat/examples/tests/ex96.c.
>
> And are you sure the numerics are the same with and without hypre? Hypre
> is 15x slower. Any ideas what is going on?
>
Hypre performs pretty well when the number of processor cores is small (a
couple of hundred). I guess the issue is related to how it handles
communication.
>
> It might be interesting to scale this test down to a node to see if this
> is from communication.
>
Hypre performs similarly to PETSc on a single compute node.
Fande,
>
> Again, nice work,
> Mark
>
>
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>
>> Hi Developers,
>>
>> I just want to share some good news. PETSc's ptap-scalable is known to
>> take too much memory for some applications because it needs to build
>> intermediate data structures. Following Mark's suggestions, I
>> implemented the all-at-once algorithm, which does not cache any intermediate
>> data.
>>
>> I did some comparisons. The new implementation is actually scalable in
>> terms of both memory usage and compute time, even though it is still
>> slower than "ptap-scalable". There are some memory profiling results (see
>> the attachments). The new all-at-once implementation uses a similar amount
>> of memory to hypre, but it is way faster than hypre.
>>
>> For example, for a problem with 14,893,346,880 unknowns using 10,000
>> processor cores, here are the timing results:
>>
>> Hypre algorithm:
>>
>> MatPtAP            50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>> MatPtAPSymbolic    50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
>> MatPtAPNumeric     50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>>
>> PETSc scalable PtAP:
>>
>> MatPtAP            50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>> MatPtAPSymbolic    50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
>> MatPtAPNumeric     50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>
>> New implementation of the all-at-once algorithm:
>>
>> MatPtAP            50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
>> MatPtAPSymbolic    50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
>> MatPtAPNumeric     50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>>
>>
>> You can see here that all-at-once is a bit slower than ptap-scalable, but
>> it uses much less memory.
>>
>>
>> Fande
>>
>>
>
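To make the relative timings in the quoted message easier to read off, here is a small sketch in Python that just divides the reported MatPtAP wall-clock times (50 calls each) by one another. The numbers are copied from the log excerpts in this thread; the names `ptap_time_s`, `slowdown`, and `per_call` are illustrative only and are not part of PETSc or hypre. It says nothing about memory usage, which is only reported in the attachments.

```python
# Total MatPtAP time in seconds over 50 calls, copied from the
# log excerpts quoted in this thread (10,000 cores, ~14.9e9 unknowns).
ptap_time_s = {
    "hypre": 3.5353e+03,
    "scalable": 1.1453e+02,
    "allatonce": 2.2153e+02,
}
CALLS = 50  # the "Count" column in the logs


def slowdown(slow, fast, times=ptap_time_s):
    """Return how many times slower `slow` is than `fast` (hypothetical helper)."""
    return times[slow] / times[fast]


def per_call(name, times=ptap_time_s, calls=CALLS):
    """Return the average time in seconds of a single MatPtAP call."""
    return times[name] / calls


if __name__ == "__main__":
    for slow, fast in [("hypre", "allatonce"),
                       ("hypre", "scalable"),
                       ("allatonce", "scalable")]:
        print(f"{slow} vs {fast}: {slowdown(slow, fast):.2f}x slower")
    for name in ptap_time_s:
        print(f"{name}: {per_call(name):.2f} s per MatPtAP call")
```

With these inputs, hypre comes out roughly 16x slower than all-at-once and roughly 31x slower than ptap-scalable, while all-at-once is a little under 2x slower than ptap-scalable.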