[petsc-dev] New implementation of PtAP based on all-at-once algorithm
Mark Adams
mfadams at lbl.gov
Mon Apr 15 09:42:07 CDT 2019
>
>
> I guess you are interested in the performance of the new algorithms on
> small problems. I will try to test a petsc example such as
> mat/examples/tests/ex96.c.
>
It's not a big deal. And the fact that they are similar on one node tells
us the kernels are similar.
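If you do get to ex96, something like this should work to build and run it
(I have not checked that example's command line options, so take the exact
invocation as a sketch):

  cd $PETSC_DIR/src/mat/examples/tests
  make ex96
  mpiexec -n 4 ./ex96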
>
>
>>
>> And are you sure the numerics are the same with and without hypre? Hypre
>> is 15x slower. Any ideas what is going on?
>>
>
> Hypre performs pretty well when the number of processor cores is small (a
> couple of hundred). I guess the issue is related to how they handle the
> communications.
>
>
>>
>> It might be interesting to scale this test down to a node to see if this
>> is from communication.
>>
>
I wonder if their symbolic setup is getting called every time. You do
50 solves, it looks like, and that should be enough to amortize a one-time
setup cost.
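On the PETSc side the symbolic phase is amortized by reusing the product
matrix; a minimal sketch of the reuse pattern (A, P, and AP here are just
placeholder names for already-assembled Mat objects):

  Mat AP = NULL;
  /* first call: symbolic + numeric, allocates AP */
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &AP);CHKERRQ(ierr);
  /* later calls: numeric only, the symbolic structure of AP is reused */
  ierr = MatPtAP(A, P, MAT_REUSE_MATRIX, PETSC_DEFAULT, &AP);CHKERRQ(ierr);

If hypre redoes the equivalent of the symbolic phase on every one of the 50
calls, that alone could explain a large constant factor.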
Does PETSc do any clever scalability tricks? You just pack and send
point-to-point messages, I would think, but maybe Hypre is doing something
bad. I have seen Hypre scale out to large machines, but on synthetic
problems.
So this is a realistic problem. Can you run with -info and grep on GAMG and
send me the ~20 lines of output? You will be able to see info about each
level, like the number of equations and the average nnz/row.
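That is, something like this (your_app standing in for your actual
executable and options):

  mpiexec -n <np> ./your_app <your options> -info | grep GAMG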
>
> Hypre performs similarly to petsc on a single compute node.
>
>
> Fande,
>
>
>>
>> Again, nice work,
>> Mark
>>
>>
>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>
>>> Hi Developers,
>>>
>>> I just want to share some good news. It is known that PETSc-ptap-scalable
>>> takes too much memory for some applications because it needs to build
>>> intermediate data structures. Following Mark's suggestions, I implemented
>>> the all-at-once algorithm, which does not cache any intermediate data.
>>>
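For anyone following along: the coarse operator is the Galerkin product
C = P^T * A * P, i.e. C(j,l) = sum_i sum_k P(i,j) * A(i,k) * P(k,l). The
"all-at-once" idea, as I understand it, is to fold each entry of A straight
into C instead of storing A*P or P^T*A as an intermediate. A toy dense
version just to show the accumulation order; the real sparse MPI code is of
course very different:

  /* C = P^T * A * P for an n-by-n A and n-by-m P, all row-major dense */
  static void ptap_all_at_once(int n, int m, const double *A,
                               const double *P, double *C)
  {
    for (int x = 0; x < m * m; x++) C[x] = 0.0;
    for (int i = 0; i < n; i++) {
      for (int k = 0; k < n; k++) {
        const double a = A[i * n + k];
        if (a == 0.0) continue;               /* skip zero entries of A */
        for (int j = 0; j < m; j++) {
          const double pa = P[i * m + j] * a; /* (P^T)(j,i)*A(i,k), never stored */
          for (int l = 0; l < m; l++)
            C[j * m + l] += pa * P[k * m + l]; /* accumulate directly into C */
        }
      }
    }
  }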
>>> I did some comparisons: the new implementation is actually scalable in
>>> terms of memory usage and compute time, even though it is still slower
>>> than "ptap-scalable". Some memory profiling results are in the
>>> attachments. The new all-at-once implementation uses a similar amount of
>>> memory as hypre, but it is way faster than hypre.
>>>
>>> For example, for a problem with 14,893,346,880 unknowns on 10,000
>>> processor cores (about 1.5 million unknowns per core), here are the
>>> timing results:
>>>
>>> Hypre algorithm:
>>>
>>> MatPtAP              50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>>> MatPtAPSymbolic      50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
>>> MatPtAPNumeric       50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>>>
>>> PETSc scalable PtAP:
>>>
>>> MatPtAP              50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>> MatPtAPSymbolic      50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
>>> MatPtAPNumeric       50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>
>>> New implementation of the all-at-once algorithm:
>>>
>>> MatPtAP              50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
>>> MatPtAPSymbolic      50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
>>> MatPtAPNumeric       50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
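For reference on reading these -log_view rows: the columns are call count
and max/min ratio, max time in seconds and ratio, max flops and ratio,
number of messages, average message length, number of reductions, then the
percentages of total time/flop/messages/message-length/reductions for the
whole run and for the stage, and finally the flop rate in Mflop/s. Note the
hypre symbolic event records essentially no time, presumably because the
hypre interface does all the work inside the numeric phase.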
>>>
>>>
>>> You can see here that the all-at-once algorithm is a bit slower than
>>> ptap-scalable, but it uses much less memory.
>>>
>>>
>>> Fande
>>>
>>>
>>