[petsc-dev] New implementation of PtAP based on all-at-once algorithm

Fande Kong fdkong.jd at gmail.com
Sun Apr 14 23:43:57 CDT 2019


On Fri, Apr 12, 2019 at 10:50 AM Zhang, Hong via petsc-dev <
petsc-dev at mcs.anl.gov> wrote:

> I would suggest Fande add this new implementation into petsc.
>

Sure, I will. I still want to try a couple of ideas to improve the
performance. The code will be pushed once I am happy with the
performance.



> What is the algorithm?
>

Using Jed's pseudocode:

 # Computes C = P^T * A * P one row of A at a time,
 # without assembling the intermediate product A*P.
 for i:                       # rows of A (and of P)
    v[:] = 0                  # sparse work vector: row i of A*P
    for j:                    # nonzero columns of A[i,:]
      v[:] += A[i,j] * P[j,:]
    for I:                    # nonzero columns of P[i,:]
      C[I,:] += P[i,I] * v[:]
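
For anyone who wants to play with the idea outside of PETSc, here is a minimal
Python/scipy sketch of the same loop. It is only an illustration of the
all-at-once idea (the function name, the dense work vector, and the LIL
accumulation are illustration choices of mine), not the PETSc implementation:

    import numpy as np
    import scipy.sparse as sp

    def ptap_all_at_once(A, P):
        # C = P^T * A * P, built one row of A at a time so that the
        # intermediate product A*P is never assembled.
        A, P = A.tocsr(), P.tocsr()
        n, m = P.shape
        C = sp.lil_matrix((m, m))
        v = np.zeros(m)                                # work vector: row i of A*P
        for i in range(n):
            v[:] = 0.0
            for k in range(A.indptr[i], A.indptr[i + 1]):
                j = A.indices[k]                       # nonzero column of A[i,:]
                v += A.data[k] * P.getrow(j).toarray().ravel()
            for k in range(P.indptr[i], P.indptr[i + 1]):
                I = P.indices[k]                       # nonzero column of P[i,:]
                C[I, :] = C[I, :] + P.data[k] * v      # scatter into row I of C
        return C.tocsr()

    # quick sanity check against the two-pass product P^T * (A*P)
    A = sp.random(200, 200, density=0.05, format='csr')
    P = sp.random(200, 40, density=0.10, format='csr')
    assert abs(ptap_all_at_once(A, P) - P.T @ A @ P).max() < 1e-10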



> I'll try to see if I can further reduce memory consumption of the current
> symbolic PtAP when I get time.
>

Cool!

Fande,


> Hong
>
> On Fri, Apr 12, 2019 at 8:27 AM Mark Adams via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
>
>>
>>
>> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>>
>>>
>>>
>>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
>>> petsc-dev at mcs.anl.gov> wrote:
>>> >
>>> > Interesting, nice work.
>>> >
>>> > It would be interesting to get the flop counters working.
>>> >
>>> > This looks like GMG, I assume 3D.
>>> >
>>> > The degree of parallelism is not very realistic. You should probably
>>> run a 10x smaller problem, at least, or use 10x more processes.
>>>
>>>    Why do you say that? He's got his machine with a certain amount of
>>> physical memory per node; are you saying he should ignore/not use 90% of
>>> that physical memory for his simulation?
>>
>>
>> In my experience 1.5M equations/process is about 50x more than applications
>> run, but this is just anecdotal. Some apps are dominated by the linear
>> solver in terms of memory, but some apps use a lot of memory in the physics
>> parts of the code.
>>
>> The one app that I can think of where the memory usage is dominated by
>> the solver does like 10 (pseudo) time steps with pretty hard nonlinear
>> solves, so in the end they are not bound by turnaround time. But they are
>> kind of an odd (academic) application and not very representative of what I
>> see in the broader comp sci community. And these guys do have a scalable
>> code, so instead of waiting a week on the queue to run a 10-hour job that
>> uses 10% of the machine, they wait a day to run a 2-hour job that takes 50%
>> of the machine, because the centers' scheduling policies work that way.
>>
>>> He should buy a machine 10x bigger just because it means having fewer
>>> degrees of freedom per node (who's footing the bill for this purchase?). At
>>> INL they run simulations for a purpose, not just for scalability studies,
>>> and there are no dang GPUs or barely used over-sized monstrosities sitting
>>> around to brag about twice a year at SC.
>>>
>>
>> I guess they are the nuke guys. I've never worked with them or seen this
>> kind of complexity analysis in their talks, but OK, if they fill up memory
>> with the solver then this is representative of a significant (DOE) app.
>>
>>
>>>
>>>    Barry
>>>
>>>
>>>
>>> > I guess it does not matter. This is basically like a one-node run because
>>> the subdomains are so large.
>>> >
>>> > And are you sure the numerics are the same with and without hypre?
>>> Hypre is 15x slower. Any ideas what is going on?
>>> >
>>> > It might be interesting to scale this test down to a node to see if
>>> this is from communication.
>>> >
>>> > Again, nice work,
>>> > Mark
>>> >
>>> >
>>> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong.jd at gmail.com>
>>> wrote:
>>> > Hi Developers,
>>> >
>>> > I just want to share some good news. It is known that the PETSc
>>> "ptap-scalable" algorithm takes too much memory for some applications
>>> because it needs to build intermediate data structures. Following Mark's
>>> suggestion, I implemented the all-at-once algorithm, which does not cache
>>> any intermediate data.
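
To make the memory point concrete: the conventional two-pass product assembles
and stores the intermediate matrix A*P before the second multiplication. A
small scipy illustration with stand-in matrices (not the PETSc code):

    import scipy.sparse as sp

    A = sp.random(200, 200, density=0.05, format='csr')   # stand-in fine-grid operator
    P = sp.random(200, 40, density=0.10, format='csr')    # stand-in prolongator
    AP = A @ P      # intermediate product A*P, fully assembled and kept in memory
    C = P.T @ AP    # second product forms the coarse operator C = P^T * A * P

The all-at-once loop sketched earlier on this page never forms AP; it only
keeps one sparse row of A*P at a time.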
>>> >
>>> > I did some comparisons: the new implementation is actually scalable in
>>> terms of memory usage and compute time, even though it is still slower
>>> than "ptap-scalable". Some memory profiling results are attached. The new
>>> all-at-once implementation uses a similar amount of memory to hypre, but
>>> it is way faster than hypre.
>>> >
>>> > For example, for a problem with 14,893,346,880 unknowns on 10,000
>>> processor cores, here are the timing results:
>>> >
>>> > Hypre algorithm:
>>> >
>>> > MatPtAP               50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>> > MatPtAPSymbolic       50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> > MatPtAPNumeric        50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
>>> >
>>> > PETSc scalable PtAP:
>>> >
>>> > MatPtAP               50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>> > MatPtAPSymbolic       50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9     0
>>> > MatPtAPNumeric        50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>> >
>>> > New implementation of the all-at-once algorithm:
>>> >
>>> > MatPtAP               50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17     0
>>> > MatPtAPSymbolic       50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6     0
>>> > MatPtAPNumeric        50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11     0
>>> >
>>> >
>>> > You can see that the all-at-once algorithm is a bit slower than
>>> ptap-scalable, but it uses much less memory.
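
A note on reading these logs: the flop column is 0.00e+00 for the hypre and
all-at-once rows, which is what Mark's remark about getting the flop counters
working refers to. The hypre and scalable runs above were presumably selected
at runtime with PETSc's -matptap_via option together with -log_view, roughly
as below (the executable name is a placeholder, and the option value for the
new all-at-once code is not settled, since it has not been merged yet):

    mpiexec -n 10000 ./your_app -matptap_via scalable -log_view
    mpiexec -n 10000 ./your_app -matptap_via hypre -log_view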
>>> >
>>> >
>>> > Fande
>>> >
>>>
>>>