[petsc-users] Bad memory scaling with PETSc 3.10

Fande Kong fdkong.jd at gmail.com
Fri May 3 11:21:59 CDT 2019


I have some data from my own simulations. The results do not look bad.

The following are strong-scaling results obtained with "-matptap_via
allatonce -mat_freeintermediatedatastructures 1".

Problem 1 has 2,482,224,480 unknowns and uses 4000, 6000, 10000, and 12000
processor cores:

 4000  processor cores:  587M
 6000  processor cores:  270M
10000  processor cores:  251M
12000  processor cores:  136M

Problem 2 has 7,446,673,440 unknowns and uses 6000, 10000, and 12000
processor cores:

 6000  processor cores:  975M
10000  processor cores:  599M
12000  processor cores:  415M

The reported memory is for PtAP only; I do not include the memory used by
the other parts of the simulation.
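
In case it helps, this is roughly how I isolate the PtAP memory (a minimal
sketch, not the exact code from my simulations; A and P are assumed to be
already assembled MPIAIJ matrices, and in an optimized build you may need to
run with -malloc so that PETSc tracks its allocations, or switch to
PetscMemoryGetCurrentUsage to look at the resident set size instead):

  #include <petscmat.h>

  /* Measure the memory taken by one PtAP product: sample PETSc's current
     malloc usage before and after forming C = P^T * A * P, then destroy C. */
  static PetscErrorCode ReportPtAPMemory(Mat A, Mat P)
  {
    Mat            C;
    PetscLogDouble before, after;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = PetscMallocGetCurrentUsage(&before);CHKERRQ(ierr);
    ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    ierr = PetscMallocGetCurrentUsage(&after);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "PtAP memory on rank 0: %g MB\n",
                       (double)((after - before)/(1024.0*1024.0)));CHKERRQ(ierr);
    ierr = MatDestroy(&C);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }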

I am sorry we have not resolved the issue for you so far. I will try to run
the example you attached earlier to see if we can reproduce it. If we can
reproduce the problem, I will use a memory profiling tool to check where
the memory comes from.

Thanks again for your report,

Fande,


On Fri, May 3, 2019 at 9:26 AM Fande Kong <fdkong.jd at gmail.com> wrote:

> Thanks for your plots.
>
> The new algorithms should be scalable in terms of memory usage. I am
> puzzled by these plots since the memory usage increases exponentially.
> Could it come from somewhere else? How do you measure the memory? Is it
> for the entire simulation or just for PtAP? Could you measure the memory
> for PtAP only? Maybe several factors affect the memory usage, not only
> PtAP.
>
>  I will grab some data from my own simulations.
>
> Are you running ex43?
>
> Fande,
>
>
>
> On Fri, May 3, 2019 at 8:14 AM Myriam Peyrounette <
> myriam.peyrounette at idris.fr> wrote:
>
>> And the attached files... Sorry
>>
>> On 05/03/19 at 16:11, Myriam Peyrounette wrote:
>>
>> Hi,
>>
>> I plotted new scalings (memory and time) using the new algorithms. I used
>> the option *-options_left true* to make sure that the options are
>> effectively used. They are.
>>
>> I don't have access to the platform I used to run my computations on, so
>> I ran them on a different one. In particular, I can't reach problem size =
>> 1e8 and the values might be different from the previous scalings I sent
>> you. But the comparison of the PETSc versions and options is still
>> relevant.
>>
>> I plotted the reference scalings: the "good" one (PETSc 3.6.4) in green
>> and the "bad" one (PETSc 3.10.2) in blue.
>>
>> I used the commit d330a26 (3.11.1) for all the other scalings, adding
>> different sets of options (the yellow set is also written out as an
>> options file just after this list):
>>
>> *Light blue*: -matptap_via allatonce -mat_freeintermediatedatastructures 1
>> *Orange*:     -matptap_via allatonce_merged -mat_freeintermediatedatastructures 1
>> *Purple*:     -matptap_via allatonce -mat_freeintermediatedatastructures 1
>>               -inner_diag_matmatmult_via scalable -inner_offdiag_matmatmult_via scalable
>> *Yellow*:     -matptap_via allatonce_merged -mat_freeintermediatedatastructures 1
>>               -inner_diag_matmatmult_via scalable -inner_offdiag_matmatmult_via scalable
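>>
>> For reference, a set like the yellow one can also be gathered in a PETSc
>> options file and passed with -options_file (just a sketch; the file name
>> is arbitrary):
>>
>>   # yellow.opts: merged all-at-once PtAP + scalable inner MatMatMult
>>   -matptap_via allatonce_merged
>>   -mat_freeintermediatedatastructures 1
>>   -inner_diag_matmatmult_via scalable
>>   -inner_offdiag_matmatmult_via scalable
>>   -options_left true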
>>
>> Conclusion: with regard to memory, the two algorithms give a similarly
>> good improvement in the scaling. Using the -inner_(off)diag_matmatmult_via
>> options is also very beneficial. The scaling is still not as good as with
>> 3.6.4, though.
>> With regard to time, I noted a real improvement in execution time! I used
>> to spend 200-300s on these runs; now they take 10-15s. Besides that, the
>> "_merged" versions are more efficient, and the
>> -inner_(off)diag_matmatmult_via options are slightly more expensive, but
>> not critically so.
>>
>> What do you think? Is it possible to match the scaling of PETSc 3.6.4
>> again? Is it worth investigating further?
>>
>> Myriam
>>
>>
>> On 04/30/19 at 17:00, Fande Kong wrote:
>>
>> Hi Myriam,
>>
>> We are interested in how the new algorithms perform. There are two new
>> algorithms you could try.
>>
>> Algorithm 1:
>>
>> -matptap_via allatonce  -mat_freeintermediatedatastructures 1
>>
>> Algorithm 2:
>>
>> -matptap_via allatonce_merged -mat_freeintermediatedatastructures 1
>>
>>
>> Note that you need to use the current petsc-master, and please also put
>> "-snes_view" in your script so that we can confirm these options actually
>> get set.
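>>
>> For example, a run with Algorithm 1 could look like the following (the
>> executable name, core count, and preconditioner option are placeholders
>> for your own setup):
>>
>>   mpiexec -n 128 ./your_app -pc_type gamg \
>>     -matptap_via allatonce -mat_freeintermediatedatastructures 1 \
>>     -snes_view -options_left
>>
>> "-options_left" is not required, but it reports any option that was not
>> used, which helps catch typos in the option names.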
>>
>> Thanks,
>>
>> Fande,
>>
>>
>> On Tue, Apr 30, 2019 at 2:26 AM Myriam Peyrounette via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>> Hi,
>>>
>>> that's really good news for us, thanks! I will plot the memory scaling
>>> again using these new options and let you know, hopefully next week.
>>>
>>> Before that, I just need to clarify the situation. Throughout our
>>> discussions, we mentioned a number of options concerning scalability:
>>>
>>> -matptap_via scalable
>>> -inner_diag_matmatmult_via scalable
>>> -inner_offdiag_matmatmult_via scalable
>>> -mat_freeintermediatedatastructures
>>> -matptap_via allatonce
>>> -matptap_via allatonce_merged
>>>
>>> Which of them are compatible? Should I use all of them at the same time?
>>> Is there any redundancy?
>>>
>>> Thanks,
>>>
>>> Myriam
>>>
>>> On 04/25/19 at 21:47, Zhang, Hong wrote:
>>>
>>> Myriam:
>>> Checking MatPtAP() in petsc-3.6.4, I realized that it uses a different
>>> algorithm than petsc-3.10 and later versions. petsc-3.6 uses an outer
>>> product for C = P^T * A * P, while petsc-3.10 uses a local transpose of P.
>>> petsc-3.10 accelerates data access, but doubles the memory for P.
>>>
>>> Fande added two new implementations of MatPtAP() to petsc-master, which
>>> use much less memory and scale better, at the cost of slightly higher
>>> computing time (still faster than hypre, though). You may use these new
>>> implementations if you have concerns about memory scalability. The options
>>> for these new implementations are:
>>> -matptap_via allatonce
>>> -matptap_via allatonce_merged
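>>>
>>> If you prefer to set these from your code rather than the command line,
>>> something like the following should work (just a sketch, untested; A, P,
>>> and C are your own Mat variables):
>>>
>>>   /* select the new all-at-once PtAP and free its intermediate data */
>>>   ierr = PetscOptionsSetValue(NULL,"-matptap_via","allatonce");CHKERRQ(ierr);
>>>   ierr = PetscOptionsSetValue(NULL,"-mat_freeintermediatedatastructures","1");CHKERRQ(ierr);
>>>   ierr = MatPtAP(A,P,MAT_INITIAL_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);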
>>>
>>> Hong
>>>
>>> On Mon, Apr 15, 2019 at 12:10 PM hzhang at mcs.anl.gov <hzhang at mcs.anl.gov>
>>> wrote:
>>>
>>>> Myriam:
>>>> Thank you very much for providing these results!
>>>> I put effort into accelerating the execution time and avoiding the use
>>>> of global sizes in PtAP; the algorithm there transposes P_local and
>>>> P_other, which likely doubles the memory usage. I'll try to investigate
>>>> why it becomes unscalable.
>>>> Hong
>>>>
>>>>> Hi,
>>>>>
>>>>> you'll find the new scaling attached (green line). I used version 3.11
>>>>> and the four scalability options:
>>>>> -matptap_via scalable
>>>>> -inner_diag_matmatmult_via scalable
>>>>> -inner_offdiag_matmatmult_via scalable
>>>>> -mat_freeintermediatedatastructures
>>>>>
>>>>> The scaling is much better! The code even uses less memory for the
>>>>> smallest cases. There is still an increase for the larger one.
>>>>>
>>>>> With regard to the time scaling, I used KSPView and LogView on the two
>>>>> previous scalings (blue and yellow lines) but not on the last one (green
>>>>> line). So we can't really compare them, am I right? However, we can see
>>>>> that the new time scaling looks quite good. It slightly increases from ~8s
>>>>> to ~27s.
>>>>>
>>>>> Unfortunately, the computations are expensive, so I would like to avoid
>>>>> re-running them if possible. How relevant would a proper time scaling be
>>>>> for you?
>>>>>
>>>>> Myriam
>>>>>
>>>>> On 04/12/19 at 18:18, Zhang, Hong wrote:
>>>>>
>>>>> Myriam:
>>>>> Thanks for your effort. It will help us improve PETSc.
>>>>> Hong
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I used the wrong script, that's why it diverged... Sorry about that.
>>>>>> I tried again with the right script applied to a tiny problem (~200
>>>>>> elements). I can see a small difference in memory usage (a gain of
>>>>>> ~1 MB) when adding the -mat_freeintermediatedatastructures option. I
>>>>>> still have to execute larger cases to plot the scaling. The
>>>>>> supercomputer I usually run my jobs on is really busy at the moment,
>>>>>> so it takes a while. I hope I'll send you the results on Monday.
>>>>>>
>>>>>> Thanks everyone,
>>>>>>
>>>>>> Myriam
>>>>>>
>>>>>>
>>>>>> On 04/11/19 at 06:01, Jed Brown wrote:
>>>>>> > "Zhang, Hong" <hzhang at mcs.anl.gov> writes:
>>>>>> >
>>>>>> >> Jed:
>>>>>> >>>> Myriam,
>>>>>> >>>> Thanks for the plot. '-mat_freeintermediatedatastructures' should
>>>>>> >>>> not affect the solution. It releases almost half of the memory in
>>>>>> >>>> C=PtAP if C is not reused.
>>>>>> >>> And yet if turning it on causes divergence, that would imply a
>>>>>> >>> bug.
>>>>>> >>> Hong, are you able to reproduce the experiment to see the memory
>>>>>> >>> scaling?
>>>>>> >> I'd like to test her code using an ALCF machine, but my hands are
>>>>>> >> full now. I'll try it as soon as I find time, hopefully next week.
>>>>>> > I have now compiled and run her code locally.
>>>>>> >
>>>>>> > Myriam, thanks for your last mail adding configuration and removing
>>>>>> > the MemManager.h dependency.  I ran with and without
>>>>>> > -mat_freeintermediatedatastructures and don't see a difference in
>>>>>> > convergence.  What commands did you run to observe that difference?
>>>>>>
>> --
>> Myriam Peyrounette
>> CNRS/IDRIS - HLST
>> --
>>
>>