[petsc-users] Poor speed up for KSP example 45

Amin Sadeghi aminthefresh at gmail.com
Wed Mar 25 17:04:03 CDT 2020


That's great. Thanks for creating this great piece of software!

Amin

On Wed, Mar 25, 2020 at 5:56 PM Matthew Knepley <knepley at gmail.com> wrote:

> On Wed, Mar 25, 2020 at 5:41 PM Amin Sadeghi <aminthefresh at gmail.com>
> wrote:
>
>> Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
>> have higher memory bandwidth than Compute Canada's Graham (perhaps a more
>> modern CPU architecture, although I'm not familiar with which hardware
>> factors affect memory bandwidth).
>>
>> Mark, I did as you suggested. As you suspected, running make streams
>> yielded the same results, indicating that the memory bandwidth saturated at
>> around 8 MPI processes. I ran the experiment on multiple nodes but only
>> requested 8 cores per node, and here are the results:
>>
>> 1 node (8 cores total): 17.5s, 6X speedup
>> 2 nodes (16 cores total): 13.5s, 7X speedup
>> 3 nodes (24 cores total): 9.4s, 10X speedup
>> 4 nodes (32 cores total): 8.3s, 12X speedup
>> 5 nodes (40 cores total): 7.0s, 14X speedup
>> 6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
>> 7 nodes (56 cores total): 4.3s, 23X speedup
>> 8 nodes (64 cores total): 3.7s, 27X speedup
>>
>> *Note:* As you can see, the 6-node experiment showed extremely poor
>> scaling; I guess it was an outlier, maybe due to some connection
>> problem?
>>
>> I also ran another experiment, requesting 2 full nodes, i.e. 64 cores,
>> and here's the result:
>>
>> 2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]
>>
>> So, it turns out that for a fixed number of cores (64 in our case), much
>> better speedups can be achieved (27X vs. 16X) if the cores are distributed
>> among more nodes.
>>
>> Anyway, I really appreciate all your input.
>>
>> *One final question:* From what I understand from Mark's comment, PETSc
>> at the moment is blind to the memory hierarchy. Is it feasible to make
>> PETSc aware of inter- and intra-node communication so that partitioning is
>> done to maximize performance? Or, to put it differently, is this something
>> that the PETSc devs have their eyes on for the future?
>>
>
> There is already stuff in VecScatter that knows about the memory
> hierarchy, which Junchao put in. We are actively working on some other
> node-aware algorithms.
>
>   Thanks,
>
>      Matt
>
>
>> Sincerely,
>> Amin
>>
>>
>> On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>> I repeated your experiment on one node of TACC Frontera:
>>> 1 rank: 85.0s
>>> 16 ranks: 8.2s, 10x speedup
>>> 32 ranks: 5.7s, 15x speedup
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> Also, a better test is to see where streams pretty much saturates, then
>>>> run that many processes per node and do the same test while increasing
>>>> the number of nodes. This will tell you how well your network
>>>> communication is doing.
>>>>
>>>> But this result lumps a lot of stuff into "network communication" that
>>>> can be further evaluated. The worst thing about this, I would think, is
>>>> that the partitioning is blind to the memory hierarchy of inter- and
>>>> intra-node communication. The next thing to do is to run with an initial
>>>> grid that puts one cell per node and then do uniform refinement until you
>>>> have one cell per process (e.g., one refinement step with 8 processes per
>>>> node), partition to get one cell per process, and then do uniform
>>>> refinement to get a reasonably sized local problem. Alas, this is not
>>>> easy to do, but it is doable.
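>>>>
>>>> For what it's worth, a rough, untested sketch of that refine /
>>>> repartition / refine pipeline with DMPlex could look something like the
>>>> following. The coarse grid dimensions (faces) and the refinement counts
>>>> (nPre, nPost) are hypothetical placeholders, and the DMPlexCreateBoxMesh
>>>> arguments follow the current release; that signature changes between
>>>> PETSc versions.
>>>>
>>>>   #include <petscdmplex.h>
>>>>
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>     DM             dm, dmNext;
>>>>     PetscInt       faces[3] = {2, 2, 2}; /* coarse grid: ~one cell per compute node */
>>>>     PetscInt       r, nPre = 1, nPost = 4; /* hypothetical refinement counts */
>>>>     PetscErrorCode ierr;
>>>>
>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>>>     ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 3, PETSC_FALSE, faces, NULL,
>>>>                                NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);
>>>>     /* uniform refinement until there is roughly one cell per MPI process
>>>>        (one 3-D step turns 1 cell per node into 8, i.e. 8 ranks per node) */
>>>>     for (r = 0; r < nPre; r++) {
>>>>       ierr = DMRefine(dm, PETSC_COMM_WORLD, &dmNext);CHKERRQ(ierr);
>>>>       ierr = DMDestroy(&dm);CHKERRQ(ierr);
>>>>       dm   = dmNext;
>>>>     }
>>>>     /* repartition so that each process owns (roughly) one cell */
>>>>     ierr = DMPlexDistribute(dm, 0, NULL, &dmNext);CHKERRQ(ierr);
>>>>     if (dmNext) { ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmNext; }
>>>>     /* further uniform refinement to get a reasonable local problem size */
>>>>     for (r = 0; r < nPost; r++) {
>>>>       ierr = DMRefine(dm, PETSC_COMM_WORLD, &dmNext);CHKERRQ(ierr);
>>>>       ierr = DMDestroy(&dm);CHKERRQ(ierr);
>>>>       dm   = dmNext;
>>>>     }
>>>>     ierr = DMViewFromOptions(dm, NULL, "-dm_view");CHKERRQ(ierr);
>>>>     ierr = DMDestroy(&dm);CHKERRQ(ierr);
>>>>     ierr = PetscFinalize();
>>>>     return ierr;
>>>>   }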
>>>>
>>>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> I would guess that you are saturating the memory bandwidth. After
>>>>> you make PETSc (make all) it will suggest that you test it (make test) and
>>>>> suggest that you run streams (make streams).
>>>>>
>>>>> I see Matt answered, but let me add that when you make streams you will
>>>>> see the memory rate for 1, 2, 3, ... NP processes. If your machine is
>>>>> decent you should see very good speedup at the beginning and then it
>>>>> will start to saturate. You are seeing about 50% of perfect speedup at
>>>>> 16 processes. I would expect that you will see something similar with
>>>>> streams. Without knowing your machine, your results look typical.
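>>>>>
>>>>> If it helps to see what streams actually measures, here is a minimal,
>>>>> untested STREAM-triad-style probe in plain C + MPI (not the benchmark
>>>>> that ships with PETSc; the array size N and the summed aggregate rate
>>>>> are arbitrary choices for illustration):
>>>>>
>>>>>   #include <mpi.h>
>>>>>   #include <stdio.h>
>>>>>   #include <stdlib.h>
>>>>>
>>>>>   #define N 10000000 /* ~80 MB per array per rank */
>>>>>
>>>>>   int main(int argc, char **argv)
>>>>>   {
>>>>>     double *a, *b, *c, scalar = 3.0, t, rate, total;
>>>>>     int     i, rank, size;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>     a = malloc(N * sizeof(double));
>>>>>     b = malloc(N * sizeof(double));
>>>>>     c = malloc(N * sizeof(double));
>>>>>     for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
>>>>>
>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>     t = MPI_Wtime();
>>>>>     for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; /* triad */
>>>>>     t = MPI_Wtime() - t;
>>>>>
>>>>>     rate = 3.0 * N * sizeof(double) / t / 1.0e6; /* MB/s on this rank */
>>>>>     MPI_Reduce(&rate, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>>>>>     if (!rank) printf("%d ranks: ~%.0f MB/s aggregate (check %.1f)\n",
>>>>>                       size, total, a[N/2]);
>>>>>     free(a); free(b); free(c);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>> Running it with an increasing number of ranks per node should show the
>>>>> aggregate rate flattening out; that plateau is the saturation point
>>>>> streams reports.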
>>>>>
>>>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefresh at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>>>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>>>>> during KSP.solve:
>>>>>>
>>>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>>>
>>>>>> Since the problem size is large enough (8M unknowns), I expected a
>>>>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>>>>> can it be improved?
>>>>>>
>>>>>> I've attached three log files for more details.
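>>>>>>
>>>>>> In case it is useful, the kind of timing comparison I mean could be
>>>>>> reproduced with a rough, untested sketch like the one below, which
>>>>>> times only KSPSolve. It uses a 1-D Laplacian as a stand-in operator
>>>>>> rather than the actual ex45 problem, and the matrix size n is a
>>>>>> placeholder.
>>>>>>
>>>>>>   #include <petscksp.h>
>>>>>>   #include <petsctime.h>
>>>>>>
>>>>>>   int main(int argc, char **argv)
>>>>>>   {
>>>>>>     Mat            A;
>>>>>>     Vec            x, b;
>>>>>>     KSP            ksp;
>>>>>>     PetscInt       i, rstart, rend, n = 1000000;
>>>>>>     PetscLogDouble t0, t1;
>>>>>>     PetscErrorCode ierr;
>>>>>>
>>>>>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>>>>>     /* assemble a tridiagonal (1-D Laplacian) stand-in operator */
>>>>>>     ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n,
>>>>>>                         3, NULL, 1, NULL, &A);CHKERRQ(ierr);
>>>>>>     ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>>>>>>     for (i = rstart; i < rend; i++) {
>>>>>>       if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>       if (i < n-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>>>>>       ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>>>>>     }
>>>>>>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>>>     ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
>>>>>>     ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>>>>>>     ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
>>>>>>     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>>>>>>
>>>>>>     /* wall-clock time for the solve only */
>>>>>>     ierr = PetscTime(&t0);CHKERRQ(ierr);
>>>>>>     ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>>>>>>     ierr = PetscTime(&t1);CHKERRQ(ierr);
>>>>>>     ierr = PetscPrintf(PETSC_COMM_WORLD, "KSPSolve: %g s\n",
>>>>>>                        (double)(t1 - t0));CHKERRQ(ierr);
>>>>>>
>>>>>>     ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>>>>>>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>>>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>>>>>>     ierr = VecDestroy(&b);CHKERRQ(ierr);
>>>>>>     ierr = PetscFinalize();
>>>>>>     return ierr;
>>>>>>   }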
>>>>>>
>>>>>> Sincerely,
>>>>>> Amin
>>>>>>
>>>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>