[petsc-users] Poor speed up for KSP example 45

Matthew Knepley knepley at gmail.com
Wed Mar 25 16:55:57 CDT 2020


On Wed, Mar 25, 2020 at 5:41 PM Amin Sadeghi <aminthefresh at gmail.com> wrote:

> Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
> have higher memory bandwidth (maybe a more modern CPU architecture, though
> I'm not familiar with which hardware features affect memory bandwidth) than
> Compute Canada's Graham.
>
> Mark, I did as you suggested. As you suspected, running make streams
> showed the same behavior, indicating that memory bandwidth saturates at
> around 8 MPI processes. I then ran the experiment on multiple nodes but
> requested only 8 cores per node, and here are the results:
>
> 1 node (8 cores total): 17.5s, 6X speedup
> 2 nodes (16 cores total): 13.5s, 7X speedup
> 3 nodes (24 cores total): 9.4s, 10X speedup
> 4 nodes (32 cores total): 8.3s, 12X speedup
> 5 nodes (40 cores total): 7.0s, 14X speedup
> 6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
> 7 nodes (56 cores total): 4.3s, 23X speedup
> 8 nodes (64 cores total): 3.7s, 27X speedup
>
> *Note:* as you can see, the experiment with 6 nodes showed extremely poor
> scaling, which I guess was an outlier, maybe due to some connection problem?
>
> I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and
> here's the result:
>
> 2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]
>
> So, it turns out that for a fixed number of cores, i.e. 64 in our case,
> much better speedups (27X vs. 16X) can be achieved if the processes are
> spread across more nodes rather than packed onto fewer nodes.
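>
> In case it is useful, the multi-node runs above were submitted with roughly
> the following Slurm batch script (a minimal sketch, assuming Graham's Slurm
> scheduler; the walltime is a placeholder and the ex45 options stand in for
> whatever problem-size options were used before):
>
>     #!/bin/bash
>     #SBATCH --nodes=8              # varied from 1 to 8 for the table above
>     #SBATCH --ntasks-per-node=8    # 8 MPI ranks per node (streams saturation point)
>     #SBATCH --time=00:30:00        # placeholder walltime
>
>     srun ./ex45 -log_view          # same solver/grid options as in the earlier runs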
>
> Anyways, I really appreciate all your inputs.
>
> *One final question:* From what I understand from Mark's comment, PETSc
> at the moment is blind to the memory hierarchy. Is it feasible to make PETSc
> aware of inter- and intra-node communication so that the partitioning is
> done to maximize performance? Or, to put it differently, is this something
> that the PETSc devs have their eyes on for the future?
>

There is already stuff in VecScatter that knows about the memory hierarchy,
which Junchao put in. We are actively working on some other node-aware
algorithms.

  Thanks,

     Matt


> Sincerely,
> Amin
>
>
> On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> I repeated your experiment on one node of TACC Frontera:
>> 1 rank: 85.0s
>> 16 ranks: 8.2s, 10x speedup
>> 32 ranks: 5.7s, 15x speedup
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> Also, a better test is to see where streams pretty much saturates, then run
>>> that many processes per node and repeat the test while increasing the number
>>> of nodes. This will tell you how well your network communication is doing.
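>>>
>>> For example (a sketch; this is Open MPI's mapping syntax, and other MPIs or
>>> schedulers have their own equivalents):
>>>
>>>     # keep 8 ranks per node (the streams saturation point) and scale nodes
>>>     mpiexec -n 16 --map-by ppr:8:node ./ex45 -log_view
>>>     mpiexec -n 32 --map-by ppr:8:node ./ex45 -log_view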
>>>
>>> But this result has a lot of stuff lumped into "network communication" that
>>> can be further evaluated. The worst part, I would think, is that the
>>> partitioning is blind to the memory hierarchy of inter- and intra-node
>>> communication. The next thing to do is to start with an initial grid that
>>> puts one cell per node, do uniform refinement until you have one cell per
>>> process (e.g., one refinement step with 8 processes per node), partition to
>>> get one cell per process, and then do further uniform refinement to get a
>>> reasonably sized local problem. Alas, this is not easy to do, but it is
>>> doable.
>>>
>>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> I would guess that you are saturating the memory bandwidth. After
>>>> you make PETSc (make all) it will suggest that you test it (make test) and
>>>> suggest that you run streams (make streams).
>>>>
>>>> I see Matt answered, but let me add that when you run make streams you will
>>>> see the memory rate for 1, 2, 3, ... NP processes. If your machine is decent
>>>> you should see very good speedup at the beginning and then it will start
>>>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>>>> would expect that you will see something similar with streams. Without
>>>> knowing your machine, your results look typical.
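>>>>
>>>> Roughly like this (a sketch; run from the PETSc source tree with PETSC_DIR
>>>> and PETSC_ARCH set, and NPMAX just caps how many processes are tried):
>>>>
>>>>     cd $PETSC_DIR
>>>>     make streams NPMAX=32
>>>>
>>>> The reported bandwidth should climb quickly for the first few processes and
>>>> then flatten out; that plateau is the limit the solver runs into.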
>>>>
>>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefresh at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>>>> during KSP.solve:
>>>>>
>>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>>
>>>>> Since the problem size is large enough (8M unknowns), I expected a
>>>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>>>> can it be improved?
>>>>>
>>>>> I've attached three log files for more details.
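>>>>>
>>>>> For reference, each case was launched roughly like this (a sketch: the
>>>>> -da_refine value is only a placeholder for whatever produces the ~8M-unknown
>>>>> grid, and the KSP solve time can be read from the -log_view summary):
>>>>>
>>>>>     mpiexec -n 32 ./ex45 -da_refine 5 -log_view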
>>>>>
>>>>> Sincerely,
>>>>> Amin
>>>>>
>>>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/