[petsc-users] Poor speed up for KSP example 45

Mark Adams mfadams at lbl.gov
Wed Mar 25 18:18:32 CDT 2020


On Wed, Mar 25, 2020 at 6:40 PM Fande Kong <fdkong.jd at gmail.com> wrote:

>
>
> On Wed, Mar 25, 2020 at 12:18 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Also, a better test is to see where streams pretty much saturates, then run
>> that many processes per node and repeat the test while increasing the number
>> of nodes. This will tell you how well your network communication is doing.
>>
>> But this result has a lot of stuff in "network communication" that can be
>> further evaluated. The worst thing about this, I would think, is that the
>> partitioning is blind to the memory hierarchy of inter- and intra-node
>> communication.
>>
>
> Hierarchical partitioning was designed for this purpose.
> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH
>
>
That's fantastic!


> Fande,
>
>
>> The next thing to do is run with an initial grid that puts one cell per
>> node and then do uniform refinement until you have one cell per process
>> (e.g., one refinement step using 8 processes per node), partition to get one
>> cell per process, then do uniform refinement to get a reasonably sized
>> local problem. Alas, this is not easy to do, but it is doable.
>>
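A minimal sketch of the refinement arithmetic in the paragraph above. It
assumes isotropic uniform refinement, where each cell splits into 2**dim
children (8 in 3D); the function name and the dim parameter are illustrative
and are not part of PETSc.

```python
def refinement_steps(procs_per_node, dim=3):
    """Count uniform refinement steps needed to go from one cell per node
    to at least one cell per process.

    Assumption: each step splits every cell into 2**dim children.
    """
    children = 2 ** dim
    steps, cells = 0, 1
    while cells < procs_per_node:
        cells *= children
        steps += 1
    return steps, cells


# 8 processes per node in 3D: one step gives exactly 8 cells per node.
print(refinement_steps(8))
# 32 processes per node in 3D: two steps give 64 cells per node, which is
# more cells than processes, so the match is no longer one-to-one.
print(refinement_steps(32))
```

When the process count per node is not a power of 2**dim, the cell count
overshoots, which is one reason the procedure is awkward in practice.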
>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> I would guess that you are saturating the memory bandwidth. After
>>> you make PETSc (make all) it will suggest that you test it (make test) and
>>> suggest that you run streams (make streams).
>>>
>>> I see Matt answered but let me add that when you make streams you will
>>> see the memory rate for 1, 2, 3, ... NP processes. If your machine is decent
>>> you should see very good speedup at the beginning and then it will start
>>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>>> would expect that you will see something similar with streams. Without
>>> knowing your machine, your results look typical.
>>>
>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefresh at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>>> during KSP.solve:
>>>>
>>>> - 1 MPI process: ~98 sec, speedup: 1X
>>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>>
>>>> Since the problem size is large enough (8M unknowns), I expected a
>>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>>> can it be improved?
>>>>
>>>> I've attached three log files for more details.
>>>>
>>>> Sincerely,
>>>> Amin
>>>>
>>>
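For reference, a short sketch of the speedup and parallel-efficiency
arithmetic behind the numbers quoted in this thread. The timings are the
approximate KSP solve times from the original post; the function name is
illustrative only.

```python
def speedup_and_efficiency(t_serial, t_parallel, nprocs):
    """Return (speedup, parallel efficiency) relative to the 1-process run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / nprocs


# Approximate KSP solve times in seconds, keyed by MPI process count.
timings = {1: 98.0, 16: 12.0, 32: 11.0}

for nprocs, t in timings.items():
    s, e = speedup_and_efficiency(timings[1], t, nprocs)
    print(f"{nprocs:2d} processes: speedup {s:.1f}x, efficiency {e:.0%}")
```

Efficiency is roughly 51% at 16 processes and 28% at 32, which matches the
"about 50% of perfect speedup" remark and the flattening between 16 and 32.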