[petsc-users] Parallel efficiency of the gmres solver with ASM

Lei Shi stoneszone at gmail.com
Thu Jun 25 15:48:06 CDT 2015


Hi Justin,

Thanks for your suggestion. I will test it asap.

Another thing that confuses me is that the wall-clock time with 2 cores is
almost the same as the serial run when I use ASM with -sub_ksp_type gmres
and ILU(0) on the subdomains. The serial run takes 11.95 sec and the 2-core
run takes 10.5 sec, so there is almost no speedup at all.

I also found that other people report similarly poor speedups when comparing
2 cores with 1 core. Attached is a slide from a presentation by J.A. Davis
that I found on the web. As you can see, ASM with 2 cores takes almost the
same CPU time as 1 core there too. Maybe I am misunderstanding something
fundamental about ASM.

cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
1      2           1.15E-04  11.95                        1        1
2      5           2.05E-02  10.5                         1.01     0.50
4      6           2.19E-02  7.64                         1.39     0.34
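
(Here efficiency = speedup / cores; e.g. 1.01 / 2 ≈ 0.50 for the 2-core
run.)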


Sincerely Yours,

Lei Shi
---------

On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com> wrote:

> Hi Lei,
>
> Depending on your machine and MPI library, you may have to use smart
> process-to-core/socket bindings to achieve better speedup. Instructions can
> be found here:
>
> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
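>
> For example, a sketch (the exact flags depend on the MPI launcher, and
> "./mysolver" is just a placeholder for your executable):
>
>   # Open MPI: pin each rank to its own core
>   mpiexec -n 2 --bind-to core ./mysolver <options>
>
>   # MPICH (Hydra launcher): equivalent binding
>   mpiexec -n 2 -bind-to core ./mysolver <options>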
>
>
> Justin
>
> On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
>
>> Hi Matt,
>>
>> Thanks for your suggestions. Here is the output from the STREAM test on
>> one node, which has 20 cores; I ran it with up to 20 processes. Attached
>> is the dumped output with your suggested options. I really appreciate
>> your help!
>>
>> Number of MPI processes 1
>> Function      Rate (MB/s)
>> Copy:       13816.9372
>> Scale:       8020.1809
>> Add:        12762.3830
>> Triad:      11852.5016
>>
>> Number of MPI processes 2
>> Function      Rate (MB/s)
>> Copy:       22748.7681
>> Scale:      14081.4906
>> Add:        18998.4516
>> Triad:      18303.2494
>>
>> Number of MPI processes 3
>> Function      Rate (MB/s)
>> Copy:       34045.2510
>> Scale:      23410.9767
>> Add:        30320.2702
>> Triad:      30163.7977
>>
>> Number of MPI processes 4
>> Function      Rate (MB/s)
>> Copy:       36875.5349
>> Scale:      29440.1694
>> Add:        36971.1860
>> Triad:      37377.0103
>>
>> Number of MPI processes 5
>> Function      Rate (MB/s)
>> Copy:       32272.8763
>> Scale:      30316.3435
>> Add:        38022.0193
>> Triad:      38815.4830
>>
>> Number of MPI processes 6
>> Function      Rate (MB/s)
>> Copy:       35619.8925
>> Scale:      34457.5078
>> Add:        41419.3722
>> Triad:      35825.3621
>>
>> Number of MPI processes 7
>> Function      Rate (MB/s)
>> Copy:       55284.2420
>> Scale:      47706.8009
>> Add:        59076.4735
>> Triad:      61680.5559
>>
>> Number of MPI processes 8
>> Function      Rate (MB/s)
>> Copy:       44525.8901
>> Scale:      48949.9599
>> Add:        57437.7784
>> Triad:      56671.0593
>>
>> Number of MPI processes 9
>> Function      Rate (MB/s)
>> Copy:       34375.7364
>> Scale:      29507.5293
>> Add:        45405.3120
>> Triad:      39518.7559
>>
>> Number of MPI processes 10
>> Function      Rate (MB/s)
>> Copy:       34278.0415
>> Scale:      41721.7843
>> Add:        46642.2465
>> Triad:      45454.7000
>>
>> Number of MPI processes 11
>> Function      Rate (MB/s)
>> Copy:       38093.7244
>> Scale:      35147.2412
>> Add:        45047.0853
>> Triad:      44983.2013
>>
>> Number of MPI processes 12
>> Function      Rate (MB/s)
>> Copy:       39750.8760
>> Scale:      52038.0631
>> Add:        55552.9503
>> Triad:      54884.3839
>>
>> Number of MPI processes 13
>> Function      Rate (MB/s)
>> Copy:       60839.0248
>> Scale:      74143.7458
>> Add:        85545.3135
>> Triad:      85667.6551
>>
>> Number of MPI processes 14
>> Function      Rate (MB/s)
>> Copy:       37766.2343
>> Scale:      40279.1928
>> Add:        49992.8572
>> Triad:      50303.4809
>>
>> Number of MPI processes 15
>> Function      Rate (MB/s)
>> Copy:       49762.3670
>> Scale:      59077.8251
>> Add:        60407.9651
>> Triad:      61691.9456
>>
>> Number of MPI processes 16
>> Function      Rate (MB/s)
>> Copy:       31996.7169
>> Scale:      36962.4860
>> Add:        40183.5060
>> Triad:      41096.0512
>>
>> Number of MPI processes 17
>> Function      Rate (MB/s)
>> Copy:       36348.3839
>> Scale:      39108.6761
>> Add:        46853.4476
>> Triad:      47266.1778
>>
>> Number of MPI processes 18
>> Function      Rate (MB/s)
>> Copy:       40438.7558
>> Scale:      43195.5785
>> Add:        53063.4321
>> Triad:      53605.0293
>>
>> Number of MPI processes 19
>> Function      Rate (MB/s)
>> Copy:       30739.4908
>> Scale:      34280.8118
>> Add:        40710.5155
>> Triad:      43330.9503
>>
>> Number of MPI processes 20
>> Function      Rate (MB/s)
>> Copy:       37488.3777
>> Scale:      41791.8999
>> Add:        49518.9604
>> Triad:      48908.2677
>> ------------------------------------------------
>> np  speedup
>> 1 1.0
>> 2 1.54
>> 3 2.54
>> 4 3.15
>> 5 3.27
>> 6 3.02
>> 7 5.2
>> 8 4.78
>> 9 3.33
>> 10 3.84
>> 11 3.8
>> 12 4.63
>> 13 7.23
>> 14 4.24
>> 15 5.2
>> 16 3.47
>> 17 3.99
>> 18 4.52
>> 19 3.66
>> 20 4.13
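>>
>> (The speedup column appears to be the Triad rate relative to one process:
>> e.g. 18303.2/11852.5 ≈ 1.54 for np=2 and 85667.7/11852.5 ≈ 7.23 for
>> np=13.)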
>>
>> Sincerely Yours,
>>
>> Lei Shi
>> ---------
>>
>> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>
>>> 1) In order to understand this, we have to disentangle the various
>>> effects. First, run the STREAMS benchmark:
>>>
>>>   make NPMAX=4 streams
>>>
>>> This will tell you the maximum speedup you can expect on this machine.
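>>>
>>> (A sketch of the run, assuming PETSC_DIR and PETSC_ARCH are set in your
>>> environment:
>>>
>>>   cd $PETSC_DIR
>>>   make NPMAX=4 streams
>>>
>>> Sparse solves are dominated by memory bandwidth, so the measured STREAMS
>>> speedup is a realistic ceiling for solver speedup.)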
>>>
>>> 2) For these test cases, also send the output of
>>>
>>>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
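>>>
>>> e.g. appended to your usual run line (the executable name below is just
>>> a placeholder):
>>>
>>>   mpiexec -n 2 ./mysolver <your options> \
>>>     -ksp_view -ksp_converged_reason -ksp_monitor_true_residual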
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>
>>>> I'm trying to improve the parallel efficiency of the GMRES solve in my
>>>> CFD solver, where PETSc's GMRES is used to solve the linear systems
>>>> generated by Newton's method. To test its efficiency, I started with a
>>>> very simple inviscid subsonic 3D flow as the first test case. The
>>>> parallel efficiency of the GMRES solve with ASM as the preconditioner
>>>> is very poor. The results are from our latest cluster. Right now, I'm
>>>> only looking at the wall-clock time of KSPSolve.
>>>>
>>>>    1. First I tested ASM with GMRES and ILU(0) on the subdomains; the
>>>>    CPU time with 2 cores is almost the same as the serial run. Here are
>>>>    the options for this case:
>>>>
>>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
>>>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>>>> -sub_pc_factor_fill 1.9
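>>>>
>>>> (If I read these options correctly, every outer GMRES iteration
>>>> triggers a full inner GMRES solve, preconditioned with ILU(0), to rtol
>>>> 1e-3 on each subdomain, so one outer iteration is much more expensive
>>>> than a single ILU(0) application. As I understand it, an iterative
>>>> inner solve also makes the preconditioner change between outer
>>>> iterations, which normally calls for a flexible outer method such as
>>>> -ksp_type fgmres.)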
>>>>
>>>> The iteration count increases a lot for the parallel runs.
>>>>
>>>> cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
>>>> 1      2           1.15E-04  11.95                        1        1
>>>> 2      5           2.05E-02  10.5                         1.01     0.50
>>>> 4      6           2.19E-02  7.64                         1.39     0.34
>>>>
>>>>       2. Then I tested ASM with ILU(0) as the preconditioner only; the
>>>> CPU time with 2 cores is better than in the first test, but the speedup
>>>> is still very poor. Here are the options I'm using:
>>>>
>>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>>>  -sub_pc_factor_fill 1.9
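>>>>
>>>> (If I understand the defaults correctly, without -sub_ksp_type the
>>>> subdomain solve defaults to preonly, i.e. a single ILU(0) application
>>>> per subdomain per outer iteration instead of a full inner solve.)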
>>>>
>>>> cores  iterations  err       petsc solve cpu time (s)  speedup  efficiency
>>>> 1      10          4.54E-04  10.68                     1        1
>>>> 2      11          9.55E-04  8.2                       1.30     0.65
>>>> 4      12          3.59E-04  5.26                      2.03     0.50
>>>>
>>>>    These results are from a third-order DG scheme on a very coarse 3D
>>>> mesh (480 elements). I believe I should get some speedup for this test
>>>> even on such a coarse mesh.
>>>>
>>>>   My question is: why does ASM with an inner subdomain solve take much
>>>> longer than ASM as a preconditioner only? The accuracy is also much
>>>> worse. I have tried increasing the ASM overlap to 2, but that made
>>>> things even worse.
>>>>
>>>>   If I use a larger mesh (~4000 elements), the second case with ASM as
>>>> the preconditioner gives me a better speedup, but it is still not very
>>>> good.
>>>>
>>>>
>>>> cores  iterations  err       petsc solve cpu time (s)  speedup  efficiency
>>>> 1      7           1.91E-02  97.32                     1        1
>>>> 2      7           2.07E-02  64.94                     1.5      0.74
>>>> 4      7           2.61E-02  36.97                     2.6      0.65
>>>>
>>>> Attached is the -log_summary output dumped from PETSc; any suggestions
>>>> are welcome. I really appreciate it.
>>>>
>>>>
>>>> Sincerely Yours,
>>>>
>>>> Lei Shi
>>>> ---------
>>>>
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot - 06252015 - 03:44:53 PM.png
Type: image/png
Size: 159230 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150625/434e1ad5/attachment-0001.png>

