[petsc-users] Parallel efficiency of the gmres solver with ASM
Lei Shi
stoneszone at gmail.com
Thu Jun 25 16:27:17 CDT 2015
Hi Justin,
I tested with mpirun --binding cpu:sockets .... Unfortunately, the results
are almost the same as before. Thanks.
-ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
-ksp_gmres_restart 30 -ksp_pc_side right
-pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0 -sub_pc_factor_fill
1.9
cores  iterations  err       petsc solve cpu time
1      10          4.54E-04  10.65
2      11          9.55E-04  8.19
4      12          3.59E-04  5.32
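
For reference, the speedup and efficiency implied by the timings above follow from speedup = T(1) / T(p) and efficiency = speedup / p. Here is a minimal Python sketch of that arithmetic. The function name `scaling_table` is only illustrative (it is not part of PETSc), and the times are assumed to be in seconds, like the wall-clock times quoted further down.

```python
def scaling_table(timings):
    """Return (cores, time, speedup, efficiency) rows from {cores: seconds}.

    Speedup is measured against the 1-core run: S(p) = T(1) / T(p).
    Efficiency is speedup per core: E(p) = S(p) / p.
    """
    t_serial = timings[1]
    rows = []
    for cores in sorted(timings):
        t = timings[cores]
        speedup = t_serial / t
        rows.append((cores, t, speedup, speedup / cores))
    return rows


# Solve times from the table above (ASM + ILU(0), with socket binding).
# Assumption: the unit is seconds.
timings = {1: 10.65, 2: 8.19, 4: 5.32}

for cores, t, speedup, eff in scaling_table(timings):
    print(f"{cores:5d} {t:8.2f} {speedup:8.2f} {eff:10.2f}")
```

With these inputs the 2-core run works out to a speedup of about 1.30 (efficiency 0.65) and the 4-core run to about 2.00 (efficiency 0.50), which is essentially what I had before binding.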
Sincerely Yours,
Lei Shi
---------
On Thu, Jun 25, 2015 at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:
> Hi Justin,
>
> Thanks for your suggestion. I will test it asap.
>
> Another thing that confuses me is that the wclock time with 2 cores is almost
> the same as for the serial run when I use asm with sub_ksp_type gmres and ilu0
> on the subdomains. The serial run takes 11.95 sec and the parallel run takes
> 10.5 sec, so there is almost no speedup at all.
>
> I also found that other people got similarly bad speedups when comparing 2
> cores with 1 core. Attached is one slide from J.A. Davis's presentation, which
> I found on the web. As you can see, asm with 2 cores takes almost the same cpu
> time as with 1 core there too! Maybe I am misunderstanding something
> fundamental about asm.
>
> cores  iterations  err       petsc solve wclock time  speedup  efficiency
> 1      2           1.15E-04  11.95                    1
> 2      5           2.05E-02  10.5                     1.01     0.50
> 4      6           2.19E-02  7.64                     1.39     0.34
>
> Sincerely Yours,
>
> Lei Shi
> ---------
>
> On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> Hi Lei,
>>
>> Depending on your machine and MPI library, you may have to use smart
>> process to core/socket bindings to achieve better speedup. Instructions can
>> be found here:
>>
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
>>
>>
>> Justin
>>
>> On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
>>
>>> Hi Matt,
>>>
>>> Thanks for your suggestions. Here is the output from the STREAMS test on one
>>> node, which has 20 cores. I ran it with up to 20 processes. Attached is the
>>> dumped output with your suggested options. Really appreciate your help!!!
>>>
>>> Number of MPI processes 1
>>> Function Rate (MB/s)
>>> Copy: 13816.9372
>>> Scale: 8020.1809
>>> Add: 12762.3830
>>> Triad: 11852.5016
>>>
>>> Number of MPI processes 2
>>> Function Rate (MB/s)
>>> Copy: 22748.7681
>>> Scale: 14081.4906
>>> Add: 18998.4516
>>> Triad: 18303.2494
>>>
>>> Number of MPI processes 3
>>> Function Rate (MB/s)
>>> Copy: 34045.2510
>>> Scale: 23410.9767
>>> Add: 30320.2702
>>> Triad: 30163.7977
>>>
>>> Number of MPI processes 4
>>> Function Rate (MB/s)
>>> Copy: 36875.5349
>>> Scale: 29440.1694
>>> Add: 36971.1860
>>> Triad: 37377.0103
>>>
>>> Number of MPI processes 5
>>> Function Rate (MB/s)
>>> Copy: 32272.8763
>>> Scale: 30316.3435
>>> Add: 38022.0193
>>> Triad: 38815.4830
>>>
>>> Number of MPI processes 6
>>> Function Rate (MB/s)
>>> Copy: 35619.8925
>>> Scale: 34457.5078
>>> Add: 41419.3722
>>> Triad: 35825.3621
>>>
>>> Number of MPI processes 7
>>> Function Rate (MB/s)
>>> Copy: 55284.2420
>>> Scale: 47706.8009
>>> Add: 59076.4735
>>> Triad: 61680.5559
>>>
>>> Number of MPI processes 8
>>> Function Rate (MB/s)
>>> Copy: 44525.8901
>>> Scale: 48949.9599
>>> Add: 57437.7784
>>> Triad: 56671.0593
>>>
>>> Number of MPI processes 9
>>> Function Rate (MB/s)
>>> Copy: 34375.7364
>>> Scale: 29507.5293
>>> Add: 45405.3120
>>> Triad: 39518.7559
>>>
>>> Number of MPI processes 10
>>> Function Rate (MB/s)
>>> Copy: 34278.0415
>>> Scale: 41721.7843
>>> Add: 46642.2465
>>> Triad: 45454.7000
>>>
>>> Number of MPI processes 11
>>> Function Rate (MB/s)
>>> Copy: 38093.7244
>>> Scale: 35147.2412
>>> Add: 45047.0853
>>> Triad: 44983.2013
>>>
>>> Number of MPI processes 12
>>> Function Rate (MB/s)
>>> Copy: 39750.8760
>>> Scale: 52038.0631
>>> Add: 55552.9503
>>> Triad: 54884.3839
>>>
>>> Number of MPI processes 13
>>> Function Rate (MB/s)
>>> Copy: 60839.0248
>>> Scale: 74143.7458
>>> Add: 85545.3135
>>> Triad: 85667.6551
>>>
>>> Number of MPI processes 14
>>> Function Rate (MB/s)
>>> Copy: 37766.2343
>>> Scale: 40279.1928
>>> Add: 49992.8572
>>> Triad: 50303.4809
>>>
>>> Number of MPI processes 15
>>> Function Rate (MB/s)
>>> Copy: 49762.3670
>>> Scale: 59077.8251
>>> Add: 60407.9651
>>> Triad: 61691.9456
>>>
>>> Number of MPI processes 16
>>> Function Rate (MB/s)
>>> Copy: 31996.7169
>>> Scale: 36962.4860
>>> Add: 40183.5060
>>> Triad: 41096.0512
>>>
>>> Number of MPI processes 17
>>> Function Rate (MB/s)
>>> Copy: 36348.3839
>>> Scale: 39108.6761
>>> Add: 46853.4476
>>> Triad: 47266.1778
>>>
>>> Number of MPI processes 18
>>> Function Rate (MB/s)
>>> Copy: 40438.7558
>>> Scale: 43195.5785
>>> Add: 53063.4321
>>> Triad: 53605.0293
>>>
>>> Number of MPI processes 19
>>> Function Rate (MB/s)
>>> Copy: 30739.4908
>>> Scale: 34280.8118
>>> Add: 40710.5155
>>> Triad: 43330.9503
>>>
>>> Number of MPI processes 20
>>> Function Rate (MB/s)
>>> Copy: 37488.3777
>>> Scale: 41791.8999
>>> Add: 49518.9604
>>> Triad: 48908.2677
>>> ------------------------------------------------
>>> np speedup
>>> 1 1.0
>>> 2 1.54
>>> 3 2.54
>>> 4 3.15
>>> 5 3.27
>>> 6 3.02
>>> 7 5.2
>>> 8 4.78
>>> 9 3.33
>>> 10 3.84
>>> 11 3.8
>>> 12 4.63
>>> 13 7.23
>>> 14 4.24
>>> 15 5.2
>>> 16 3.47
>>> 17 3.99
>>> 18 4.52
>>> 19 3.66
>>> 20 4.13
>>>
>>>
>>>
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>
>>>> 1) In order to understand this, we have to disentangle the various
>>>> effects. First, run the STREAMS benchmark
>>>>
>>>> make NPMAX=4 streams
>>>>
>>>> This will tell you the maximum speedup you can expect on this machine.
>>>>
>>>> 2) For these test cases, also send the output of
>>>>
>>>> -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> I'm trying to improve the parallel efficiency of the gmres solve in my
>>>>> CFD solver, where PETSc gmres is used to solve the linear system generated
>>>>> by Newton's method. To test its efficiency, I started with a very simple
>>>>> inviscid subsonic 3D flow as the first test case. The parallel efficiency of
>>>>> the gmres solve with asm as the preconditioner is very bad. The results are
>>>>> from our latest cluster. Right now, I'm only looking at the wclock time of
>>>>> the ksp_solve.
>>>>>
>>>>> 1. First I tested ASM with gmres and ilu 0 on the subdomains. The
>>>>> cpu time on 2 cores is almost the same as for the serial run. Here are the
>>>>> options for this case:
>>>>>
>>>>> -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol
>>>>> 1e-30
>>>>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>>>>> -sub_pc_factor_fill 1.9
>>>>>
>>>>> The iteration count increases a lot for the parallel runs.
>>>>>
>>>>> cores  iterations  err       petsc solve wclock time  speedup  efficiency
>>>>> 1      2           1.15E-04  11.95                    1
>>>>> 2      5           2.05E-02  10.5                     1.01     0.50
>>>>> 4      6           2.19E-02  7.64                     1.39     0.34
>>>>>
>>>>> 2. Then I tested ASM with ilu 0 as the preconditioner only. The
>>>>> cpu time on 2 cores is better than in the 1st test, but the speedup is still
>>>>> very bad. Here are the options I'm using:
>>>>>
>>>>> -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>>>> -sub_pc_factor_fill 1.9
>>>>>
>>>>> cores  iterations  err       petsc solve cpu time  speedup  efficiency
>>>>> 1      10          4.54E-04  10.68                 1
>>>>> 2      11          9.55E-04  8.2                   1.30     0.65
>>>>> 4      12          3.59E-04  5.26                  2.03     0.50
>>>>>
>>>>> These results are from a third order "DG" scheme with a very coarse
>>>>> 3D mesh (480 elements). I believe I should get some speedup for this test
>>>>> even on this coarse mesh.
>>>>>
>>>>> My question is: why does asm with a local solve take much longer
>>>>> than asm used as a preconditioner only? The accuracy is also very bad.
>>>>> I tested changing the asm overlap to 2, but that made it even worse.
>>>>>
>>>>> If I use a larger mesh (~4000 elements), the 2nd case with asm as the
>>>>> preconditioner gives me a better speedup, but still not a very good one.
>>>>>
>>>>>
>>>>> cores  iterations  err       petsc solve cpu time  speedup  efficiency
>>>>> 1      7           1.91E-02  97.32                 1
>>>>> 2      7           2.07E-02  64.94                 1.5      0.74
>>>>> 4      7           2.61E-02  36.97                 2.6      0.65
>>>>>
>>>>> Attached is the log_summary output dumped from PETSc. Any suggestions are
>>>>> welcome. I really appreciate it.
>>>>>
>>>>>
>>>>> Sincerely Yours,
>>>>>
>>>>> Lei Shi
>>>>> ---------
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>
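
A note on the STREAMS numbers quoted above: the "np speedup" column appears to be the Triad rate divided by the 1-process Triad rate (for example 18303.2494 / 11852.5016 gives about 1.54). If the solve is limited by memory bandwidth, which is an assumption here and not something these numbers prove, that ratio is roughly the maximum speedup Matt refers to. The sketch below compares the solver speedups from the ASM + ILU(0) timings at the top of this message against that ceiling. The function names are illustrative only.

```python
# Triad rates (MB/s) copied from the STREAMS output quoted above.
triad = {1: 11852.5016, 2: 18303.2494, 4: 37377.0103}

# Solve times from the ASM + ILU(0) table at the top of this message
# (assumed to be in seconds).
solve_time = {1: 10.65, 2: 8.19, 4: 5.32}


def bandwidth_bound(triad_rates, procs):
    """Speedup ceiling for a bandwidth-limited kernel on `procs` processes."""
    return triad_rates[procs] / triad_rates[1]


def fraction_of_bound(times, triad_rates, procs):
    """Observed solver speedup divided by the STREAMS-derived ceiling."""
    observed = times[1] / times[procs]
    return observed / bandwidth_bound(triad_rates, procs)


for p in (2, 4):
    bound = bandwidth_bound(triad, p)
    frac = fraction_of_bound(solve_time, triad, p)
    print(f"np={p}: bound {bound:.2f}, solver reaches {frac:.0%} of it")
```

Under that assumption, the 2-process solve reaches roughly 84% of the bandwidth ceiling and the 4-process solve roughly 63%.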
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot - 06252015 - 03:44:53 PM.png
Type: image/png
Size: 159230 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150625/4a81ccb3/attachment-0001.png>