[petsc-users] Parallel efficiency of the gmres solver with ASM

Lei Shi stoneszone at gmail.com
Thu Jun 25 16:27:17 CDT 2015


Hi Justin,

I tested with mpirun --binding cpu:sockets .... Unfortunately, the results
are almost the same as before. Thanks.

-ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
-ksp_gmres_restart 30 -ksp_pc_side right
-pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0  -sub_pc_factor_fill
1.9
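
For reference, the same options combined into a single run command might look
like this (sticking with the binding flag from above; "./mysolver" is only a
placeholder for the actual executable):

  mpirun -np 2 --binding cpu:sockets ./mysolver \
    -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50 \
    -ksp_gmres_restart 30 -ksp_pc_side right \
    -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0 -sub_pc_factor_fill 1.9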

cores  iterations  err       petsc solve cpu time (s)
1      10          4.54E-04  10.65
2      11          9.55E-04  8.19
4      12          3.59E-04  5.32
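
(For reference, the implied speedup here is 10.65/8.19 ≈ 1.30 on 2 cores,
i.e. efficiency 0.65, and 10.65/5.32 ≈ 2.00 on 4 cores, i.e. efficiency 0.50.)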

Sincerely Yours,

Lei Shi
---------

On Thu, Jun 25, 2015 at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:

> Hi Justin,
>
> Thanks for your suggestion. I will test it asap.
>
> Another thing confusing me is that the wall-clock time with 2 cores is almost
> the same as for the serial run when I use ASM with -sub_ksp_type gmres and
> ILU(0) on the subdomains. The serial run takes 11.95 sec and the parallel run
> takes 10.5 sec. There is almost no speedup at all.
>
> I also found that other people have gotten similarly poor speedups when
> comparing 2 cores with 1 core. Attached is a slide from J.A. Davis's
> presentation, which I found on the web. As you can see there, ASM with 2
> cores takes almost the same CPU time as 1 core too! Maybe I am
> misunderstanding something fundamental about ASM.
>
> cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
> 1      2           1.15E-04  11.95                        1        1
> 2      5           2.05E-02  10.5                         1.01     0.50
> 4      6           2.19E-02  7.64                         1.39     0.34
>
> Sincerely Yours,
>
> Lei Shi
> ---------
>
> On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> Hi Lei,
>>
>> Depending on your machine and MPI library, you may have to use smart
>> process-to-core/socket bindings to achieve better speedup. Instructions can
>> be found here:
>>
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
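>>
>> With Open MPI, for instance, one can bind ranks and verify the resulting
>> placement with something like the following (a sketch only: the flag names
>> differ between MPI implementations, and "./mysolver" is a placeholder for
>> the actual executable):
>>
>>   mpirun -np 2 --map-by socket --bind-to core --report-bindings ./mysolver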
>>
>>
>> Justin
>>
>> On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
>>
>>> Hi Matt,
>>>
>>> Thanks for your suggestions. Here is the output from the STREAMS test on
>>> one node, which has 20 cores; I ran it with up to 20 processes. Attached
>>> are the dumped outputs with your suggested options. Really appreciate your
>>> help!!!
>>>
>>> Number of MPI processes 1
>>> Function      Rate (MB/s)
>>> Copy:       13816.9372
>>> Scale:       8020.1809
>>> Add:        12762.3830
>>> Triad:      11852.5016
>>>
>>> Number of MPI processes 2
>>> Function      Rate (MB/s)
>>> Copy:       22748.7681
>>> Scale:      14081.4906
>>> Add:        18998.4516
>>> Triad:      18303.2494
>>>
>>> Number of MPI processes 3
>>> Function      Rate (MB/s)
>>> Copy:       34045.2510
>>> Scale:      23410.9767
>>> Add:        30320.2702
>>> Triad:      30163.7977
>>>
>>> Number of MPI processes 4
>>> Function      Rate (MB/s)
>>> Copy:       36875.5349
>>> Scale:      29440.1694
>>> Add:        36971.1860
>>> Triad:      37377.0103
>>>
>>> Number of MPI processes 5
>>> Function      Rate (MB/s)
>>> Copy:       32272.8763
>>> Scale:      30316.3435
>>> Add:        38022.0193
>>> Triad:      38815.4830
>>>
>>> Number of MPI processes 6
>>> Function      Rate (MB/s)
>>> Copy:       35619.8925
>>> Scale:      34457.5078
>>> Add:        41419.3722
>>> Triad:      35825.3621
>>>
>>> Number of MPI processes 7
>>> Function      Rate (MB/s)
>>> Copy:       55284.2420
>>> Scale:      47706.8009
>>> Add:        59076.4735
>>> Triad:      61680.5559
>>>
>>> Number of MPI processes 8
>>> Function      Rate (MB/s)
>>> Copy:       44525.8901
>>> Scale:      48949.9599
>>> Add:        57437.7784
>>> Triad:      56671.0593
>>>
>>> Number of MPI processes 9
>>> Function      Rate (MB/s)
>>> Copy:       34375.7364
>>> Scale:      29507.5293
>>> Add:        45405.3120
>>> Triad:      39518.7559
>>>
>>> Number of MPI processes 10
>>> Function      Rate (MB/s)
>>> Copy:       34278.0415
>>> Scale:      41721.7843
>>> Add:        46642.2465
>>> Triad:      45454.7000
>>>
>>> Number of MPI processes 11
>>> Function      Rate (MB/s)
>>> Copy:       38093.7244
>>> Scale:      35147.2412
>>> Add:        45047.0853
>>> Triad:      44983.2013
>>>
>>> Number of MPI processes 12
>>> Function      Rate (MB/s)
>>> Copy:       39750.8760
>>> Scale:      52038.0631
>>> Add:        55552.9503
>>> Triad:      54884.3839
>>>
>>> Number of MPI processes 13
>>> Function      Rate (MB/s)
>>> Copy:       60839.0248
>>> Scale:      74143.7458
>>> Add:        85545.3135
>>> Triad:      85667.6551
>>>
>>> Number of MPI processes 14
>>> Function      Rate (MB/s)
>>> Copy:       37766.2343
>>> Scale:      40279.1928
>>> Add:        49992.8572
>>> Triad:      50303.4809
>>>
>>> Number of MPI processes 15
>>> Function      Rate (MB/s)
>>> Copy:       49762.3670
>>> Scale:      59077.8251
>>> Add:        60407.9651
>>> Triad:      61691.9456
>>>
>>> Number of MPI processes 16
>>> Function      Rate (MB/s)
>>> Copy:       31996.7169
>>> Scale:      36962.4860
>>> Add:        40183.5060
>>> Triad:      41096.0512
>>>
>>> Number of MPI processes 17
>>> Function      Rate (MB/s)
>>> Copy:       36348.3839
>>> Scale:      39108.6761
>>> Add:        46853.4476
>>> Triad:      47266.1778
>>>
>>> Number of MPI processes 18
>>> Function      Rate (MB/s)
>>> Copy:       40438.7558
>>> Scale:      43195.5785
>>> Add:        53063.4321
>>> Triad:      53605.0293
>>>
>>> Number of MPI processes 19
>>> Function      Rate (MB/s)
>>> Copy:       30739.4908
>>> Scale:      34280.8118
>>> Add:        40710.5155
>>> Triad:      43330.9503
>>>
>>> Number of MPI processes 20
>>> Function      Rate (MB/s)
>>> Copy:       37488.3777
>>> Scale:      41791.8999
>>> Add:        49518.9604
>>> Triad:      48908.2677
>>> ------------------------------------------------
>>> np  speedup
>>> 1 1.0
>>> 2 1.54
>>> 3 2.54
>>> 4 3.15
>>> 5 3.27
>>> 6 3.02
>>> 7 5.2
>>> 8 4.78
>>> 9 3.33
>>> 10 3.84
>>> 11 3.8
>>> 12 4.63
>>> 13 7.23
>>> 14 4.24
>>> 15 5.2
>>> 16 3.47
>>> 17 3.99
>>> 18 4.52
>>> 19 3.66
>>> 20 4.13
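>>>
>>> (These speedups are essentially the Triad bandwidth ratios above, e.g.
>>> 18303.2/11852.5 ≈ 1.54 for 2 processes and 61680.6/11852.5 ≈ 5.2 for 7
>>> processes, so they are roughly the ceiling a memory-bandwidth-bound solve
>>> can reach on this node.)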
>>>
>>>
>>>
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>
>>>> 1) In order to understand this, we have to disentangle the various
>>>> effects. First, run the STREAMS benchmark:
>>>>
>>>>   make NPMAX=4 streams
>>>>
>>>> This will tell you the maximum speedup you can expect on this machine.
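>>>>
>>>> (This assumes a run from the PETSc source directory, e.g.:
>>>>
>>>>   cd $PETSC_DIR
>>>>   make NPMAX=4 streams
>>>>
>>>> which reports Copy/Scale/Add/Triad memory bandwidth for 1 up to 4 MPI
>>>> processes.)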
>>>>
>>>> 2) For these test cases, also send the output of
>>>>
>>>>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>
>>>>> I'm trying to improve the parallel efficiency of the GMRES solve in my
>>>>> CFD solver, where PETSc's GMRES is used to solve the linear systems
>>>>> generated by Newton's method. To test its efficiency, I started with a
>>>>> very simple inviscid subsonic 3D flow as the first test case. The
>>>>> parallel efficiency of the GMRES solve with ASM as the preconditioner is
>>>>> very bad. The results below are from our latest cluster; right now, I'm
>>>>> only looking at the wall-clock time of the KSP solve.
>>>>>
>>>>>    1. First I tested ASM with GMRES and ILU(0) on the subdomains; the
>>>>>    CPU time on 2 cores is almost the same as for the serial run. Here
>>>>>    are the options for this case:
>>>>>
>>>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol
>>>>> 1e-30
>>>>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>>>>> -sub_pc_factor_fill 1.9
>>>>>
>>>>> The iteration count increases a lot for the parallel runs.
>>>>>
>>>>> cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
>>>>> 1      2           1.15E-04  11.95                        1        1
>>>>> 2      5           2.05E-02  10.5                         1.01     0.50
>>>>> 4      6           2.19E-02  7.64                         1.39     0.34
>>>>>
>>>>>       2. Then I tested ASM with ILU(0) as the preconditioner only; the
>>>>> CPU time on 2 cores is better than in the first test, but the speedup is
>>>>> still very bad. Here are the options I'm using:
>>>>>
>>>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>>>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>>>>  -sub_pc_factor_fill 1.9
>>>>>
>>>>> cores  iterations  err       petsc solve cpu time (s)  speedup  efficiency
>>>>> 1      10          4.54E-04  10.68                     1        1
>>>>> 2      11          9.55E-04  8.2                       1.30     0.65
>>>>> 4      12          3.59E-04  5.26                      2.03     0.50
>>>>>
>>>>>    These results are from a third-order DG scheme on a very coarse 3D
>>>>> mesh (480 elements). I believe I should get some speedup for this test
>>>>> even on this coarse mesh.
>>>>>
>>>>>   My question is: why does ASM with a local solve take so much longer
>>>>> than ASM as a preconditioner only? The accuracy is much worse as well. I
>>>>> have tried increasing the ASM overlap to 2 (-pc_asm_overlap 2), but that
>>>>> makes it even worse.
>>>>>
>>>>>   If I use a larger mesh (~4000 elements), the second case, with ASM as
>>>>> the preconditioner only, gives me a better speedup, but it is still not
>>>>> very good.
>>>>>
>>>>>
>>>>> cores  iterations  err       petsc solve cpu time (s)  speedup  efficiency
>>>>> 1      7           1.91E-02  97.32                     1        1
>>>>> 2      7           2.07E-02  64.94                     1.5      0.74
>>>>> 4      7           2.61E-02  36.97                     2.6      0.65
>>>>>
>>>>> Attached is the -log_summary output dumped from PETSc; any suggestions
>>>>> are welcome. I really appreciate it.
>>>>>
>>>>>
>>>>> Sincerely Yours,
>>>>>
>>>>> Lei Shi
>>>>> ---------
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>

Attachment: Screenshot - 06252015 - 03:44:53 PM.png (image/png, 159230 bytes)
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150625/4a81ccb3/attachment-0001.png>

