[petsc-users] Parallel efficiency of the gmres solver with ASM

Justin Chang jychang48 at gmail.com
Thu Jun 25 15:34:50 CDT 2015


Hi Lei,

Depending on your machine and MPI library, you may have to use smart
process-to-core/socket bindings to achieve better speedup. Instructions can
be found here:

http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
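
For example, with MPICH's Hydra launcher something like

  mpiexec -n 8 -bind-to core ./your_app

often helps (./your_app is just a placeholder here; the exact flag names
vary by MPI implementation and version, e.g. Open MPI spells it
--bind-to core --map-by socket, so check your mpiexec man page).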


Justin

On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:

> Hi Matt,
>
> Thanks for your suggestions. Here is the output from the STREAM test on one
> node, which has 20 cores. I ran it with up to 20 processes. Attached is the
> dumped output with your suggested options. Really appreciate your help!!!
>
> Number of MPI processes 1
> Function      Rate (MB/s)
> Copy:       13816.9372
> Scale:       8020.1809
> Add:        12762.3830
> Triad:      11852.5016
>
> Number of MPI processes 2
> Function      Rate (MB/s)
> Copy:       22748.7681
> Scale:      14081.4906
> Add:        18998.4516
> Triad:      18303.2494
>
> Number of MPI processes 3
> Function      Rate (MB/s)
> Copy:       34045.2510
> Scale:      23410.9767
> Add:        30320.2702
> Triad:      30163.7977
>
> Number of MPI processes 4
> Function      Rate (MB/s)
> Copy:       36875.5349
> Scale:      29440.1694
> Add:        36971.1860
> Triad:      37377.0103
>
> Number of MPI processes 5
> Function      Rate (MB/s)
> Copy:       32272.8763
> Scale:      30316.3435
> Add:        38022.0193
> Triad:      38815.4830
>
> Number of MPI processes 6
> Function      Rate (MB/s)
> Copy:       35619.8925
> Scale:      34457.5078
> Add:        41419.3722
> Triad:      35825.3621
>
> Number of MPI processes 7
> Function      Rate (MB/s)
> Copy:       55284.2420
> Scale:      47706.8009
> Add:        59076.4735
> Triad:      61680.5559
>
> Number of MPI processes 8
> Function      Rate (MB/s)
> Copy:       44525.8901
> Scale:      48949.9599
> Add:        57437.7784
> Triad:      56671.0593
>
> Number of MPI processes 9
> Function      Rate (MB/s)
> Copy:       34375.7364
> Scale:      29507.5293
> Add:        45405.3120
> Triad:      39518.7559
>
> Number of MPI processes 10
> Function      Rate (MB/s)
> Copy:       34278.0415
> Scale:      41721.7843
> Add:        46642.2465
> Triad:      45454.7000
>
> Number of MPI processes 11
> Function      Rate (MB/s)
> Copy:       38093.7244
> Scale:      35147.2412
> Add:        45047.0853
> Triad:      44983.2013
>
> Number of MPI processes 12
> Function      Rate (MB/s)
> Copy:       39750.8760
> Scale:      52038.0631
> Add:        55552.9503
> Triad:      54884.3839
>
> Number of MPI processes 13
> Function      Rate (MB/s)
> Copy:       60839.0248
> Scale:      74143.7458
> Add:        85545.3135
> Triad:      85667.6551
>
> Number of MPI processes 14
> Function      Rate (MB/s)
> Copy:       37766.2343
> Scale:      40279.1928
> Add:        49992.8572
> Triad:      50303.4809
>
> Number of MPI processes 15
> Function      Rate (MB/s)
> Copy:       49762.3670
> Scale:      59077.8251
> Add:        60407.9651
> Triad:      61691.9456
>
> Number of MPI processes 16
> Function      Rate (MB/s)
> Copy:       31996.7169
> Scale:      36962.4860
> Add:        40183.5060
> Triad:      41096.0512
>
> Number of MPI processes 17
> Function      Rate (MB/s)
> Copy:       36348.3839
> Scale:      39108.6761
> Add:        46853.4476
> Triad:      47266.1778
>
> Number of MPI processes 18
> Function      Rate (MB/s)
> Copy:       40438.7558
> Scale:      43195.5785
> Add:        53063.4321
> Triad:      53605.0293
>
> Number of MPI processes 19
> Function      Rate (MB/s)
> Copy:       30739.4908
> Scale:      34280.8118
> Add:        40710.5155
> Triad:      43330.9503
>
> Number of MPI processes 20
> Function      Rate (MB/s)
> Copy:       37488.3777
> Scale:      41791.8999
> Add:        49518.9604
> Triad:      48908.2677
> ------------------------------------------------
> np  speedup
> 1 1.0
> 2 1.54
> 3 2.54
> 4 3.15
> 5 3.27
> 6 3.02
> 7 5.2
> 8 4.78
> 9 3.33
> 10 3.84
> 11 3.8
> 12 4.63
> 13 7.23
> 14 4.24
> 15 5.2
> 16 3.47
> 17 3.99
> 18 4.52
> 19 3.66
> 20 4.13
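>
> (The speedup column here is the Triad rate relative to the np=1 run,
> e.g. 18303.2 / 11852.5 ~ 1.54 for np=2. It saturates around 4-5x for most
> process counts, which caps the speedup any memory-bandwidth-bound solve
> can reach on this node.)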
>
>
>
>
> Sincerely Yours,
>
> Lei Shi
> ---------
>
> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
> wrote:
>
>> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
>>
>>> Hello,
>>>
>>
>> 1) In order to understand this, we have to disentangle the various effects.
>> First, run the STREAMS benchmark
>>
>>   make NPMAX=4 streams
>>
>> This will tell you the maximum speedup you can expect on this machine.
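>>
>> (NPMAX is just the largest process count the benchmark sweeps over, so on
>> a 20-core node you would use NPMAX=20 to cover the whole node.)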
>>
>> 2) For these test cases, also send the output of
>>
>>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> I'm trying to improve the parallel efficiency of the GMRES solve in my
>>> code. In my CFD solver, PETSc's GMRES is used to solve the linear system
>>> generated by Newton's method. To test its efficiency, I started with a
>>> very simple inviscid subsonic 3D flow as the first test case. The parallel
>>> efficiency of the GMRES solve with ASM as the preconditioner is very bad.
>>> The results are from our latest cluster. Right now, I'm only looking at
>>> the wall-clock time of the KSP solve.
>>>
>>>    1. First I tested ASM with GMRES and ILU(0) for the subdomains; the
>>>    CPU time on 2 cores is almost the same as for the serial run. Here are
>>>    the options for this case:
>>>
>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
>>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>>> -sub_pc_factor_fill 1.9
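>>>
>>> (In case it is useful to see this in code, here is a minimal sketch of
>>> the same configuration through the C API; it assumes A, b, x are
>>> assembled elsewhere and omits error checking. KSPSetFromOptions still
>>> picks up the -sub_* options for the subdomain solvers.)
>>>
>>> #include <petscksp.h>
>>>
>>> /* GMRES(30), right preconditioning, ASM; mirrors the options above. */
>>> void solve_with_asm(Mat A, Vec b, Vec x)
>>> {
>>>   KSP ksp;
>>>   PC  pc;
>>>
>>>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>>>   KSPSetOperators(ksp, A, A);
>>>   KSPSetType(ksp, KSPGMRES);
>>>   KSPGMRESSetRestart(ksp, 30);
>>>   KSPSetTolerances(ksp, 1e-5, 1e-50, PETSC_DEFAULT, 100);
>>>   KSPSetPCSide(ksp, PC_RIGHT);
>>>   KSPGetPC(ksp, &pc);
>>>   PCSetType(pc, PCASM);
>>>   KSPSetFromOptions(ksp);  /* command-line options still override */
>>>   KSPSolve(ksp, b, x);
>>>   KSPDestroy(&ksp);
>>> }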
>>>
>>> The iteration counts increase a lot for the parallel runs.
>>>
>>> cores  iterations  err       petsc solve wclock time  speedup  efficiency
>>> 1      2           1.15E-04  11.95                    1
>>> 2      5           2.05E-02  10.5                     1.01     0.50
>>> 4      6           2.19E-02  7.64                     1.39     0.34
>>>
>>>       2. Then I tested ASM with ILU(0) as the preconditioner only; the
>>> CPU time on 2 cores is better than in the 1st test, but the speedup is
>>> still very bad. Here are the options I'm using:
>>>
>>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>>> -ksp_gmres_restart 30 -ksp_pc_side right
>>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>>  -sub_pc_factor_fill 1.9
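>>>
>>> (The only difference from the 1st test is the subdomain solver; as a
>>> sketch, the same choice can be made through the API, assuming the ksp and
>>> pc objects from the sketch above:)
>>>
>>> KSP      *subksp;
>>> PC        subpc;
>>> PetscInt  nlocal, first, i;
>>>
>>> KSPSetUp(ksp);                                /* subsolvers exist only after setup */
>>> PCASMGetSubKSP(pc, &nlocal, &first, &subksp);
>>> for (i = 0; i < nlocal; i++) {
>>>   KSPSetType(subksp[i], KSPPREONLY);          /* one ILU(0) application (this test);
>>>                                                  KSPGMRES here gives the 1st test  */
>>>   KSPGetPC(subksp[i], &subpc);
>>>   PCSetType(subpc, PCILU);
>>> }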
>>>
>>> cores  iterations  err       petsc solve cpu time  speedup  efficiency
>>> 1      10          4.54E-04  10.68                 1
>>> 2      11          9.55E-04  8.2                   1.30     0.65
>>> 4      12          3.59E-04  5.26                  2.03     0.50
>>>
>>>    Those results are from a third-order "DG" scheme on a very coarse 3D
>>> mesh (480 elements). I believe I should get some speedup for this test
>>> even on this coarse mesh.
>>>
>>>   My question is: why does ASM with a local solve take much longer than
>>> ASM as a preconditioner only? Also, the accuracy is very bad too. I have
>>> tried changing the overlap of ASM to 2, but that makes it even worse.
>>>
>>>   If I use a larger mesh (~4000 elements), the 2nd case with ASM as the
>>> preconditioner gives me a better speedup, but it is still not very good.
>>>
>>>
>>> cores  iterations  err       petsc solve cpu time  speedup  efficiency
>>> 1      7           1.91E-02  97.32                 1
>>> 2      7           2.07E-02  64.94                 1.5      0.74
>>> 4      7           2.61E-02  36.97                 2.6      0.65
>>>
>>> Attached is the log_summary output dumped from PETSc; any suggestions
>>> are welcome. I really appreciate it.
>>>
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>