[petsc-users] Parallel efficiency of the gmres solver with ASM

Thu Jun 25 21:35:12 CDT 2015

> On Jun 25, 2015, at 9:25 PM, Lei Shi <stoneszone at gmail.com> wrote:
> 
> Barry,
> 
> Thanks a lot for your reply. Your explanation helps me understand my test results. So in this case, to compute the speedup for a strong scalability test, should I use the wall clock time of a multi-core run as the reference time instead of the serial run time?
> 
> e.g. for computing the speedup on 16 cores, I should use
> 
> 
> 
> instead of using
> 
> 
> 
> Another question: when I use ASM as a preconditioner only, the speedup on 2 cores is much better than when I use ASM with a local solve (-sub_ksp_type gmres).
> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50 
> -ksp_gmres_restart 30 -ksp_pc_side right
> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0  -sub_pc_factor_fill 1.9
> cores	iterations	err	petsc solve cpu time	speedup	efficiency
> 1	10	4.54E-04	10.68	1	
> 2	11	9.55E-04	8.2	1.30	0.65
> 4	12	3.59E-04	5.26	2.03	0.50
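The speedup and efficiency columns in the table above can be recomputed from the solve times. The following is a minimal sketch, assuming the 1-core run is the reference time; the function name `strong_scaling` is made up for illustration and is not part of PETSc.

```python
def strong_scaling(times):
    """Map core count -> (speedup, efficiency), using the 1-core time as reference."""
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}


# Solve times in seconds, copied from the table above
# (GMRES with ASM + ILU(0), no inner GMRES on the subdomains).
asm_ilu0 = {1: 10.68, 2: 8.2, 4: 5.26}

results = strong_scaling(asm_ilu0)
for cores, (speedup, efficiency) in results.items():
    print(f"{cores:2d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Swapping in a different reference run (for example the 2-core time) only changes `t1` and the core-count ratio, so the same few lines can be adapted if the serial run is not a fair baseline.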

  Using -sub_ksp_type gmres results in "oversolving" the local problems, which takes more time but does not improve (by much) the convergence of the "outer" linear solver. Unfortunately I don't know any way to automatically adjust the accuracy of the inner solver to minimize the solve time of the outer solve.

  The (relative) performance of the many solvers depends greatly on the size of the problem (for example, for small problems using no preconditioner is often best, but for larger problems something like GAMG is best). So, in order to reach a conclusion about which solver to use, you need to run the tests at the size of the problem you want to solve (not a smaller problem). I recommend running a new set of tests, at the problem size you need to solve, with no preconditioner, bjacobi, ASM, and GAMG; this will help you decide what works best for your problem.
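  One way to organize such a comparison is to generate the command lines for each preconditioner and core count up front. This is only a sketch: the executable name `./my_cfd_solver`, the `mpiexec` launcher, and the choice of ILU as the block solver for bjacobi and ASM are assumptions for illustration. The outer GMRES options are the ones used earlier in this thread.

```python
# Outer-solver options copied from the thread.
COMMON = [
    "-ksp_type", "gmres", "-ksp_max_it", "100", "-ksp_rtol", "1e-5",
    "-ksp_gmres_restart", "30", "-ksp_pc_side", "right",
    "-ksp_converged_reason",
]

# One option set per preconditioner to compare.
PRECONDITIONERS = {
    "none": ["-pc_type", "none"],
    "bjacobi": ["-pc_type", "bjacobi", "-sub_pc_type", "ilu"],
    "asm": ["-pc_type", "asm", "-sub_pc_type", "ilu"],
    "gamg": ["-pc_type", "gamg"],
}


def build_commands(executable, core_counts):
    """Return (preconditioner, cores, argv) for every combination to be run."""
    commands = []
    for name, pc_options in PRECONDITIONERS.items():
        for cores in core_counts:
            argv = ["mpiexec", "-n", str(cores), executable] + COMMON + pc_options
            commands.append((name, cores, argv))
    return commands


# "./my_cfd_solver" is a hypothetical executable name.
commands = build_commands("./my_cfd_solver", [1, 2, 4])
for name, cores, argv in commands:
    print(" ".join(argv))
```

  The printed lines can be pasted into a job script; each run should use the full-size problem so that the timings are comparable.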

  Barry

> 
> 
> 
> 
> 
> 
> What are the main differences between those two? Thanks. 
> 
> Would you please take a look at my profiling data? Do you think this is the best parallel efficiency I can get from PETSc? How can I improve it? 
> 
> Best,
> 
> Lei Shi
> 
> 
> 
> 
> 
> 
> Sincerely Yours,
> 
> Lei Shi 
> ---------
> 
> On Thu, Jun 25, 2015 at 5:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> > On Jun 25, 2015, at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:
> >
> > Hi Justin,
> >
> > Thanks for your suggestion. I will test it asap.
> >
> > Another thing confusing me is that the wall-clock time with 2 cores is almost the same as for the serial run when I use ASM with sub_ksp_type gmres and ILU(0) on the subdomains. The serial run takes 11.95 sec and the parallel run takes 10.5 sec. There is almost no speedup at all.
> 
>    On one process ASM is ILU(0), so the setup time is one ILU(0) factorization of the entire matrix. On two processes the ILU(0) is run on a matrix that is more than 1/2 the size of the full matrix, due to the overlap of 1. In particular, for small problems the overlap will pull in most of the matrix, so the setup time is not 1/2 of the setup time on one process. Then the number of iterations increases a good amount in going from 1 to 2 processes. In combination this means that ASM, in going from one to two processes, requires on each process much more than 1/2 the work of running on one process, so you should not expect great speedup in going from one to two processes.
> 
> 
> >
> > And I found that other people got similarly bad speedups when comparing 2 cores with 1 core. Attached is one slide from J.A. Davis's presentation, which I found on the web. As you can see, ASM with 2 cores takes almost the same CPU time as with 1 core too! Maybe I am misunderstanding something fundamental about ASM.
> >
> > cores iterations      err     petsc solve wclock time speedup efficiency
> > 1     2       1.15E-04        11.95   1
> > 2     5       2.05E-02        10.5    1.01    0.50
> > 4     6       2.19E-02        7.64    1.39    0.34
> >
> >
> >
> >
> >
> >
> >
> > <Screenshot - 06252015 - 03:44:53 PM.png>
> > 
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com> wrote:
> > Hi Lei,
> >
> > Depending on your machine and MPI library, you may have to use smart process to core/socket bindings to achieve better speedup. Instructions can be found here:
> >
> > http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
> >
> >
> > Justin
> >
> > On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hi Matt,
> >
> > Thanks for your suggestions. Here is the output from the STREAMS test on one node, which has 20 cores. I ran it with up to 20 processes. Attached is the output dumped with your suggested options. Really appreciate your help!!!
> >
> > Number of MPI processes 1
> > Function      Rate (MB/s)
> > Copy:       13816.9372
> > Scale:       8020.1809
> > Add:        12762.3830
> > Triad:      11852.5016
> >
> > Number of MPI processes 2
> > Function      Rate (MB/s)
> > Copy:       22748.7681
> > Scale:      14081.4906
> > Add:        18998.4516
> > Triad:      18303.2494
> >
> > Number of MPI processes 3
> > Function      Rate (MB/s)
> > Copy:       34045.2510
> > Scale:      23410.9767
> > Add:        30320.2702
> > Triad:      30163.7977
> >
> > Number of MPI processes 4
> > Function      Rate (MB/s)
> > Copy:       36875.5349
> > Scale:      29440.1694
> > Add:        36971.1860
> > Triad:      37377.0103
> >
> > Number of MPI processes 5
> > Function      Rate (MB/s)
> > Copy:       32272.8763
> > Scale:      30316.3435
> > Add:        38022.0193
> > Triad:      38815.4830
> >
> > Number of MPI processes 6
> > Function      Rate (MB/s)
> > Copy:       35619.8925
> > Scale:      34457.5078
> > Add:        41419.3722
> > Triad:      35825.3621
> >
> > Number of MPI processes 7
> > Function      Rate (MB/s)
> > Copy:       55284.2420
> > Scale:      47706.8009
> > Add:        59076.4735
> > Triad:      61680.5559
> >
> > Number of MPI processes 8
> > Function      Rate (MB/s)
> > Copy:       44525.8901
> > Scale:      48949.9599
> > Add:        57437.7784
> > Triad:      56671.0593
> >
> > Number of MPI processes 9
> > Function      Rate (MB/s)
> > Copy:       34375.7364
> > Scale:      29507.5293
> > Add:        45405.3120
> > Triad:      39518.7559
> >
> > Number of MPI processes 10
> > Function      Rate (MB/s)
> > Copy:       34278.0415
> > Scale:      41721.7843
> > Add:        46642.2465
> > Triad:      45454.7000
> >
> > Number of MPI processes 11
> > Function      Rate (MB/s)
> > Copy:       38093.7244
> > Scale:      35147.2412
> > Add:        45047.0853
> > Triad:      44983.2013
> >
> > Number of MPI processes 12
> > Function      Rate (MB/s)
> > Copy:       39750.8760
> > Scale:      52038.0631
> > Add:        55552.9503
> > Triad:      54884.3839
> >
> > Number of MPI processes 13
> > Function      Rate (MB/s)
> > Copy:       60839.0248
> > Scale:      74143.7458
> > Add:        85545.3135
> > Triad:      85667.6551
> >
> > Number of MPI processes 14
> > Function      Rate (MB/s)
> > Copy:       37766.2343
> > Scale:      40279.1928
> > Add:        49992.8572
> > Triad:      50303.4809
> >
> > Number of MPI processes 15
> > Function      Rate (MB/s)
> > Copy:       49762.3670
> > Scale:      59077.8251
> > Add:        60407.9651
> > Triad:      61691.9456
> >
> > Number of MPI processes 16
> > Function      Rate (MB/s)
> > Copy:       31996.7169
> > Scale:      36962.4860
> > Add:        40183.5060
> > Triad:      41096.0512
> >
> > Number of MPI processes 17
> > Function      Rate (MB/s)
> > Copy:       36348.3839
> > Scale:      39108.6761
> > Add:        46853.4476
> > Triad:      47266.1778
> >
> > Number of MPI processes 18
> > Function      Rate (MB/s)
> > Copy:       40438.7558
> > Scale:      43195.5785
> > Add:        53063.4321
> > Triad:      53605.0293
> >
> > Number of MPI processes 19
> > Function      Rate (MB/s)
> > Copy:       30739.4908
> > Scale:      34280.8118
> > Add:        40710.5155
> > Triad:      43330.9503
> >
> > Number of MPI processes 20
> > Function      Rate (MB/s)
> > Copy:       37488.3777
> > Scale:      41791.8999
> > Add:        49518.9604
> > Triad:      48908.2677
> > ------------------------------------------------
> > np  speedup
> > 1 1.0
> > 2 1.54
> > 3 2.54
> > 4 3.15
> > 5 3.27
> > 6 3.02
> > 7 5.2
> > 8 4.78
> > 9 3.33
> > 10 3.84
> > 11 3.8
> > 12 4.63
> > 13 7.23
> > 14 4.24
> > 15 5.2
> > 16 3.47
> > 17 3.99
> > 18 4.52
> > 19 3.66
> > 20 4.13
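The "np speedup" column above appears to match the ratio of each Triad rate to the 1-process Triad rate. A minimal sketch that recomputes it for the first four process counts, with the rates copied from the listing above (the function name `bandwidth_speedup` is made up for illustration):

```python
# STREAMS Triad rates in MB/s, copied from the output above (1 to 4 processes).
triad = {1: 11852.5016, 2: 18303.2494, 3: 30163.7977, 4: 37377.0103}


def bandwidth_speedup(rates):
    """Memory-bandwidth speedup of each process count relative to one process."""
    base = rates[1]
    return {np_: rate / base for np_, rate in sorted(rates.items())}


speedups = bandwidth_speedup(triad)
for np_, s in speedups.items():
    print(f"{np_:2d} processes: bandwidth speedup {s:.2f}")
```

For a memory-bandwidth-bound solve this ratio is a rough upper bound on the achievable speedup, which is why it is worth comparing against the solver speedups reported elsewhere in the thread.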
> >
> >
> >
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com> wrote:
> > On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hello,
> >
> > 1) In order to understand this, we have to disentangle the various effects. First, run the STREAMS benchmark
> >
> >   make NPMAX=4 streams
> >
> > This will tell you the maximum speedup you can expect on this machine.
> >
> > 2) For these test cases, also send the output of
> >
> >   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
> >
> >   Thanks,
> >
> >      Matt
> >
> > I'm trying to improve the parallel efficiency of the GMRES solve in my CFD solver. PETSc GMRES is used to solve the linear system generated by Newton's method. To test its efficiency, I started with a very simple inviscid subsonic 3D flow as the first test case. The parallel efficiency of the GMRES solve with ASM as the preconditioner is very bad. The results are from our latest cluster. Right now, I'm only looking at the wall-clock time of the ksp_solve.
> >       1.  First I tested ASM with GMRES and ILU(0) on the subdomains; the CPU time on 2 cores is almost the same as for the serial run. Here are the options for this case:
> > -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
> > -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
> > -sub_pc_factor_fill 1.9
> > The iteration counts increase a lot for the parallel runs.
> > cores iterations      err     petsc solve wclock time speedup efficiency
> > 1     2       1.15E-04        11.95   1
> > 2     5       2.05E-02        10.5    1.01    0.50
> > 4     6       2.19E-02        7.64    1.39    0.34
> >
> >
> >
> >
> >
> >
> >
> >       2.  Then I tested ASM with ILU(0) as the preconditioner only; the CPU time on 2 cores is better than in the first test, but the speedup is still very bad. Here are the options I'm using:
> > -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0  -sub_pc_factor_fill 1.9
> > cores iterations      err     petsc solve cpu time    speedup efficiency
> > 1     10      4.54E-04        10.68   1
> > 2     11      9.55E-04        8.2     1.30    0.65
> > 4     12      3.59E-04        5.26    2.03    0.50
> >
> >
> >
> >
> >
> >
> >
> >    Those results are from a third-order "DG" scheme with a very coarse 3D mesh (480 elements). I believe I should get some speedup for this test even on this coarse mesh.
> >
> >   My question is: why does ASM with a local solve take much longer than ASM as a preconditioner only? The accuracy is also very bad. I have tested changing the ASM overlap to 2, but that makes it even worse.
> >
> >   If I use a larger mesh (~4000 elements), the 2nd case with ASM as the preconditioner gives me a better speedup, but it is still not very good.
> >
> > cores iterations      err     petsc solve cpu time    speedup efficiency
> > 1     7       1.91E-02        97.32   1
> > 2     7       2.07E-02        64.94   1.5     0.74
> > 4     7       2.61E-02        36.97   2.6     0.65
> >
> >
> > Attached is the log_summary output dumped from PETSc. Any suggestions are welcome; I really appreciate it.
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> > -- Norbert Wiener
> >
> >
> >
> 
> 
> 