[petsc-users] Parallel efficiency of the gmres solver with ASM
Lei Shi
stoneszone at gmail.com
Thu Jun 25 21:25:01 CDT 2015
Barry,
Thanks a lot for your reply. Your explanation helps me understand my test
results. So in this case, to compute the speedup for a strong scalability
test, I should use the wall-clock time with multiple cores as the reference
time instead of the serial run time?
e.g., for computing the speedup on 16 cores, I should use

  speedup = (4 x wclock_4cores) / wclock_16cores

instead of

  speedup = wclock_1core / wclock_16cores
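In general form (writing p for the core count, p0 for the reference core
count, and T_p for the wall-clock time on p cores), the two quantities I am
computing are, as I understand it,

  speedup(p) = (p0 x T_p0) / T_p        efficiency(p) = speedup(p) / p

so the serial-referenced numbers correspond to p0 = 1 and the 4-core-referenced
ones to p0 = 4.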
Another question: when I use ASM as a preconditioner only, the speedup on
2 cores is much better than in the case using ASM with a local GMRES solve
(-sub_ksp_type gmres).
-ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
-ksp_gmres_restart 30 -ksp_pc_side right
-pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0 -sub_pc_factor_fill 1.9
cores  iterations  err       petsc solve cpu time  speedup  efficiency
1      10          4.54E-04  10.68                 1
2      11          9.55E-04  8.2                   1.30     0.65
4      12          3.59E-04  5.26                  2.03     0.50
What are the main differences between those two? Thanks.
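For reference, a minimal sketch of the two option sets I am comparing (the
only intended difference is the subdomain solver; when -sub_ksp_type is not
given, ASM defaults to preonly, i.e. a single application of the subdomain
preconditioner):

  # Case A: ASM as a preconditioner only (one ILU(0) sweep per subdomain)
  -ksp_type gmres -ksp_gmres_restart 30 -ksp_pc_side right
  -pc_type asm -sub_ksp_type preonly -sub_pc_type ilu -sub_pc_factor_levels 0

  # Case B: ASM with an inner GMRES solve on each subdomain
  -ksp_type gmres -ksp_gmres_restart 30 -ksp_pc_side right
  -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_max_it 1000
  -sub_pc_type ilu -sub_pc_factor_levels 0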
Would you please take a look at my profiling data? Do you think this is the
best parallel efficiency I can get from PETSc? How can I improve it?
Best,
Lei Shi
Sincerely Yours,
Lei Shi
---------
On Thu, Jun 25, 2015 at 5:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> > On Jun 25, 2015, at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:
> >
> > Hi Justin,
> >
> > Thanks for your suggestion. I will test it asap.
> >
> > Another thing confusing me is that the wclock time with 2 cores is almost
> the same as the serial run when I use asm with sub_ksp_type gmres and ilu0 on
> subdomains. The serial run takes 11.95 sec and the parallel run takes 10.5
> sec. There is almost no speedup at all.
>
>    On one process ASM is ILU(0), so the setup time is one ILU(0)
> factorization of the entire matrix. On two processes the ILU(0) is run on a
> matrix that is more than 1/2 the size of the full matrix, due to the overlap
> of 1. In particular, for small problems the overlap will pull in most of the
> matrix, so the setup time is not 1/2 of the setup time on one process. Then
> the number of iterations increases a good amount in going from 1 to 2
> processes. In combination this means that ASM, in going from one to two
> processes, requires on each process much more than 1/2 the work of running
> on 1 process, so you should not expect great speedup in going from one to
> two processes.
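A minimal sketch for experimenting with this overlap effect (the default
overlap for -pc_type asm is 1; -pc_asm_overlap controls it, and -log_summary
should show how PCSetUp, PCApply and the iteration counts change):

  -pc_type asm -pc_asm_overlap 0 -sub_pc_type ilu -sub_pc_factor_levels 0 -log_summary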
>
> >
> > And I found some other people got similarly bad speedups when comparing 2
> cores with 1 core. Attached is one slide from J.A. Davis's presentation; I
> just found it on the web. As you can see, ASM with 2 cores takes almost the
> same cpu time as 1 core there too! Maybe I am misunderstanding some
> fundamental things related to ASM.
> >
> > cores iterations err petsc solve wclock time speedup efficiency
> > 1 2 1.15E-04 11.95 1
> > 2 5 2.05E-02 10.5 1.01 0.50
> > 4 6 2.19E-02 7.64 1.39 0.34
> >
> >
> >
> >
> >
> >
> >
> > <Screenshot - 06252015 - 03:44:53 PM.png>
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com>
> wrote:
> > Hi Lei,
> >
> > Depending on your machine and MPI library, you may have to use smart
> process-to-core/socket bindings to achieve better speedup. Instructions can
> be found here:
> >
> > http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
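A hedged sketch of what such bindings can look like (the exact flags depend
on the MPI implementation and version, and ./mysolver is just a placeholder
for the application binary):

  # Open MPI
  mpiexec -n 4 --bind-to core --map-by socket ./mysolver <petsc options>
  # MPICH (Hydra)
  mpiexec -n 4 -bind-to core ./mysolver <petsc options>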
> >
> >
> > Justin
> >
> > On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hi Matt,
> >
> > Thanks for your suggestions. Here is the output from the STREAMS test on
> one node, which has 20 cores. I ran it with up to 20 processes. Attached is
> the output dumped with your suggested options. Really appreciate your help!!!
> >
> > Number of MPI processes 1
> > Function Rate (MB/s)
> > Copy: 13816.9372
> > Scale: 8020.1809
> > Add: 12762.3830
> > Triad: 11852.5016
> >
> > Number of MPI processes 2
> > Function Rate (MB/s)
> > Copy: 22748.7681
> > Scale: 14081.4906
> > Add: 18998.4516
> > Triad: 18303.2494
> >
> > Number of MPI processes 3
> > Function Rate (MB/s)
> > Copy: 34045.2510
> > Scale: 23410.9767
> > Add: 30320.2702
> > Triad: 30163.7977
> >
> > Number of MPI processes 4
> > Function Rate (MB/s)
> > Copy: 36875.5349
> > Scale: 29440.1694
> > Add: 36971.1860
> > Triad: 37377.0103
> >
> > Number of MPI processes 5
> > Function Rate (MB/s)
> > Copy: 32272.8763
> > Scale: 30316.3435
> > Add: 38022.0193
> > Triad: 38815.4830
> >
> > Number of MPI processes 6
> > Function Rate (MB/s)
> > Copy: 35619.8925
> > Scale: 34457.5078
> > Add: 41419.3722
> > Triad: 35825.3621
> >
> > Number of MPI processes 7
> > Function Rate (MB/s)
> > Copy: 55284.2420
> > Scale: 47706.8009
> > Add: 59076.4735
> > Triad: 61680.5559
> >
> > Number of MPI processes 8
> > Function Rate (MB/s)
> > Copy: 44525.8901
> > Scale: 48949.9599
> > Add: 57437.7784
> > Triad: 56671.0593
> >
> > Number of MPI processes 9
> > Function Rate (MB/s)
> > Copy: 34375.7364
> > Scale: 29507.5293
> > Add: 45405.3120
> > Triad: 39518.7559
> >
> > Number of MPI processes 10
> > Function Rate (MB/s)
> > Copy: 34278.0415
> > Scale: 41721.7843
> > Add: 46642.2465
> > Triad: 45454.7000
> >
> > Number of MPI processes 11
> > Function Rate (MB/s)
> > Copy: 38093.7244
> > Scale: 35147.2412
> > Add: 45047.0853
> > Triad: 44983.2013
> >
> > Number of MPI processes 12
> > Function Rate (MB/s)
> > Copy: 39750.8760
> > Scale: 52038.0631
> > Add: 55552.9503
> > Triad: 54884.3839
> >
> > Number of MPI processes 13
> > Function Rate (MB/s)
> > Copy: 60839.0248
> > Scale: 74143.7458
> > Add: 85545.3135
> > Triad: 85667.6551
> >
> > Number of MPI processes 14
> > Function Rate (MB/s)
> > Copy: 37766.2343
> > Scale: 40279.1928
> > Add: 49992.8572
> > Triad: 50303.4809
> >
> > Number of MPI processes 15
> > Function Rate (MB/s)
> > Copy: 49762.3670
> > Scale: 59077.8251
> > Add: 60407.9651
> > Triad: 61691.9456
> >
> > Number of MPI processes 16
> > Function Rate (MB/s)
> > Copy: 31996.7169
> > Scale: 36962.4860
> > Add: 40183.5060
> > Triad: 41096.0512
> >
> > Number of MPI processes 17
> > Function Rate (MB/s)
> > Copy: 36348.3839
> > Scale: 39108.6761
> > Add: 46853.4476
> > Triad: 47266.1778
> >
> > Number of MPI processes 18
> > Function Rate (MB/s)
> > Copy: 40438.7558
> > Scale: 43195.5785
> > Add: 53063.4321
> > Triad: 53605.0293
> >
> > Number of MPI processes 19
> > Function Rate (MB/s)
> > Copy: 30739.4908
> > Scale: 34280.8118
> > Add: 40710.5155
> > Triad: 43330.9503
> >
> > Number of MPI processes 20
> > Function Rate (MB/s)
> > Copy: 37488.3777
> > Scale: 41791.8999
> > Add: 49518.9604
> > Triad: 48908.2677
> > ------------------------------------------------
> > np speedup
> > 1 1.0
> > 2 1.54
> > 3 2.54
> > 4 3.15
> > 5 3.27
> > 6 3.02
> > 7 5.2
> > 8 4.78
> > 9 3.33
> > 10 3.84
> > 11 3.8
> > 12 4.63
> > 13 7.23
> > 14 4.24
> > 15 5.2
> > 16 3.47
> > 17 3.99
> > 18 4.52
> > 19 3.66
> > 20 4.13
> >
> >
> >
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
> wrote:
> > On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hello,
> >
> > 1) In order to understand this, we have to disentangle the various
> effects. First, run the STREAMS benchmark
> >
> > make NPMAX=4 streams
> >
> > This will tell you the maximum speedup you can expect on this machine.
> >
> > 2) For these test cases, also send the output of
> >
> > -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
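For example (a hedged sketch; the executable name and process count are
placeholders, and the diagnostic options are simply appended to the existing
solver options):

  mpiexec -n 4 ./mysolver -ksp_type gmres -pc_type asm -sub_pc_type ilu \
    -ksp_view -ksp_converged_reason -ksp_monitor_true_residual -log_summary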
> >
> > Thanks,
> >
> > Matt
> >
> > I'm trying to improve the parallel efficiency of the gmres solve in my
> CFD solver, where Petsc gmres is used to solve the linear system generated
> by Newton's method. To test its efficiency, I started with a very simple
> inviscid subsonic 3D flow as the first test case. The parallel efficiency of
> the gmres solve with asm as the preconditioner is very bad. The results are
> from our latest cluster. Right now, I'm only looking at the wclock time of
> the ksp_solve.
> > • First I tested ASM with gmres and ilu(0) on the subdomains; the
> cpu time on 2 cores is almost the same as for the serial run. Here are the
> options for this case:
> > -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
> > -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
> > -sub_pc_factor_fill 1.9
> > The iteration count increases a lot for the parallel runs.
> > cores iterations err petsc solve wclock time speedup efficiency
> > 1 2 1.15E-04 11.95 1
> > 2 5 2.05E-02 10.5 1.01 0.50
> > 4 6 2.19E-02 7.64 1.39 0.34
> >
> >
> >
> >
> >
> >
> >
> > 2. Then I tested ASM with ilu(0) as the preconditioner only; the
> cpu time on 2 cores is better than in the 1st test, but the speedup is still
> very bad. Here are the options I'm using:
> > -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
> -sub_pc_factor_fill 1.9
> > cores iterations err petsc solve cpu time speedup efficiency
> > 1 10 4.54E-04 10.68 1
> > 2 11 9.55E-04 8.2 1.30 0.65
> > 4 12 3.59E-04 5.26 2.03 0.50
> >
> >
> >
> >
> >
> >
> >
> > Those results are from a third-order DG scheme with a very coarse
> 3D mesh (480 elements). I believe I should get some speedup for this test
> even on this coarse mesh.
> >
> > My question is: why does ASM with a local solve take much longer
> than ASM as a preconditioner only? Also the accuracy is very bad. I have
> tested changing the overlap of ASM to 2, but that makes it even worse.
> >
> > If I use a larger mesh (~4000 elements), the 2nd case with ASM as the
> preconditioner gives me a better speedup, but it is still not very good.
> >
> > cores iterations err petsc solve cpu time speedup efficiency
> > 1 7 1.91E-02 97.32 1
> > 2 7 2.07E-02 64.94 1.5 0.74
> > 4 7 2.61E-02 36.97 2.6 0.65
> >
> >
> > Attached are the log_summary outputs dumped from petsc; any suggestions
> are welcome. I really appreciate it.
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> > -- Norbert Wiener
> >
> >
> >
>
>