[petsc-users] Parallel efficiency of the gmres solver with ASM
Lei Shi
stoneszone at gmail.com
Thu Jun 25 21:25:01 CDT 2015
Barry,
Thanks a lot for your reply. Your explanation helps me understand my test
results. So in this case, to compute the speedup for a strong scalability
test, I should use the wall-clock time with multiple cores as the reference
time instead of the serial run time?
e.g., for computing the speedup on 16 cores, I should use

  speedup = (4 x wclock_4cores) / wclock_16cores

instead of

  speedup = wclock_1core / wclock_16cores
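In general form (writing p for the core count, p0 for the reference core
count, and T_p for the wall-clock time on p cores), the two quantities I am
computing are, as I understand it,

  speedup(p) = (p0 x T_p0) / T_p        efficiency(p) = speedup(p) / p

so the serial-referenced numbers correspond to p0 = 1 and the 4-core-referenced
ones to p0 = 4.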
Another question: when I use ASM as a preconditioner only, the speedup on
2 cores is much better than in the case using ASM with a local GMRES solve
(-sub_ksp_type gmres).
-ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
-ksp_gmres_restart 30 -ksp_pc_side right
-pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0 -sub_pc_factor_fill 1.9
cores  iterations  err       petsc solve cpu time  speedup  efficiency
1      10          4.54E-04  10.68                 1
2      11          9.55E-04  8.2                   1.30     0.65
4      12          3.59E-04  5.26                  2.03     0.50
What are the main differences between those two? Thanks.
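For reference, a minimal sketch of the two option sets I am comparing (the
only intended difference is the subdomain solver; when -sub_ksp_type is not
given, ASM defaults to preonly, i.e. a single application of the subdomain
preconditioner):

  # Case A: ASM as a preconditioner only (one ILU(0) sweep per subdomain)
  -ksp_type gmres -ksp_gmres_restart 30 -ksp_pc_side right
  -pc_type asm -sub_ksp_type preonly -sub_pc_type ilu -sub_pc_factor_levels 0

  # Case B: ASM with an inner GMRES solve on each subdomain
  -ksp_type gmres -ksp_gmres_restart 30 -ksp_pc_side right
  -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_max_it 1000
  -sub_pc_type ilu -sub_pc_factor_levels 0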
Would you please take a look at my profiling data? Do you think this is the
best parallel efficiency I can get from PETSc? How can I improve it?
Best,
Lei Shi
Sincerely Yours,
Lei Shi
---------
On Thu, Jun 25, 2015 at 5:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> > On Jun 25, 2015, at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:
> >
> > Hi Justin,
> >
> > Thanks for your suggestion. I will test it asap.
> >
> > Another thing confusing me is that the wclock time with 2 cores is almost
> the same as the serial run when I use asm with sub_ksp_type gmres and ilu0 on
> subdomains. The serial run takes 11.95 sec and the parallel run takes 10.5
> sec. There is almost no speedup at all.
>
>    On one process ASM is ILU(0), so the setup time is one ILU(0)
> factorization of the entire matrix. On two processes the ILU(0) is run on a
> matrix that is more than 1/2 the size of the full matrix, due to the overlap
> of 1. In particular, for small problems the overlap will pull in most of the
> matrix, so the setup time is not 1/2 of the setup time on one process. Then
> the number of iterations increases a good amount in going from 1 to 2
> processes. In combination this means that ASM, in going from one to two
> processes, requires on each process much more than 1/2 the work of running
> on 1 process, so you should not expect great speedup in going from one to
> two processes.
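A minimal sketch for experimenting with this overlap effect (the default
overlap for -pc_type asm is 1; -pc_asm_overlap controls it, and -log_summary
should show how PCSetUp, PCApply and the iteration counts change):

  -pc_type asm -pc_asm_overlap 0 -sub_pc_type ilu -sub_pc_factor_levels 0 -log_summary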
>
> >
> > And I found some other people got similarly bad speedups when comparing 2
> cores with 1 core. Attached is one slide from J.A. Davis's presentation; I
> just found it on the web. As you can see, ASM with 2 cores takes almost the
> same cpu time as 1 core there too! Maybe I am misunderstanding some
> fundamental things related to ASM.
> >
> > cores iterations err petsc solve wclock time speedup efficiency
> > 1 2 1.15E-04 11.95 1
> > 2 5 2.05E-02 10.5 1.01 0.50
> > 4 6 2.19E-02 7.64 1.39 0.34
> >
> >
> >
> >
> >
> >
> >
> > <Screenshot - 06252015 - 03:44:53 PM.png>
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com>
> wrote:
> > Hi Lei,
> >
> > Depending on your machine and MPI library, you may have to use smart
> process-to-core/socket bindings to achieve better speedup. Instructions can
> be found here:
> >
> > http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
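A hedged sketch of what such bindings can look like (the exact flags depend
on the MPI implementation and version, and ./mysolver is just a placeholder
for the application binary):

  # Open MPI
  mpiexec -n 4 --bind-to core --map-by socket ./mysolver <petsc options>
  # MPICH (Hydra)
  mpiexec -n 4 -bind-to core ./mysolver <petsc options>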
> >
> >
> > Justin
> >
> > On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hi Matt,
> >
> > Thanks for your suggestions. Here is the output from the STREAMS test on
> one node, which has 20 cores. I ran it with up to 20 processes. Attached is
> the output dumped with your suggested options. Really appreciate your help!!!
> >
> > Number of MPI processes 1
> > Function Rate (MB/s)
> > Copy: 13816.9372
> > Scale: 8020.1809
> > Add: 12762.3830
> > Triad: 11852.5016
> >
> > Number of MPI processes 2
> > Function Rate (MB/s)
> > Copy: 22748.7681
> > Scale: 14081.4906
> > Add: 18998.4516
> > Triad: 18303.2494
> >
> > Number of MPI processes 3
> > Function Rate (MB/s)
> > Copy: 34045.2510
> > Scale: 23410.9767
> > Add: 30320.2702
> > Triad: 30163.7977
> >
> > Number of MPI processes 4
> > Function Rate (MB/s)
> > Copy: 36875.5349
> > Scale: 29440.1694
> > Add: 36971.1860
> > Triad: 37377.0103
> >
> > Number of MPI processes 5
> > Function Rate (MB/s)
> > Copy: 32272.8763
> > Scale: 30316.3435
> > Add: 38022.0193
> > Triad: 38815.4830
> >
> > Number of MPI processes 6
> > Function Rate (MB/s)
> > Copy: 35619.8925
> > Scale: 34457.5078
> > Add: 41419.3722
> > Triad: 35825.3621
> >
> > Number of MPI processes 7
> > Function Rate (MB/s)
> > Copy: 55284.2420
> > Scale: 47706.8009
> > Add: 59076.4735
> > Triad: 61680.5559
> >
> > Number of MPI processes 8
> > Function Rate (MB/s)
> > Copy: 44525.8901
> > Scale: 48949.9599
> > Add: 57437.7784
> > Triad: 56671.0593
> >
> > Number of MPI processes 9
> > Function Rate (MB/s)
> > Copy: 34375.7364
> > Scale: 29507.5293
> > Add: 45405.3120
> > Triad: 39518.7559
> >
> > Number of MPI processes 10
> > Function Rate (MB/s)
> > Copy: 34278.0415
> > Scale: 41721.7843
> > Add: 46642.2465
> > Triad: 45454.7000
> >
> > Number of MPI processes 11
> > Function Rate (MB/s)
> > Copy: 38093.7244
> > Scale: 35147.2412
> > Add: 45047.0853
> > Triad: 44983.2013
> >
> > Number of MPI processes 12
> > Function Rate (MB/s)
> > Copy: 39750.8760
> > Scale: 52038.0631
> > Add: 55552.9503
> > Triad: 54884.3839
> >
> > Number of MPI processes 13
> > Function Rate (MB/s)
> > Copy: 60839.0248
> > Scale: 74143.7458
> > Add: 85545.3135
> > Triad: 85667.6551
> >
> > Number of MPI processes 14
> > Function Rate (MB/s)
> > Copy: 37766.2343
> > Scale: 40279.1928
> > Add: 49992.8572
> > Triad: 50303.4809
> >
> > Number of MPI processes 15
> > Function Rate (MB/s)
> > Copy: 49762.3670
> > Scale: 59077.8251
> > Add: 60407.9651
> > Triad: 61691.9456
> >
> > Number of MPI processes 16
> > Function Rate (MB/s)
> > Copy: 31996.7169
> > Scale: 36962.4860
> > Add: 40183.5060
> > Triad: 41096.0512
> >
> > Number of MPI processes 17
> > Function Rate (MB/s)
> > Copy: 36348.3839
> > Scale: 39108.6761
> > Add: 46853.4476
> > Triad: 47266.1778
> >
> > Number of MPI processes 18
> > Function Rate (MB/s)
> > Copy: 40438.7558
> > Scale: 43195.5785
> > Add: 53063.4321
> > Triad: 53605.0293
> >
> > Number of MPI processes 19
> > Function Rate (MB/s)
> > Copy: 30739.4908
> > Scale: 34280.8118
> > Add: 40710.5155
> > Triad: 43330.9503
> >
> > Number of MPI processes 20
> > Function Rate (MB/s)
> > Copy: 37488.3777
> > Scale: 41791.8999
> > Add: 49518.9604
> > Triad: 48908.2677
> > ------------------------------------------------
> > np speedup
> > 1 1.0
> > 2 1.54
> > 3 2.54
> > 4 3.15
> > 5 3.27
> > 6 3.02
> > 7 5.2
> > 8 4.78
> > 9 3.33
> > 10 3.84
> > 11 3.8
> > 12 4.63
> > 13 7.23
> > 14 4.24
> > 15 5.2
> > 16 3.47
> > 17 3.99
> > 18 4.52
> > 19 3.66
> > 20 4.13
> >
> >
> >
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> > On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com>
> wrote:
> > On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
> > Hello,
> >
> > 1) In order to understand this, we have to disentangle the various
> effects. First, run the STREAMS benchmark
> >
> > make NPMAX=4 streams
> >
> > This will tell you the maximum speedup you can expect on this machine.
> >
> > 2) For these test cases, also send the output of
> >
> > -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
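For example (a hedged sketch; the executable name and process count are
placeholders, and the diagnostic options are simply appended to the existing
solver options):

  mpiexec -n 4 ./mysolver -ksp_type gmres -pc_type asm -sub_pc_type ilu \
    -ksp_view -ksp_converged_reason -ksp_monitor_true_residual -log_summary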
> >
> > Thanks,
> >
> > Matt
> >
> > I'm trying to improve the parallel efficiency of the gmres solve in my
> CFD solver, where Petsc gmres is used to solve the linear system generated
> by Newton's method. To test its efficiency, I started with a very simple
> inviscid subsonic 3D flow as the first test case. The parallel efficiency of
> the gmres solve with asm as the preconditioner is very bad. The results are
> from our latest cluster. Right now, I'm only looking at the wclock time of
> the ksp_solve.
> > • First I tested ASM with gmres and ilu(0) on the subdomains; the
> cpu time on 2 cores is almost the same as for the serial run. Here are the
> options for this case:
> > -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
> > -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
> > -sub_pc_factor_fill 1.9
> > The iteration count increases a lot for the parallel runs.
> > cores iterations err petsc solve wclock time speedup efficiency
> > 1 2 1.15E-04 11.95 1
> > 2 5 2.05E-02 10.5 1.01 0.50
> > 4 6 2.19E-02 7.64 1.39 0.34
> >
> >
> >
> >
> >
> >
> >
> > 2. Then I tested ASM with ilu(0) as the preconditioner only; the
> cpu time on 2 cores is better than in the 1st test, but the speedup is still
> very bad. Here are the options I'm using:
> > -ksp_type gmres -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
> > -ksp_gmres_restart 30 -ksp_pc_side right
> > -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
> -sub_pc_factor_fill 1.9
> > cores iterations err petsc solve cpu time speedup efficiency
> > 1 10 4.54E-04 10.68 1
> > 2 11 9.55E-04 8.2 1.30 0.65
> > 4 12 3.59E-04 5.26 2.03 0.50
> >
> >
> >
> >
> >
> >
> >
> > Those results are from a third-order DG scheme with a very coarse
> 3D mesh (480 elements). I believe I should get some speedup for this test
> even on this coarse mesh.
> >
> > My question is: why does ASM with a local solve take much longer
> than ASM as a preconditioner only? Also the accuracy is very bad. I have
> tested changing the overlap of ASM to 2, but that makes it even worse.
> >
> > If I use a larger mesh (~4000 elements), the 2nd case with ASM as the
> preconditioner gives me a better speedup, but it is still not very good.
> >
> > cores iterations err petsc solve cpu time speedup efficiency
> > 1 7 1.91E-02 97.32 1
> > 2 7 2.07E-02 64.94 1.5 0.74
> > 4 7 2.61E-02 36.97 2.6 0.65
> >
> >
> > Attached are the log_summary outputs dumped from petsc; any suggestions
> are welcome. I really appreciate it.
> >
> >
> > Sincerely Yours,
> >
> > Lei Shi
> > ---------
> >
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> > -- Norbert Wiener
> >
> >
> >
>
>