[petsc-users] Parallel efficiency of the gmres solver with ASM

Thu Jun 25 17:33:33 CDT 2015

> On Jun 25, 2015, at 3:48 PM, Lei Shi <stoneszone at gmail.com> wrote:
> 
> Hi Justin,
> 
> Thanks for your suggestion. I will test it asap. 
> 
> Another thing confusing me is that the wall-clock time with 2 cores is almost the same as for the serial run when I use asm with sub_ksp_type gmres and ILU(0) on the subdomains. The serial run takes 11.95 sec and the parallel run takes 10.5 sec. There is almost no speedup at all. 

   On one process ASM is just ILU(0), so the setup time is one ILU(0) factorization of the entire matrix. On two processes, each ILU(0) is run on a matrix that is more than 1/2 the size of the full matrix, due to the overlap of 1. For small problems in particular, the overlap will pull in most of the matrix, so the setup time is not 1/2 of the setup time on one process. In addition, the number of iterations increases a good amount in going from 1 to 2 processes. In combination, this means that ASM going from one to two processes requires on each process much more than 1/2 the work of running on 1 process, so you should not expect great speedup in going from one to two processes.
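
   A rough way to see this is a toy cost model. The sketch below is not PETSc code: it assumes a hypothetical structured grid with `nx` layers cut into equal slabs, each slab extended by `overlap` layers across every interior cut, and it assumes the per-process work scales as (rows in the overlapping subdomain) x (iteration count). The value nx = 8 is made up for illustration; the iteration counts 10 and 11 are the 1-core and 2-core numbers from the ASM + ILU(0) preconditioner-only test in the original post.

```python
def subdomain_fraction(nx, nprocs, overlap):
    """Fraction of the grid held by the largest overlapping slab.

    Toy assumption: nx layers are cut into nprocs equal slabs, and each
    slab grows by `overlap` layers across every interior cut it touches.
    """
    if nprocs == 1:
        return 1.0
    base = nx / nprocs
    # With 2 slabs each one has a single interior cut; with more slabs
    # the interior slabs grow on both sides.
    sides = 1 if nprocs == 2 else 2
    return min(1.0, (base + sides * overlap) / nx)


def relative_work(nx, nprocs, overlap, its_serial, its_parallel):
    """Per-process work relative to the serial run (hypothetical model)."""
    return subdomain_fraction(nx, nprocs, overlap) * its_parallel / its_serial


# nx = 8 is an assumed grid size; 10 and 11 are iteration counts from the thread.
fraction = subdomain_fraction(nx=8, nprocs=2, overlap=1)
work = relative_work(nx=8, nprocs=2, overlap=1, its_serial=10, its_parallel=11)
print(f"subdomain fraction: {fraction:.3f}")
print(f"relative work per process: {work:.4f}")
print(f"best-case speedup under this model: {1.0 / work:.2f}")
```

   Under these assumptions each process does well over half of the serial work, so the ideal factor of 2 is out of reach before memory bandwidth or communication is even considered.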

> 
> And I found that some other people got similarly bad speedups when comparing 2 cores with 1 core. Attached is one slide from J.A. Davis's presentation, which I found on the web. As you can see, asm with 2 cores takes almost the same CPU time as with 1 core there too! Maybe I am misunderstanding something fundamental about asm.  
> 
> cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
> 1      2           1.15E-04  11.95                        1
> 2      5           2.05E-02  10.5                         1.01     0.50
> 4      6           2.19E-02  7.64                         1.39     0.34
> 
> <Screenshot - 06252015 - 03:44:53 PM.png>
> 
> 
> Sincerely Yours,
> 
> Lei Shi 
> ---------
> 
> On Thu, Jun 25, 2015 at 3:34 PM, Justin Chang <jychang48 at gmail.com> wrote:
> Hi Lei,
> 
> Depending on your machine and MPI library, you may have to use smart process to core/socket bindings to achieve better speedup. Instructions can be found here: 
> 
> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
> 
> 
> Justin
> 
> On Thu, Jun 25, 2015 at 3:24 PM, Lei Shi <stoneszone at gmail.com> wrote:
> Hi Matt,
> 
> Thanks for your suggestions. Here is the output from the STREAMS test on one node, which has 20 cores. I ran it with up to 20 processes. Attached is the dumped output with your suggested options. Really appreciate your help!!! 
> 
> Number of MPI processes 1
> Function      Rate (MB/s) 
> Copy:       13816.9372
> Scale:       8020.1809
> Add:        12762.3830
> Triad:      11852.5016
> 
> Number of MPI processes 2
> Function      Rate (MB/s) 
> Copy:       22748.7681
> Scale:      14081.4906
> Add:        18998.4516
> Triad:      18303.2494
> 
> Number of MPI processes 3
> Function      Rate (MB/s) 
> Copy:       34045.2510
> Scale:      23410.9767
> Add:        30320.2702
> Triad:      30163.7977
> 
> Number of MPI processes 4
> Function      Rate (MB/s) 
> Copy:       36875.5349
> Scale:      29440.1694
> Add:        36971.1860
> Triad:      37377.0103
> 
> Number of MPI processes 5
> Function      Rate (MB/s) 
> Copy:       32272.8763
> Scale:      30316.3435
> Add:        38022.0193
> Triad:      38815.4830
> 
> Number of MPI processes 6
> Function      Rate (MB/s) 
> Copy:       35619.8925
> Scale:      34457.5078
> Add:        41419.3722
> Triad:      35825.3621
> 
> Number of MPI processes 7
> Function      Rate (MB/s) 
> Copy:       55284.2420
> Scale:      47706.8009
> Add:        59076.4735
> Triad:      61680.5559
> 
> Number of MPI processes 8
> Function      Rate (MB/s) 
> Copy:       44525.8901
> Scale:      48949.9599
> Add:        57437.7784
> Triad:      56671.0593
> 
> Number of MPI processes 9
> Function      Rate (MB/s) 
> Copy:       34375.7364
> Scale:      29507.5293
> Add:        45405.3120
> Triad:      39518.7559
> 
> Number of MPI processes 10
> Function      Rate (MB/s) 
> Copy:       34278.0415
> Scale:      41721.7843
> Add:        46642.2465
> Triad:      45454.7000
> 
> Number of MPI processes 11
> Function      Rate (MB/s) 
> Copy:       38093.7244
> Scale:      35147.2412
> Add:        45047.0853
> Triad:      44983.2013
> 
> Number of MPI processes 12
> Function      Rate (MB/s) 
> Copy:       39750.8760
> Scale:      52038.0631
> Add:        55552.9503
> Triad:      54884.3839
> 
> Number of MPI processes 13
> Function      Rate (MB/s) 
> Copy:       60839.0248
> Scale:      74143.7458
> Add:        85545.3135
> Triad:      85667.6551
> 
> Number of MPI processes 14
> Function      Rate (MB/s) 
> Copy:       37766.2343
> Scale:      40279.1928
> Add:        49992.8572
> Triad:      50303.4809
> 
> Number of MPI processes 15
> Function      Rate (MB/s) 
> Copy:       49762.3670
> Scale:      59077.8251
> Add:        60407.9651
> Triad:      61691.9456
> 
> Number of MPI processes 16
> Function      Rate (MB/s) 
> Copy:       31996.7169
> Scale:      36962.4860
> Add:        40183.5060
> Triad:      41096.0512
> 
> Number of MPI processes 17
> Function      Rate (MB/s) 
> Copy:       36348.3839
> Scale:      39108.6761
> Add:        46853.4476
> Triad:      47266.1778
> 
> Number of MPI processes 18
> Function      Rate (MB/s) 
> Copy:       40438.7558
> Scale:      43195.5785
> Add:        53063.4321
> Triad:      53605.0293
> 
> Number of MPI processes 19
> Function      Rate (MB/s) 
> Copy:       30739.4908
> Scale:      34280.8118
> Add:        40710.5155
> Triad:      43330.9503
> 
> Number of MPI processes 20
> Function      Rate (MB/s) 
> Copy:       37488.3777
> Scale:      41791.8999
> Add:        49518.9604
> Triad:      48908.2677
> ------------------------------------------------
> np  speedup
> 1 1.0
> 2 1.54
> 3 2.54
> 4 3.15
> 5 3.27
> 6 3.02
> 7 5.2
> 8 4.78
> 9 3.33
> 10 3.84
> 11 3.8
> 12 4.63
> 13 7.23
> 14 4.24
> 15 5.2
> 16 3.47
> 17 3.99
> 18 4.52
> 19 3.66
> 20 4.13
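
For the rows checked here, the speedup column above matches the ratio of each Triad rate to the 1-process Triad rate. A minimal sketch of that calculation, using Triad values copied from the output above for a few process counts (the function name is only illustrative):

```python
# Triad rates in MB/s, copied from the STREAMS output for selected process counts.
triad_mb_s = {
    1: 11852.5016,
    2: 18303.2494,
    3: 30163.7977,
    4: 37377.0103,
    13: 85667.6551,
    20: 48908.2677,
}


def stream_speedup(rates):
    """Return {nprocs: rate / rate_on_one_process}."""
    base = rates[1]
    return {nprocs: rate / base for nprocs, rate in sorted(rates.items())}


speedups = stream_speedup(triad_mb_s)
for nprocs, s in speedups.items():
    # Efficiency is the speedup divided by the number of processes.
    print(f"{nprocs:2d}  speedup {s:5.2f}  efficiency {s / nprocs:4.2f}")
```

Since this ratio measures how memory bandwidth scales, it is a useful upper bound to compare against the solver speedups for bandwidth-limited kernels.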
> 
> Sincerely Yours,
> 
> Lei Shi 
> ---------
> 
> On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com> wrote:
> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
> Hello,
> 
> 1) In order to understand this, we have to disentangle the various effects. First, run the STREAMS benchmark
> 
>   make NPMAX=4 streams
> 
> This will tell you the maximum speedup you can expect on this machine.
> 
> 2) For these test cases, also send the output of
> 
>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
> 
>   Thanks,
> 
>      Matt
>  
> I'm trying to improve the parallel efficiency of the gmres solve in my CFD solver, where PETSc gmres is used to solve the linear system generated by Newton's method. To test its efficiency, I started with a very simple inviscid subsonic 3D flow as the first test case. The parallel efficiency of the gmres solve with asm as the preconditioner is very bad. The results are from our latest cluster. Right now, I'm only looking at the wall-clock time of the ksp_solve.
> 1. First I tested ASM with gmres and ILU(0) on the subdomains; the CPU time on 2 cores is almost the same as for the serial run. Here are the options for this case:
> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50 
> -ksp_gmres_restart 30 -ksp_pc_side right
> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0 
> -sub_pc_factor_fill 1.9
> The iteration count increases a lot for the parallel runs.
> cores  iterations  err       petsc solve wclock time (s)  speedup  efficiency
> 1      2           1.15E-04  11.95                        1
> 2      5           2.05E-02  10.5                         1.01     0.50
> 4      6           2.19E-02  7.64                         1.39     0.34
> 
> 2. Then I tested ASM with ILU(0) as the preconditioner only; the CPU time on 2 cores is better than in the first test, but the speedup is still very bad. Here are the options I'm using:
> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50 
> -ksp_gmres_restart 30 -ksp_pc_side right
> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0  -sub_pc_factor_fill 1.9
> cores  iterations  err       petsc solve cpu time  speedup  efficiency
> 1      10          4.54E-04  10.68                 1
> 2      11          9.55E-04  8.2                   1.30     0.65
> 4      12          3.59E-04  5.26                  2.03     0.50
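
The speedup and efficiency columns in this table follow from the measured times as speedup = T(1) / T(p) and efficiency = speedup / p. A short sketch using the times from the table above (the helper name is only illustrative):

```python
def speedup_and_efficiency(times):
    """Map {cores: time} to {cores: (speedup, efficiency)} relative to 1 core."""
    t1 = times[1]
    return {p: (t1 / t, t1 / t / p) for p, t in sorted(times.items())}


# Solve times from the ASM + ILU(0) preconditioner-only test (480 elements).
times = {1: 10.68, 2: 8.2, 4: 5.26}
results = speedup_and_efficiency(times)
for cores, (speedup, efficiency) in results.items():
    print(f"{cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

For comparison, the STREAMS run quoted in this thread reported a speedup of 1.54 on 2 processes and 3.15 on 4.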
> 
>    Those results are from a third-order "DG" scheme with a very coarse 3D mesh (480 elements). I believe I should get some speedup for this test even on this coarse mesh. 
> 
>   My question is: why does ASM with a local solve take much longer than ASM as a preconditioner only? The accuracy is also very bad. I have tested changing the overlap of ASM to 2, but that makes it even worse.
> 
>   If I use a larger mesh (~4000 elements), the 2nd case with ASM as the preconditioner gives me a better speedup, but still not a very good one. 
> 
> cores  iterations  err       petsc solve cpu time  speedup  efficiency
> 1      7           1.91E-02  97.32                 1
> 2      7           2.07E-02  64.94                 1.5      0.74
> 4      7           2.61E-02  36.97                 2.6      0.65
> 
> 
> Attached is the log_summary output dumped from PETSc; any suggestions are welcome. I really appreciate it.
> 
> 
> Sincerely Yours,
> 
> Lei Shi 
> ---------
> 
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
> 
> 
>