[petsc-users] Parallel efficiency of the gmres solver with ASM

Lei Shi stoneszone at gmail.com
Thu Jun 25 15:24:29 CDT 2015


Hi Matt,

Thanks for your suggestions. Here is the output of the STREAMS test on one
node, which has 20 cores. I ran it with 1 to 20 MPI processes. Attached is the
output dumped with your suggested options. I really appreciate your help!

Number of MPI processes 1
Function      Rate (MB/s)
Copy:       13816.9372
Scale:       8020.1809
Add:        12762.3830
Triad:      11852.5016

Number of MPI processes 2
Function      Rate (MB/s)
Copy:       22748.7681
Scale:      14081.4906
Add:        18998.4516
Triad:      18303.2494

Number of MPI processes 3
Function      Rate (MB/s)
Copy:       34045.2510
Scale:      23410.9767
Add:        30320.2702
Triad:      30163.7977

Number of MPI processes 4
Function      Rate (MB/s)
Copy:       36875.5349
Scale:      29440.1694
Add:        36971.1860
Triad:      37377.0103

Number of MPI processes 5
Function      Rate (MB/s)
Copy:       32272.8763
Scale:      30316.3435
Add:        38022.0193
Triad:      38815.4830

Number of MPI processes 6
Function      Rate (MB/s)
Copy:       35619.8925
Scale:      34457.5078
Add:        41419.3722
Triad:      35825.3621

Number of MPI processes 7
Function      Rate (MB/s)
Copy:       55284.2420
Scale:      47706.8009
Add:        59076.4735
Triad:      61680.5559

Number of MPI processes 8
Function      Rate (MB/s)
Copy:       44525.8901
Scale:      48949.9599
Add:        57437.7784
Triad:      56671.0593

Number of MPI processes 9
Function      Rate (MB/s)
Copy:       34375.7364
Scale:      29507.5293
Add:        45405.3120
Triad:      39518.7559

Number of MPI processes 10
Function      Rate (MB/s)
Copy:       34278.0415
Scale:      41721.7843
Add:        46642.2465
Triad:      45454.7000

Number of MPI processes 11
Function      Rate (MB/s)
Copy:       38093.7244
Scale:      35147.2412
Add:        45047.0853
Triad:      44983.2013

Number of MPI processes 12
Function      Rate (MB/s)
Copy:       39750.8760
Scale:      52038.0631
Add:        55552.9503
Triad:      54884.3839

Number of MPI processes 13
Function      Rate (MB/s)
Copy:       60839.0248
Scale:      74143.7458
Add:        85545.3135
Triad:      85667.6551

Number of MPI processes 14
Function      Rate (MB/s)
Copy:       37766.2343
Scale:      40279.1928
Add:        49992.8572
Triad:      50303.4809

Number of MPI processes 15
Function      Rate (MB/s)
Copy:       49762.3670
Scale:      59077.8251
Add:        60407.9651
Triad:      61691.9456

Number of MPI processes 16
Function      Rate (MB/s)
Copy:       31996.7169
Scale:      36962.4860
Add:        40183.5060
Triad:      41096.0512

Number of MPI processes 17
Function      Rate (MB/s)
Copy:       36348.3839
Scale:      39108.6761
Add:        46853.4476
Triad:      47266.1778

Number of MPI processes 18
Function      Rate (MB/s)
Copy:       40438.7558
Scale:      43195.5785
Add:        53063.4321
Triad:      53605.0293

Number of MPI processes 19
Function      Rate (MB/s)
Copy:       30739.4908
Scale:      34280.8118
Add:        40710.5155
Triad:      43330.9503

Number of MPI processes 20
Function      Rate (MB/s)
Copy:       37488.3777
Scale:      41791.8999
Add:        49518.9604
Triad:      48908.2677
------------------------------------------------
np  speedup
1 1.0
2 1.54
3 2.54
4 3.15
5 3.27
6 3.02
7 5.2
8 4.78
9 3.33
10 3.84
11 3.8
12 4.63
13 7.23
14 4.24
15 5.2
16 3.47
17 3.99
18 4.52
19 3.66
20 4.13
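
For reference, the speedup column above appears to be just the n-process Triad
rate divided by the single-process Triad rate (e.g. 18303.2/11852.5 ~ 1.54 for
2 processes). A small C sketch of that calculation, with the Triad rates
hard-coded from the output above:

#include <stdio.h>

int main(void)
{
  /* Triad rates (MB/s) copied from the STREAMS output above; index 0 = 1 process */
  const double triad[20] = {
    11852.5016, 18303.2494, 30163.7977, 37377.0103, 38815.4830,
    35825.3621, 61680.5559, 56671.0593, 39518.7559, 45454.7000,
    44983.2013, 54884.3839, 85667.6551, 50303.4809, 61691.9456,
    41096.0512, 47266.1778, 53605.0293, 43330.9503, 48908.2677};
  int np;

  printf("np  speedup  efficiency\n");
  for (np = 1; np <= 20; np++) {
    double speedup = triad[np - 1] / triad[0];   /* rate ratio vs. 1 process */
    printf("%2d  %7.2f  %10.2f\n", np, speedup, speedup / np);
  }
  return 0;
}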




Sincerely Yours,

Lei Shi
---------

On Thu, Jun 25, 2015 at 6:44 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Thu, Jun 25, 2015 at 5:51 AM, Lei Shi <stoneszone at gmail.com> wrote:
>
>> Hello,
>>
>
> 1) In order to understand this, we have to disentangle the various effects.
> First, run the STREAMS benchmark
>
>   make NPMAX=4 streams
>
> This will tell you the maximum speedup you can expect on this machine.
>
> 2) For these test cases, also send the output of
>
>   -ksp_view -ksp_converged_reason -ksp_monitor_true_residual
>
>   Thanks,
>
>      Matt
>
>
>> I'm trying to improve the parallel efficiency of the GMRES solve in my CFD
>> solver, where PETSc's GMRES is used to solve the linear system generated by
>> Newton's method. To test its efficiency, I started with a very simple
>> inviscid subsonic 3D flow as the first test case. The parallel efficiency of
>> the GMRES solve with ASM as the preconditioner is very bad. The results are
>> from our latest cluster. Right now, I'm only looking at the wall-clock time
>> of the KSP solve.
>>
>>    1. First I tested ASM with GMRES and ILU(0) on the subdomains; the
>>    CPU time on 2 cores is almost the same as the serial run. Here are the
>>    options for this case (a programmatic sketch of this setup follows the
>>    results table below):
>>
>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>> -ksp_gmres_restart 30 -ksp_pc_side right
>> -pc_type asm -sub_ksp_type gmres -sub_ksp_rtol 0.001 -sub_ksp_atol 1e-30
>> -sub_ksp_max_it 1000 -sub_pc_type ilu -sub_pc_factor_levels 0
>> -sub_pc_factor_fill 1.9
>>
>> The iteration count increases a lot for the parallel runs.
>>
>> cores  iterations  err       petsc solve wall-clock time  speedup  efficiency
>> 1      2           1.15E-04  11.95                        1.0      1.0
>> 2      5           2.05E-02  10.5                         1.01     0.50
>> 4      6           2.19E-02  7.64                         1.39     0.34
>>
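>> For reference, here is a minimal C sketch of how the same configuration
>> could be set up through PETSc's API instead of command-line options. This
>> is only a sketch, not my actual solver: a toy 1D Laplacian stands in for
>> the Jacobian, and the problem size, right-hand side, and call sequence are
>> placeholders.
>>
>> #include <petscksp.h>
>>
>> int main(int argc, char **argv)
>> {
>>   Mat            A;
>>   Vec            x, b;
>>   KSP            ksp, *subksp;
>>   PC             pc, subpc;
>>   PetscInt       i, n = 480, Istart, Iend, nlocal, first, its;
>>   PetscErrorCode ierr;
>>
>>   ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>
>>   /* toy 1D Laplacian standing in for the Newton/Jacobian system */
>>   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>>   ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>>   ierr = MatSetUp(A);CHKERRQ(ierr);
>>   ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>   for (i = Istart; i < Iend; i++) {
>>     if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>     if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>>     ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>>   }
>>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>
>>   ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
>>   ierr = VecSetSizes(b, PETSC_DECIDE, n);CHKERRQ(ierr);
>>   ierr = VecSetFromOptions(b);CHKERRQ(ierr);
>>   ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
>>   ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>>
>>   /* outer solver: GMRES(30), right preconditioning, rtol 1e-5, atol 1e-50, max 100 its */
>>   ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>>   ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
>>   ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
>>   ierr = KSPGMRESSetRestart(ksp, 30);CHKERRQ(ierr);
>>   ierr = KSPSetPCSide(ksp, PC_RIGHT);CHKERRQ(ierr);
>>   ierr = KSPSetTolerances(ksp, 1e-5, 1e-50, PETSC_DEFAULT, 100);CHKERRQ(ierr);
>>   ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>>   ierr = PCSetType(pc, PCASM);CHKERRQ(ierr);
>>   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>>
>>   /* subdomain solvers: GMRES with ILU(0), rtol 1e-3, atol 1e-30, max 1000 its */
>>   ierr = KSPSetUp(ksp);CHKERRQ(ierr);   /* must be called before querying the ASM blocks */
>>   ierr = PCASMGetSubKSP(pc, &nlocal, &first, &subksp);CHKERRQ(ierr);
>>   for (i = 0; i < nlocal; i++) {
>>     ierr = KSPSetType(subksp[i], KSPGMRES);CHKERRQ(ierr);
>>     ierr = KSPSetTolerances(subksp[i], 1e-3, 1e-30, PETSC_DEFAULT, 1000);CHKERRQ(ierr);
>>     ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
>>     ierr = PCSetType(subpc, PCILU);CHKERRQ(ierr);
>>     ierr = PCFactorSetLevels(subpc, 0);CHKERRQ(ierr);
>>     ierr = PCFactorSetFill(subpc, 1.9);CHKERRQ(ierr);
>>   }
>>
>>   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>>   ierr = KSPGetIterationNumber(ksp, &its);CHKERRQ(ierr);
>>   ierr = PetscPrintf(PETSC_COMM_WORLD, "outer iterations: %d\n", (int)its);CHKERRQ(ierr);
>>
>>   ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>>   ierr = VecDestroy(&x);CHKERRQ(ierr);
>>   ierr = VecDestroy(&b);CHKERRQ(ierr);
>>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>>   ierr = PetscFinalize();
>>   return ierr;
>> }
>>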
>>       2.  Then I tested ASM as the preconditioner only, with ILU(0) on the
>> subdomains; the CPU time on 2 cores is better than in the first test, but the
>> speedup is still very bad. Here are the options I'm using:
>>
>> -ksp_type gmres  -ksp_max_it 100 -ksp_rtol 1e-5 -ksp_atol 1e-50
>> -ksp_gmres_restart 30 -ksp_pc_side right
>> -pc_type asm -sub_pc_type ilu -sub_pc_factor_levels 0
>>  -sub_pc_factor_fill 1.9
>>
>> cores  iterations  err       petsc solve CPU time  speedup  efficiency
>> 1      10          4.54E-04  10.68                 1.0      1.0
>> 2      11          9.55E-04  8.2                   1.30     0.65
>> 4      12          3.59E-04  5.26                  2.03     0.50
>>
>>    Those results are from a third-order DG scheme on a very coarse 3D
>> mesh (480 elements). I believe I should get some speedup for this test
>> even on this coarse mesh.
>>
>>   My question is: why does ASM with a local subdomain solve take much longer
>> than ASM as a preconditioner only? Also, the accuracy is very bad. I have
>> tested changing the ASM overlap to 2, but that makes it even worse.
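>>
>> To make the difference between the two runs explicit in code form: relative
>> to the sketch above, the second set of options leaves each subdomain KSP at
>> its default KSPPREONLY, so ILU(0) is applied once per ASM block instead of
>> running an inner GMRES to rtol 1e-3. Again only a sketch of the changed
>> loop, under the same placeholder assumptions as above:
>>
>> /* second test: one ILU(0) application per ASM block (default KSPPREONLY) */
>> for (i = 0; i < nlocal; i++) {
>>   ierr = KSPSetType(subksp[i], KSPPREONLY);CHKERRQ(ierr);
>>   ierr = KSPGetPC(subksp[i], &subpc);CHKERRQ(ierr);
>>   ierr = PCSetType(subpc, PCILU);CHKERRQ(ierr);
>>   ierr = PCFactorSetLevels(subpc, 0);CHKERRQ(ierr);
>>   ierr = PCFactorSetFill(subpc, 1.9);CHKERRQ(ierr);
>> }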
>>
>>   If I use a larger mesh (~4000 elements), the 2nd case with ASM as the
>> preconditioner gives me a better speedup, but it is still not very good.
>>
>>
>> cores  iterations  err       petsc solve CPU time  speedup  efficiency
>> 1      7           1.91E-02  97.32                 1.0      1.0
>> 2      7           2.07E-02  64.94                 1.5      0.74
>> 4      7           2.61E-02  36.97                 2.6      0.65
>>
>> Attached are the log_summary files dumped from PETSc; any suggestions are
>> welcome. I really appreciate it.
>>
>>
>> Sincerely Yours,
>>
>> Lei Shi
>> ---------
>>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
Attachments: proc2_asm_sub_ksp.dat, proc2_asm_pconly.dat, proc1_asm_sub_ksp.dat, proc1_asm_pconly.dat

