[petsc-users] Why the convergence is much slower when I use two nodes

Barry Smith bsmith at mcs.anl.gov
Tue Jun 13 09:41:06 CDT 2017


Before we worry about time, we need to figure out why the MPI parallel jobs have different final residual norms. Given that you have no preconditioner, the residual histories for different numbers of processes should be very similar. 

Run on one and two MPI processes with the option -ksp_monitor_true_residual and send the output. 
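
For example, reusing the command lines you list below (the output file names monitor_1.txt and monitor_2.txt are just placeholders):

   mpirun -n 1 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_monitor_true_residual > monitor_1.txt
   mpirun -n 2 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_monitor_true_residual > monitor_2.txt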

Perhaps there is a bug in the parallel matrix generation, so it does not produce the same matrix as when run sequentially.
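
One quick way to check this (a rough petsc4py sketch; I am assuming the assembled matrix is available inside force_pipe.py as a variable, called A here) is to print a couple of quantities that should not depend on the number of processes, right after assembly:

   from petsc4py import PETSc

   x = A.createVecRight()
   y = A.createVecLeft()
   x.set(1.0)                  # fixed test vector, independent of the process count
   A.mult(x, y)                # y = A * ones
   PETSc.Sys.Print('||A||_F = %g   ||A*ones|| = %g'
                   % (A.norm(PETSc.NormType.FROBENIUS), y.norm()))

If these numbers differ between the -n 1 and -n 2 runs, the parallel assembly is building a different matrix.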

Barry



> On Jun 13, 2017, at 7:38 AM, Ji Zhang <gotofd at gmail.com> wrote:
> 
> mpirun -n 1 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_1.txt
> mpirun -n 2 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_2.txt
> mpirun -n 4 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_3.txt
> mpirun -n 6 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_4.txt
> mpirun -n 2 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_5.txt
> mpirun -n 4 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_6.txt
> mpirun -n 6 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_7.txt
> 
> Dear Barry,
> 
> The following tests were run on our cluster using one, two or three nodes. Each node has 64GB of memory and 24 CPU cores (Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz). Basic information about each node is listed below. 
> 
> $ lstopo 
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
>     Socket L#0 + L3 L#0 (30MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>     HostBridge L#0
>       PCIBridge
>         PCI 1000:0097
>           Block L#0 "sda"
>       PCIBridge
>         PCI 8086:1523
>           Net L#1 "eth0"
>         PCI 8086:1523
>           Net L#2 "eth1"
>       PCIBridge
>         PCIBridge
>           PCI 1a03:2000
>       PCI 8086:8d02
>   NUMANode L#1 (P#1 32GB)
>     Socket L#1 + L3 L#1 (30MB)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
>       L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
>       L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
>       L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
>       L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
>     HostBridge L#5
>       PCIBridge
>         PCI 15b3:1003
>           Net L#3 "ib0"
>           OpenFabrics L#4 "mlx4_0"
> 
> I have tested seven different cases. Each case solves three different linear systems A*x1=b1, A*x2=b2, A*x3=b3, where A is an mpidense matrix and b1, b2, b3 are different right-hand-side vectors. 
> I'm using the GMRES method without a preconditioner, and I have set -ksp_max_it 1000.
>            process  nodes  eq1_residual_norms eq1_duration eq2_residual_norms eq2_duration eq3_residual_norms eq3_duration
> mpi_1.txt: 1        1      9.884635e-04       88.631310s   4.144572e-04       88.855811s   4.864481e-03       88.673738s
> mpi_2.txt: 2        2      6.719300e-01       84.212435s   6.782443e-01       85.153371s   7.223828e-01       85.246724s
> mpi_3.txt: 4        4      5.813354e-01       52.616490s   5.397962e-01       52.413213s   9.503432e-01       52.495871s
> mpi_4.txt: 6        6      4.621066e-01       42.929705s   4.661823e-01       43.367914s   1.047436e+00       43.108877s
> mpi_5.txt: 2        1      6.719300e-01      141.490945s   6.782443e-01      142.746243s   7.223828e-01      142.042608s
> mpi_6.txt: 3        1      5.813354e-01      165.061162s   5.397962e-01      196.539286s   9.503432e-01      180.240947s
> mpi_7.txt: 6        1      4.621066e-01      213.683270s   4.661823e-01      208.180939s   1.047436e+00      194.251886s
> I found that all residual norms are on the order of 1 except in the first case, where I use only one process on one node. 
> Please see the attached files for more details. 
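> 
> For reference, the solver setup is essentially the following (a simplified petsc4py sketch; the variable names A, b1, x1 here are placeholders and the actual code in force_pipe.py differs):
> 
>     from petsc4py import PETSc
> 
>     ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
>     ksp.setOperators(A)                       # A is the mpidense matrix
>     ksp.setType(PETSc.KSP.Type.GMRES)         # GMRES ...
>     ksp.getPC().setType(PETSc.PC.Type.NONE)   # ... without a preconditioner
>     ksp.setFromOptions()                      # picks up -ksp_max_it 1000, -ksp_view, ...
>     x1 = A.createVecRight()
>     ksp.solve(b1, x1)                         # likewise for b2, b3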
> 
> 
> 
> Best, 
> Regards, 
> Zhang Ji, PhD student
> Beijing Computational Science Research Center 
> Zhongguancun Software Park II, No. 10 Dongbeiwang West Road, Haidian District, Beijing 100193, China 
> 
> On Tue, Jun 13, 2017 at 9:34 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    You need to provide more information. What is the output of -ksp_view and -log_view for both cases?
> 
> > On Jun 12, 2017, at 7:11 PM, Ji Zhang <gotofd at gmail.com> wrote:
> >
> > Dear all,
> >
> > I'm a PETSc user. I'm using the GMRES method to solve some linear equations. I'm using the boundary element method, so the matrix type is dense (or mpidense). I'm using MPICH2. I found that the convergence is fast if I use only one compute node, and much slower if I use two or more nodes. I'm interested in why this happens, and in how I can improve the convergence when I use multiple nodes.
> >
> > Thanks a lot.
> >
> >
> > Best,
> > Regards,
> > Zhang Ji, PhD student
> > Beijing Computational Science Research Center
> > Zhongguancun Software Park II, No. 10 Dongbeiwang West Road, Haidian District, Beijing 100193, China
> 
> 
> <mpi_7.txt><mpi_6.txt><mpi_5.txt><mpi_4.txt><mpi_3.txt><mpi_2.txt><mpi_1.txt>


