[petsc-users] Why the convergence is much slower when I use two nodes

Ji Zhang gotofd at gmail.com
Tue Jun 13 07:38:17 CDT 2017


mpirun -n 1 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_1.txt
mpirun -n 2 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_2.txt
mpirun -n 4 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_3.txt
mpirun -n 6 -hostfile hostfile python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_4.txt
mpirun -n 2 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_5.txt
mpirun -n 4 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_6.txt
mpirun -n 6 python force_pipe.py -dp 0.05 -ep 2 -lp 2 -b0 1e-4 -b1 0.9 -nb 1 -th 30 -ksp_max_it 1000 -ksp_view -log_view > mpi_7.txt

Dear Barry,

The following tests were run on our cluster using one, two, or three nodes.
Each node has 64 GB of memory and 24 CPU cores (Intel(R) Xeon(R) CPU E5-2680
v3 @ 2.50GHz). Basic information for each node is listed below.

$ lstopo
Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (30MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
    HostBridge L#0
      PCIBridge
        PCI 1000:0097
          Block L#0 "sda"
      PCIBridge
        PCI 8086:1523
          Net L#1 "eth0"
        PCI 8086:1523
          Net L#2 "eth1"
      PCIBridge
        PCIBridge
          PCI 1a03:2000
      PCI 8086:8d02
  NUMANode L#1 (P#1 32GB)
    Socket L#1 + L3 L#1 (30MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    HostBridge L#5
      PCIBridge
        PCI 15b3:1003
          Net L#3 "ib0"
          OpenFabrics L#4 "mlx4_0"

I have tested seven different cases. Each case solves three different linear
systems A*x1=b1, A*x2=b2, and A*x3=b3, where A is an mpidense matrix and b1,
b2, b3 are different right-hand-side vectors. I'm using the GMRES method
without a preconditioner, and I have set -ksp_max_it 1000.
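
In petsc4py the solver is configured roughly as follows (a minimal sketch,
not the actual force_pipe.py code; the matrix assembly is omitted and the
names A, b, x are placeholders):

    from petsc4py import PETSc

    # A is the assembled MPIDENSE matrix, b the right-hand side (placeholders)
    ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
    ksp.setOperators(A)
    ksp.setType('gmres')                       # GMRES(30) restart by default
    ksp.getPC().setType('none')                # no preconditioner
    ksp.setTolerances(rtol=1e-5, max_it=1000)  # matches -ksp_max_it 1000
    ksp.setInitialGuessNonzero(True)
    ksp.setFromOptions()                       # picks up -ksp_view, -log_view, ...
    x = b.duplicate()
    ksp.solve(b, x)

The same setup is reused for the three right-hand sides b1, b2, b3 with the
same matrix A.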
           processes  nodes  eq1_residual_norm  eq1_duration  eq2_residual_norm  eq2_duration  eq3_residual_norm  eq3_duration
mpi_1.txt:     1        1      9.884635e-04      88.631310s     4.144572e-04      88.855811s     4.864481e-03      88.673738s
mpi_2.txt:     2        2      6.719300e-01      84.212435s     6.782443e-01      85.153371s     7.223828e-01      85.246724s
mpi_3.txt:     4        4      5.813354e-01      52.616490s     5.397962e-01      52.413213s     9.503432e-01      52.495871s
mpi_4.txt:     6        6      4.621066e-01      42.929705s     4.661823e-01      43.367914s     1.047436e+00      43.108877s
mpi_5.txt:     2        1      6.719300e-01     141.490945s     6.782443e-01     142.746243s     7.223828e-01     142.042608s
mpi_6.txt:     4        1      5.813354e-01     165.061162s     5.397962e-01     196.539286s     9.503432e-01     180.240947s
mpi_7.txt:     6        1      4.621066e-01     213.683270s     4.661823e-01     208.180939s     1.047436e+00     194.251886s
I found that all residual norms are on the order of 1 except in the first
case, where I use only one process on one node.
Please see the attached files for more details.
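
For reference, the residual norms listed above can be cross-checked against
the true residual ||b - A*x||; a small petsc4py sketch, assuming ksp, A, b
and x are as in the configuration sketch above (with PC type none the
preconditioned and true residuals should essentially agree):

    # compute r = b - A*x explicitly and compare with the KSP-reported norm
    r = b.duplicate()
    A.mult(x, r)        # r = A*x
    r.aypx(-1.0, b)     # r = b - A*x
    PETSc.Sys.Print('true residual norm:', r.norm())
    PETSc.Sys.Print('KSP residual norm: ', ksp.getResidualNorm())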


Sincerely,
Zhang Ji (PhD student)
Beijing Computational Science Research Center
Building 9, East Zone, No. 10 Xibeiwang East Road, Haidian District, Beijing (100193)

Best,
Regards,
Zhang Ji, PhD student
Beijing Computational Science Research Center
Zhongguancun Software Park II, No. 10 Dongbeiwang West Road, Haidian
District, Beijing 100193, China

On Tue, Jun 13, 2017 at 9:34 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>    You need to provide more information. What is the output of -ksp_view
> and -log_view for both cases?
>
> > On Jun 12, 2017, at 7:11 PM, Ji Zhang <gotofd at gmail.com> wrote:
> >
> > Dear all,
> >
> > I'm a PETSc user. I'm using the GMRES method to solve some linear
> > equations. I'm using the boundary element method, so the matrix type is
> > dense (or mpidense). I'm using MPICH2. I found that the convergence is
> > fast if I use only one compute node, and much slower if I use two or
> > more nodes. I'm interested in why this happens, and how I can improve
> > the convergence performance when I use multiple nodes.
> >
> > Thanks a lot.
> >
> > Sincerely,
> > Zhang Ji (PhD student)
> > Beijing Computational Science Research Center
> > Building 9, East Zone, No. 10 Xibeiwang East Road, Haidian District, Beijing (100193)
> >
> > Best,
> > Regards,
> > Zhang Ji, PhD student
> > Beijing Computational Science Research Center
> > Zhongguancun Software Park II, No. 10 Dongbeiwang West Road, Haidian
> District, Beijing 100193, China
>
>
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of seriers is 30
  b: 1 numbers are evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file headle: force_pipe
MPI size: 6
Stokeslets in pipe prepare, contain 7376 nodes
  create matrix use 3.737578s:
  _00001/00001_b=0.000100:    calculate boundary condation use: 1.798243s
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 213.683270s, with residual norm 4.621066e-01
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 208.180939s, with residual norm 4.661823e-01
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 194.251886s, with residual norm 1.047436e+00
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn10 with 6 processors, by zhangji Tue Jun 13 18:11:46 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           6.236e+02      1.00013   6.235e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                5.073e+11      1.00081   5.070e+11  3.042e+12
Flops/sec:            8.136e+08      1.00092   8.131e+08  4.879e+09
MPI Messages:         4.200e+01      2.33333   3.000e+01  1.800e+02
MPI Message Lengths:  2.520e+02      2.33333   6.000e+00  1.080e+03
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 6.2355e+02 100.0%  3.0421e+12 100.0%  1.800e+02 100.0%  6.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 1.7038e+02 1.7 3.41e+08 1.0 0.0e+00 0.0e+00 3.0e+03 20  0  0  0 31  20  0  0  0 31    12
VecNorm             3102 1.0 7.8933e+01 1.2 2.29e+07 1.0 0.0e+00 0.0e+00 3.1e+03 11  0  0  0 33  11  0  0  0 33     2
VecScale            3102 1.0 3.2920e-02 3.1 1.14e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2085
VecCopy             3204 1.0 1.1629e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 1.1544e-0212.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 1.8733e-03 1.4 1.48e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4749
VecMAXPY            3102 1.0 3.3990e-01 2.0 3.63e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  6406
VecAssemblyBegin       9 1.0 1.8613e-03 2.0 0.00e+00 0.0 1.8e+02 6.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 3.7193e-05 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 8.3257e+01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03 10  0  0  0 33  10  0  0  0 33     0
VecNormalize        3102 1.0 7.8971e+01 1.2 3.43e+07 1.0 0.0e+00 0.0e+00 3.1e+03 11  0  0  0 33  11  0  0  0 33     3
MatMult             3105 1.0 4.4362e+02 1.2 5.07e+11 1.0 0.0e+00 0.0e+00 3.1e+03 67100  0  0 33  67100  0  0 33  6848
MatAssemblyBegin       2 1.0 7.7588e-0211.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 1.7595e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 6.1056e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 1.1835e-01 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 1.7062e+02 1.7 6.82e+08 1.0 0.0e+00 0.0e+00 3.0e+03 20  0  0  0 31  20  0  0  0 31    24
KSPSetUp               3 1.0 3.8290e-04 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 6.1546e+02 1.0 5.07e+11 1.0 0.0e+00 0.0e+00 9.2e+03 99100  0  0 96  99100  0  0 96  4938
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15        78244     0.
              Vector   205            205      6265288     0.
      Vector Scatter    26             26       161064     0.
              Matrix     9              9    653332008     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       191044     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 0.
Average time for MPI_Barrier(): 6.19888e-06
Average time for zero size MPI_Send(): 2.30471e-06
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of seriers is 30
  b: 1 numbers are evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file headle: force_pipe
MPI size: 4
Stokeslets in pipe prepare, contain 7376 nodes
  create matrix use 4.977263s:
  _00001/00001_b=0.000100:    calculate boundary condation use: 1.769788s
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 165.061162s, with residual norm 5.813354e-01
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 196.539286s, with residual norm 5.397962e-01
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 180.240947s, with residual norm 9.503432e-01
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn10 with 4 processors, by zhangji Tue Jun 13 18:01:22 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           5.506e+02      1.00007   5.506e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                7.605e+11      1.00000   7.605e+11  3.042e+12
Flops/sec:            1.381e+09      1.00007   1.381e+09  5.525e+09
MPI Messages:         3.000e+01      1.66667   2.700e+01  1.080e+02
MPI Message Lengths:  1.800e+02      1.66667   6.000e+00  6.480e+02
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 5.5060e+02 100.0%  3.0421e+12 100.0%  1.080e+02 100.0%  6.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 1.0788e+02 1.9 5.11e+08 1.0 0.0e+00 0.0e+00 3.0e+03 14  0  0  0 31  14  0  0  0 31    19
VecNorm             3102 1.0 5.0609e+01 1.1 3.43e+07 1.0 0.0e+00 0.0e+00 3.1e+03  9  0  0  0 33   9  0  0  0 33     3
VecScale            3102 1.0 1.3757e-02 1.1 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4990
VecCopy             3204 1.0 9.1627e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 1.4656e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 2.9745e-03 1.6 2.22e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2991
VecMAXPY            3102 1.0 4.4902e-01 1.6 5.44e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4849
VecAssemblyBegin       9 1.0 1.2916e-01 1.6 0.00e+00 0.0 1.1e+02 6.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 3.3617e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 5.2983e+01 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03  7  0  0  0 33   7  0  0  0 33     0
VecNormalize        3102 1.0 5.0635e+01 1.1 5.15e+07 1.0 0.0e+00 0.0e+00 3.1e+03  9  0  0  0 33   9  0  0  0 33     4
MatMult             3105 1.0 4.3607e+02 1.1 7.59e+11 1.0 0.0e+00 0.0e+00 3.1e+03 76100  0  0 33  76100  0  0 33  6966
MatAssemblyBegin       2 1.0 2.7158e-013390.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 1.6093e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 3.9665e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 9.2988e-02 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 1.0832e+02 1.9 1.02e+09 1.0 0.0e+00 0.0e+00 3.0e+03 14  0  0  0 31  14  0  0  0 31    38
KSPSetUp               3 1.0 3.7909e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 5.4123e+02 1.0 7.60e+11 1.0 0.0e+00 0.0e+00 9.2e+03 98100  0  0 96  98100  0  0 96  5615
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15       112628     0.
              Vector   205            205      8343112     0.
      Vector Scatter    26             26       229832     0.
              Matrix     9              9    979454472     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       225428     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 2.6226e-06
Average time for zero size MPI_Send(): 2.26498e-06
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of seriers is 30
  b: 1 numbers are evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file headle: force_pipe
MPI size: 2
Stokeslets in pipe prepare, contain 7376 nodes
  create matrix use 8.694003s:
  _00001/00001_b=0.000100:    calculate boundary condation use: 1.975384s
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 141.490945s, with residual norm 6.719300e-01
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 142.746243s, with residual norm 6.782443e-01
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 142.042608s, with residual norm 7.223828e-01
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn10 with 2 processors, by zhangji Tue Jun 13 17:52:11 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           4.388e+02      1.00006   4.387e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                1.521e+12      1.00000   1.521e+12  3.042e+12
Flops/sec:            3.467e+09      1.00006   3.467e+09  6.934e+09
MPI Messages:         1.200e+01      1.00000   1.200e+01  2.400e+01
MPI Message Lengths:  1.080e+02      1.00000   9.000e+00  2.160e+02
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.3875e+02 100.0%  3.0421e+12 100.0%  2.400e+01 100.0%  9.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 2.6931e+01 4.4 1.02e+09 1.0 0.0e+00 0.0e+00 3.0e+03  4  0  0  0 31   4  0  0  0 31    76
VecNorm             3102 1.0 1.4946e+00 2.1 6.86e+07 1.0 0.0e+00 0.0e+00 3.1e+03  0  0  0  0 33   0  0  0  0 33    92
VecScale            3102 1.0 1.9959e-02 1.4 3.43e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3439
VecCopy             3204 1.0 8.3234e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 2.2550e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 2.7293e-01 1.2 4.45e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    33
VecMAXPY            3102 1.0 5.8037e-01 1.7 1.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3752
VecAssemblyBegin       9 1.0 1.0383e-0215.2 0.00e+00 0.0 2.4e+01 9.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 2.5272e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 8.7240e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03  0  0  0  0 33   0  0  0  0 33     0
VecNormalize        3102 1.0 1.5182e+00 2.1 1.03e+08 1.0 0.0e+00 0.0e+00 3.1e+03  0  0  0  0 33   0  0  0  0 33   136
MatMult             3105 1.0 4.1876e+02 1.1 1.52e+12 1.0 0.0e+00 0.0e+00 3.1e+03 93100  0  0 33  93100  0  0 33  7254
MatAssemblyBegin       2 1.0 5.4870e-02676.9 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 1.6594e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 1.1323e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 8.3565e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 2.7499e+01 4.3 2.04e+09 1.0 0.0e+00 0.0e+00 3.0e+03  4  0  0  0 31   4  0  0  0 31   149
KSPSetUp               3 1.0 3.8886e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 4.2584e+02 1.0 1.52e+12 1.0 0.0e+00 0.0e+00 9.2e+03 97100  0  0 96  97100  0  0 96  7137
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15       215892     0.
              Vector   205            205     14583280     0.
      Vector Scatter    26             26       436360     0.
              Matrix     9              9   1958884008     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       328692     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
Average time for MPI_Barrier(): 1.7643e-06
Average time for zero size MPI_Send(): 3.45707e-06
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of seriers is 30
  b: 1 numbers are evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file headle: force_pipe
MPI size: 6
Stokeslets in pipe prepare, contain 7376 nodes
  create matrix use 3.769214s:
  _00001/00001_b=0.000100:    calculate boundary condation use: 1.710873s
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 42.929705s, with residual norm 4.621066e-01
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 43.367914s, with residual norm 4.661823e-01
KSP Object: 6 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 6 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   6 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 43.108877s, with residual norm 1.047436e+00
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn5 with 6 processors, by zhangji Tue Jun 13 17:44:52 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           1.367e+02      1.00029   1.367e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                5.073e+11      1.00081   5.070e+11  3.042e+12
Flops/sec:            3.713e+09      1.00111   3.710e+09  2.226e+10
MPI Messages:         4.200e+01      2.33333   3.000e+01  1.800e+02
MPI Message Lengths:  2.520e+02      2.33333   6.000e+00  1.080e+03
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.3666e+02 100.0%  3.0421e+12 100.0%  1.800e+02 100.0%  6.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 8.5456e+00 5.3 3.41e+08 1.0 0.0e+00 0.0e+00 3.0e+03  4  0  0  0 31   4  0  0  0 31   239
VecNorm             3102 1.0 1.0225e+00 1.7 2.29e+07 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33   134
VecScale            3102 1.0 9.2647e-03 1.5 1.14e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  7409
VecCopy             3204 1.0 3.4087e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 1.1790e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 1.6315e-03 1.4 1.48e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  5452
VecMAXPY            3102 1.0 9.3553e-02 1.1 3.63e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 23274
VecAssemblyBegin       9 1.0 1.2650e-02 2.2 0.00e+00 0.0 1.8e+02 6.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 4.8575e-022122.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 7.4042e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03  5  0  0  0 33   5  0  0  0 33     0
VecNormalize        3102 1.0 1.0357e+00 1.7 3.43e+07 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33   199
MatMult             3105 1.0 1.2659e+02 1.1 5.07e+11 1.0 0.0e+00 0.0e+00 3.1e+03 90100  0  0 33  90100  0  0 33 23996
MatAssemblyBegin       2 1.0 2.0758e-01138.7 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 1.9758e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 1.5299e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 2.1458e-06 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 3.6064e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 8.6466e+00 5.1 6.82e+08 1.0 0.0e+00 0.0e+00 3.0e+03  4  0  0  0 31   4  0  0  0 31   473
KSPSetUp               3 1.0 1.5187e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 1.2925e+02 1.0 5.07e+11 1.0 0.0e+00 0.0e+00 9.2e+03 95100  0  0 96  95100  0  0 96 23515
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15        78244     0.
              Vector   205            205      6265288     0.
      Vector Scatter    26             26       161064     0.
              Matrix     9              9    653332008     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       191044     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 0.
Average time for MPI_Barrier(): 0.000177193
Average time for zero size MPI_Send(): 3.89814e-05
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of series is 30
  b: 1 number is evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file handle: force_pipe
MPI size: 4
Stokeslets in pipe prepared, containing 7376 nodes
  create matrix use 6.052226s:
  _00001/00001_b=0.000100:    calculate boundary condition use: 1.826128s
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 52.616490s, with residual norm 5.813354e-01
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 52.413213s, with residual norm 5.397962e-01
KSP Object: 4 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 4 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   4 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 52.495871s, with residual norm 9.503432e-01
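The KSP and PC views above amount to plain GMRES(30) with no preconditioner applied to a 22128 x 22128 dense matrix, with rtol 1e-5, at most 1000 iterations, and a nonzero initial guess. For reference, here is a minimal petsc4py sketch of an equivalent solver setup. It assumes petsc4py (which force_pipe.py appears to use) and does not reproduce the pf_stokesletsInPipe assembly: the matrix entries, right-hand side, and variable names below are placeholders only.

from petsc4py import PETSc

n = 22128                               # global size, from "rows=22128, cols=22128" above
                                        # (reduce n for a quick test; the full size needs ~3.9 GB)
A = PETSc.Mat().createDense((n, n), comm=PETSc.COMM_WORLD)
A.setUp()

rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A.setValue(i, i, 1.0)               # placeholder: identity instead of the dense Stokeslet kernel
A.assemble()

x, b = A.createVecs()                   # x: unknown, b: right-hand side
b.set(1.0)                              # placeholder right-hand side

opts = PETSc.Options()
opts["ksp_gmres_restart"] = 30          # restart=30, as in the KSP view above

ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
ksp.setOperators(A)
ksp.setType(PETSc.KSP.Type.GMRES)
ksp.getPC().setType(PETSc.PC.Type.NONE) # "PC Object: type: none"
ksp.setTolerances(rtol=1e-5, atol=1e-50, divtol=1e4, max_it=1000)
ksp.setInitialGuessNonzero(True)
ksp.setFromOptions()
ksp.solve(b, x)
ksp.view()                              # prints a KSP/PC block of the same form as above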
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn5 with 4 processors, by zhangji Tue Jun 13 17:42:35 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           1.675e+02      1.00001   1.675e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                7.605e+11      1.00000   7.605e+11  3.042e+12
Flops/sec:            4.541e+09      1.00001   4.541e+09  1.816e+10
MPI Messages:         3.000e+01      1.66667   2.700e+01  1.080e+02
MPI Message Lengths:  1.800e+02      1.66667   6.000e+00  6.480e+02
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.6749e+02 100.0%  3.0421e+12 100.0%  1.080e+02 100.0%  6.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 1e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 3.3006e+01 56.7 5.11e+08 1.0 0.0e+00 0.0e+00 3.0e+03  9  0  0  0 31   9  0  0  0 31    62
VecNorm             3102 1.0 1.4346e+00 3.8 3.43e+07 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33    96
VecScale            3102 1.0 1.0346e-02 1.2 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  6634
VecCopy             3204 1.0 3.8601e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 1.1497e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 1.7583e-03 1.1 2.22e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  5059
VecMAXPY            3102 1.0 1.4580e-01 1.1 5.44e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 14934
VecAssemblyBegin       9 1.0 9.8622e-03 2.9 0.00e+00 0.0 1.1e+02 6.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 3.1948e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 5.3085e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03  3  0  0  0 33   3  0  0  0 33     0
VecNormalize        3102 1.0 1.4464e+00 3.7 5.15e+07 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33   142
MatMult             3105 1.0 1.5605e+02 1.3 7.59e+11 1.0 0.0e+00 0.0e+00 3.1e+03 84 100  0  0 33  84 100  0  0 33 19466
MatAssemblyBegin       2 1.0 9.4833e-01 48.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 9.2912e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 1.7860e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 4.0367e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 3.3157e+01 46.5 1.02e+09 1.0 0.0e+00 0.0e+00 3.0e+03  9  0  0  0 31   9  0  0  0 31   123
KSPSetUp               3 1.0 8.2684e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 1.5735e+02 1.0 7.60e+11 1.0 0.0e+00 0.0e+00 9.2e+03 94 100  0  0 96  94 100  0  0 96 19315
------------------------------------------------------------------------------------------------------------------------
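Two of the derived numbers in the event table above can be reproduced by hand from the displayed values, using the flop-counting convention (2N flops per real VecAXPY) and the Mflop/s definition from the header. A short check, with illustrative variable names; the small mismatches come from the rounding of the printed values:

n_global = 22128                       # vector length, from "rows=22128"
n_local = n_global // 4                # per-rank length on 4 ranks
axpy_calls = 201

# VecAXPY on a real vector of length N costs 2*N flops
print(axpy_calls * 2 * n_local)        # 2223864, i.e. the 2.22e+06 in the VecAXPY flops column

# Mflop/s = 1e-6 * (sum of flops over all ranks) / (max time over all ranks)
total_flops = 3.0421e12                # Main Stage total flops
ksp_time = 1.5735e2                    # KSPSolve max time in seconds
print(1e-6 * total_flops / ksp_time)   # ~19333, close to the 19315 reported for KSPSolve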

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15       112628     0.
              Vector   205            205      8343112     0.
      Vector Scatter    26             26       229832     0.
              Matrix     9              9    979454472     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       225428     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
Average time for MPI_Barrier(): 7.45773e-05
Average time for zero size MPI_Send(): 6.04987e-05
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of series is 30
  b: 1 number is evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file handle: force_pipe
MPI size: 2
Stokeslets in pipe prepared, containing 7376 nodes
  create matrix use 30.575036s:
  _00001/00001_b=0.000100:    calculate boundary condition use: 3.463875s
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 84.212435s, with residual norm 6.719300e-01
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 85.153371s, with residual norm 6.782443e-01
KSP Object: 2 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 2 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   2 MPI processes
    type: mpidense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 85.246724s, with residual norm 7.223828e-01
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn5 with 2 processors, by zhangji Tue Jun 13 17:39:46 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           2.908e+02      1.00002   2.908e+02
Objects:              4.130e+02      1.00000   4.130e+02
Flops:                1.521e+12      1.00000   1.521e+12  3.042e+12
Flops/sec:            5.231e+09      1.00002   5.231e+09  1.046e+10
MPI Messages:         1.200e+01      1.00000   1.200e+01  2.400e+01
MPI Message Lengths:  1.080e+02      1.00000   9.000e+00  2.160e+02
MPI Reductions:       9.541e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 2.9076e+02 100.0%  3.0421e+12 100.0%  2.400e+01 100.0%  9.000e+00      100.0%  9.540e+03 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 1e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 7.5770e+01 146.1 1.02e+09 1.0 0.0e+00 0.0e+00 3.0e+03 13  0  0  0 31  13  0  0  0 31    27
VecNorm             3102 1.0 2.5821e+00 4.9 6.86e+07 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33    53
VecScale            3102 1.0 1.5304e-02 1.1 3.43e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4485
VecCopy             3204 1.0 6.6598e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               123 1.0 1.7567e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 1.9052e-02 2.3 4.45e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   467
VecMAXPY            3102 1.0 3.7953e-01 1.4 1.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  5737
VecAssemblyBegin       9 1.0 7.8998e-03 3.9 0.00e+00 0.0 2.4e+01 9.0e+00 2.7e+01  0  0100100  0   0  0100100  0     0
VecAssemblyEnd         9 1.0 2.5034e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     3114 1.0 3.4727e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33     0
VecNormalize        3102 1.0 2.6013e+00 4.8 1.03e+08 1.0 0.0e+00 0.0e+00 3.1e+03  1  0  0  0 33   1  0  0  0 33    79
MatMult             3105 1.0 2.5316e+02 1.4 1.52e+12 1.0 0.0e+00 0.0e+00 3.1e+03 74 100  0  0 33  74 100  0  0 33 11999
MatAssemblyBegin       2 1.0 2.0132e+01 910.2 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatAssemblyEnd         2 1.0 5.2810e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 1.5538e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 6.7246e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 7.6143e+01 97.9 2.04e+09 1.0 0.0e+00 0.0e+00 3.0e+03 13  0  0  0 31  13  0  0  0 31    54
KSPSetUp               3 1.0 7.3075e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 2.5436e+02 1.0 1.52e+12 1.0 0.0e+00 0.0e+00 9.2e+03 87 100  0  0 96  87 100  0  0 96 11949
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    51             51        39576     0.
   IS L to G Mapping    15             15       215892     0.
              Vector   205            205     14583280     0.
      Vector Scatter    26             26       436360     0.
              Matrix     9              9   1958884008     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       328692     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
Average time to get PetscTime(): 0.
Average time for MPI_Barrier(): 3.65734e-05
Average time for zero size MPI_Send(): 6.1512e-05
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------
-------------- next part --------------
Case information: 
  pipe length: 2.000000, pipe radius: 1.000000
  delta length of pipe is 0.050000, epsilon of pipe is  2.000000
  threshold of series is 30
  b: 1 number is evenly distributed within the range [0.000100, 0.900000]
  create matrix method: pf_stokesletsInPipe 
  solve method: gmres, precondition method: none
  output file handle: force_pipe
MPI size: 1
Stokeslets in pipe prepared, containing 7376 nodes
  create matrix use 80.827850s:
  _00001/00001_b=0.000100:    calculate boundary condition use: 3.421076s
KSP Object: 1 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqdense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u1: solve matrix equation use: 88.631310s, with residual norm 9.884635e-04
KSP Object: 1 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqdense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u2: solve matrix equation use: 88.855811s, with residual norm 4.144572e-04
KSP Object: 1 MPI processes
  type: gmres
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30
  maximum iterations=1000
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqdense
    rows=22128, cols=22128
    total: nonzeros=489648384, allocated nonzeros=489648384
    total number of mallocs used during MatSetValues calls =0
  _00001/00001_u3: solve matrix equation use: 88.673738s, with residual norm 4.864481e-03
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

force_pipe.py on a linux-mpich-opblas named cn5 with 1 processor, by zhangji Tue Jun 13 17:34:55 2017
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           3.521e+02      1.00000   3.521e+02
Objects:              4.010e+02      1.00000   4.010e+02
Flops:                3.042e+12      1.00000   3.042e+12  3.042e+12
Flops/sec:            8.640e+09      1.00000   8.640e+09  8.640e+09
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 3.5212e+02 100.0%  3.0421e+12 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 1e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecMDot             3000 1.0 7.4338e-01 1.0 2.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2750
VecNorm             3102 1.0 3.6990e-02 1.0 1.37e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3711
VecScale            3102 1.0 2.4405e-02 1.0 6.86e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2813
VecCopy             3204 1.0 1.1260e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               173 1.0 4.3569e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              201 1.0 1.3273e-02 1.0 8.90e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   670
VecMAXPY            3102 1.0 5.2372e-01 1.0 2.18e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4158
VecAssemblyBegin       9 1.0 4.3392e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         9 1.0 4.2915e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin        9 1.0 2.2984e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize        3102 1.0 6.4425e-02 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3196
MatMult             3105 1.0 2.6468e+02 1.0 3.04e+12 1.0 0.0e+00 0.0e+00 0.0e+00 75 100  0  0  0  75 100  0  0  0 11477
MatAssemblyBegin       2 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                3 1.0 6.1703e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             3102 1.0 1.1186e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      3000 1.0 1.2434e+00 1.0 4.09e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3289
KSPSetUp               3 1.0 5.0569e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               3 1.0 2.6590e+02 1.0 3.04e+12 1.0 0.0e+00 0.0e+00 0.0e+00 76 100  0  0  0  76 100  0  0  0 11430
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set    47             47        36472     0.
   IS L to G Mapping    15             15       422420     0.
              Vector   201            201     26873904     0.
      Vector Scatter    24             24        15744     0.
              Matrix     7              7   3917737368     0.
      Preconditioner     3              3         2448     0.
       Krylov Solver     3              3        55200     0.
    Distributed Mesh    25             25       535220     0.
Star Forest Bipartite Graph    50             50        42176     0.
     Discrete System    25             25        21600     0.
========================================================================================================================
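The Matrix rows of the memory tables are dominated by the dense operator itself: 22128 x 22128 double-precision entries, split roughly evenly over the ranks. A quick check (illustrative only; the small remainders are PETSc object headers and the other, much smaller Matrix objects):

n = 22128
full = n * n * 8          # bytes for the full dense matrix
print(full)               # 3917187072, vs. 3917737368 reported on 1 process
print(full // 2)          # 1958593536, vs. 1958884008 reported for process 0 on 2 processes
print(full // 4)          #  979296768, vs.  979454472 reported for process 0 on 4 processes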
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-b0 1e-4
-b1 0.9
-dp 0.05
-ep 2
-ksp_max_it 1000
-ksp_view
-log_view
-lp 2
-nb 1
-th 30
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-blas-lapack-lib=/public/software/OpenBLAS/lib/libopenblas.a --with-mpi-dir=/home/zhangji/python/mpich-3.2/build --with-hdf5-dir=/public/software/mathlib/hdf5/1.8.12/gnu/ PETSC_DIR=/home/zhangji/python/petsc-3.7.6 PETSC_ARCH=linux-mpich-opblas --download-metis=/public/sourcecode/petsc_gnu/metis-5.1.0.tar.gz --download-ptscotch=/home/zhangji/python/scotch_6.0.4.tar.gz --download-pastix=/home/zhangji/python/pastix_5.2.3.tar.bz2 --with-debugging=no --CFLAGS=-O3 --CXXFLAGS=-O3 --FFLAGS=-O3
-----------------------------------------
Libraries compiled on Sat Jun 10 00:26:59 2017 on cn11 
Machine characteristics: Linux-2.6.32-504.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /home/zhangji/python/petsc-3.7.6
Using PETSc arch: linux-mpich-opblas
-----------------------------------------

Using C compiler: /home/zhangji/python/mpich-3.2/build/bin/mpicc -O3 -fPIC   ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/zhangji/python/mpich-3.2/build/bin/mpif90 -O3 -fPIC   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/include -I/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/include -I/public/software/mathlib/hdf5/1.8.12/gnu/include -I/home/zhangji/python/mpich-3.2/build/include
-----------------------------------------

Using C linker: /home/zhangji/python/mpich-3.2/build/bin/mpicc
Using Fortran linker: /home/zhangji/python/mpich-3.2/build/bin/mpif90
Using libraries: -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -lpetsc -Wl,-rpath,/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -L/home/zhangji/python/petsc-3.7.6/linux-mpich-opblas/lib -Wl,-rpath,/public/software/OpenBLAS/lib -L/public/software/OpenBLAS/lib -Wl,-rpath,/public/software/mathlib/hdf5/1.8.12/gnu/lib -L/public/software/mathlib/hdf5/1.8.12/gnu/lib -Wl,-rpath,/home/zhangji/python/mpich-3.2/build/lib -L/home/zhangji/python/mpich-3.2/build/lib -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/compiler/lib/intel64 -Wl,-rpath,/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -L/public/software/compiler/intel/composer_xe_2015.2.164/mkl/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/home/zhangji/python/petsc-3.7.6 -L/home/zhangji/python/petsc-3.7.6 -lmetis -lpastix -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz -lX11 -lssl -lcrypto -lmpifort -lifport -lifcore -lpthread -lmpicxx -lrt -lm -lrt -lm -lpthread -lz -ldl -lmpi -limf -lsvml -lirng -lm -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -ldl
-----------------------------------------

