superlu_dist options

Barry Smith bsmith at mcs.anl.gov
Fri May 8 11:15:43 CDT 2009


On May 8, 2009, at 11:03 AM, Matthew Knepley wrote:

> Look at the timing. The symbolic factorization takes 1e-4 seconds  
> and the numeric takes
> only 10s, out of 542s. MatSolve is taking 517s. If you have a  
> problem, it is likely there.
> However, the MatSolve looks balanced.

    Something is funky with this. The 28 solves should not take so much
longer than the numeric factorization. Perhaps it is worth saving the
matrix and reporting this as a performance bug to Sherrie.

    Barry
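
A minimal sketch of one way to save the assembled operator for such a
report, assuming a matrix handle A and a file name chosen here only for
illustration (the calls follow the petsc-3.0-era API used in this thread;
newer releases pass &viewer to PetscViewerDestroy()):

    #include "petscmat.h"

    /* Sketch: dump an assembled Mat to a PETSc binary file so the case can
       be reproduced and attached to a bug report.  "A" and the file name
       are placeholders. */
    PetscErrorCode DumpMatrix(Mat A, const char filename[])
    {
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, filename, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
      ierr = MatView(A, viewer);CHKERRQ(ierr);          /* writes the matrix in PETSc binary format */
      ierr = PetscViewerDestroy(viewer);CHKERRQ(ierr);  /* petsc-3.0 signature; later releases take &viewer */
      PetscFunctionReturn(0);
    }

The file written this way can later be read back with MatLoad() in a
standalone test of the factorization.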

>
>
>   Matt
>
> On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <fredrik.bengzon at math.umu.se 
> > wrote:
> Hi,
> Here is the output from the KSP and EPS objects, and the log summary.
> / Fredrik
>
>
> Reading Triangle/Tetgen mesh
> #nodes=19345
> #elements=81895
> #nodes per element=4
> Partitioning mesh with METIS 4.0
> Element distribution (rank | #elements)
> 0 | 19771
> 1 | 20954
> 2 | 20611
> 3 | 20559
> rank 1 has 257 ghost nodes
> rank 0 has 127 ghost nodes
> rank 2 has 143 ghost nodes
> rank 3 has 270 ghost nodes
> Calling 3D Navier-Lame Eigenvalue Solver
> Assembling stiffness and mass matrix
> Solving eigensystem with SLEPc
> KSP Object:(st_)
>  type: preonly
>  maximum iterations=100000, initial guess is zero
>  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
>  left preconditioning
> PC Object:(st_)
>  type: lu
>   LU: out-of-place factorization
>     matrix ordering: natural
>   LU: tolerance for zero pivot 1e-12
> EPS Object:
>  problem type: generalized symmetric eigenvalue problem
>  method: krylovschur
>  extraction type: Rayleigh-Ritz
>  selected portion of the spectrum: largest eigenvalues in magnitude
>  number of eigenvalues (nev): 4
>  number of column vectors (ncv): 19
>  maximum dimension of projected problem (mpd): 19
>  maximum number of iterations: 6108
>  tolerance: 1e-05
>  dimension of user-provided deflation space: 0
>  IP Object:
>   orthogonalization method:   classical Gram-Schmidt
>   orthogonalization refinement:   if needed (eta: 0.707100)
>  ST Object:
>   type: sinvert
>   shift: 0
>  Matrices A and B have same nonzero pattern
>     Associated KSP object
>     ------------------------------
>     KSP Object:(st_)
>       type: preonly
>       maximum iterations=100000, initial guess is zero
>       tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
>       left preconditioning
>     PC Object:(st_)
>       type: lu
>         LU: out-of-place factorization
>           matrix ordering: natural
>         LU: tolerance for zero pivot 1e-12
>         LU: factor fill ratio needed 0
>              Factored matrix follows
>             Matrix Object:
>               type=mpiaij, rows=58035, cols=58035
>               package used to perform factorization: superlu_dist
>               total: nonzeros=0, allocated nonzeros=116070
>                 SuperLU_DIST run parameters:
>                   Process grid nprow 2 x npcol 2
>                   Equilibrate matrix TRUE
>                   Matrix input mode 1
>                   Replace tiny pivots TRUE
>                   Use iterative refinement FALSE
>                   Processors in row 2 col partition 2
>                   Row permutation LargeDiag
>                   Column permutation PARMETIS
>                   Parallel symbolic factorization TRUE
>                   Repeated factorization SamePattern
>       linear system matrix = precond matrix:
>       Matrix Object:
>         type=mpiaij, rows=58035, cols=58035
>         total: nonzeros=2223621, allocated nonzeros=2233584
>           using I-node (on process 0) routines: found 4695 nodes,  
> limit used is 5
>     ------------------------------
> Number of iterations in the eigensolver: 1
> Number of requested eigenvalues: 4
> Stopping condition: tol=1e-05, maxit=6108
> Number of converged eigenpairs: 8
>
> Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4  
> processors, by fredrik Fri May  8 17:57:28 2009
> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT  
> 2009
>
>                        Max       Max/Min        Avg      Total
> Time (sec):           5.429e+02      1.00001   5.429e+02
> Objects:              1.380e+02      1.00000   1.380e+02
> Flops:                1.053e+08      1.05695   1.028e+08  4.114e+08
> Flops/sec:            1.939e+05      1.05696   1.894e+05  7.577e+05
> Memory:               5.927e+07      1.03224              2.339e+08
> MPI Messages:         2.880e+02      1.51579   2.535e+02  1.014e+03
> MPI Message Lengths:  4.868e+07      1.08170   1.827e+05  1.853e+08
> MPI Reductions:       1.122e+02      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type  
> (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length  
> N --> 2N flops
>                           and VecAXPY() for complex vectors of  
> length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total   counts    %Total     Avg         %Total   counts   %Total
> 0:      Main Stage: 5.4292e+02 100.0%  4.1136e+08 100.0%  1.014e+03  100.0%  1.827e+05      100.0%  3.600e+02  80.2%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on  
> interpreting output.
> Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops: Max - maximum over all processors
>                  Ratio - ratio of maximum to minimum over all  
> processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush()  
> and PetscLogStagePop().
>     %T - percent time in this phase         %F - percent flops in  
> this phase
>     %M - percent messages in this phase     %L - percent message  
> lengths in this phase
>     %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time  
> over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
>     ##########################################################
>     #                                                        #
>     #                          WARNING!!!                    #
>     #                                                        #
>     #   This code was compiled with a debugging option,      #
>     #   To get timing results run config/configure.py        #
>     #   using --with-debugging=no, the performance will      #
>     #   be generally two or three times faster.              #
>     #                                                        #
>     ##########################################################
>
>
> Event                Count      Time (sec)      Flops                             --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> STSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> STApply               28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02 4.2e+03 2.8e+01 95 30 17  0  6  95 30 17  0  8     0
> EPSSetUp               1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.6e+01  2  0  0  0 10   2  0  0  0 13     0
> EPSSolve               1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02 4.2e+03 9.7e+01 69 91 35  1 22  69 91 35  1 27     1
> IPOrthogonalize       19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02 4.2e+03 7.6e+01  0 64 22  1 17   0 64 22  1 21   767
> IPInnerProduct       153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02 4.2e+03 3.9e+01  0 53 23  1  9   0 53 23  1 11   700
> IPApplyMatrix         39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02 4.2e+03 0.0e+00  0 42 23  1  0   0 42 23  1  0   687
> UpdateVectors          1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  4107
> VecDot                 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0   204
> VecNorm                8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   368
> VecScale              27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2629
> VecCopy               53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY               38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1592
> VecMAXPY              38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 11  0  0  0   0 11  0  0  0  1701
> VecAssemblyBegin       5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01 2.1e+04 1.5e+01  0  0  4  0  3   0  0  4  0  4     0
> VecAssemblyEnd         5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin       73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02 8.9e+03 0.0e+00  0  0 45  2  0   0  0 45  2  0     0
> VecScatterEnd         73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecReduceArith        76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 12  0  0  0   0 12  0  0  0  1573
> VecReduceComm         38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.8e+01  0  0  0  0  8   0  0  0  0 11     0
> VecNormalize           8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   511
> MatMult               67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02 4.2e+03 0.0e+00  0 71 40  1  0   0 71 40  1  0   710
> MatSolve              28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
> MatLUFactorSym         1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum         1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatAssemblyBegin       9 1.0 3.3842e-0146.7 0.00e+00 0.0 5.4e+01 6.0e+04 8.0e+00  0  0  5  2  2   0  0  5  2  2     0
> MatAssemblyEnd         9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01 9.4e+02 3.1e+01  0  0  4  0  7   0  0  4  0  9     0
> MatGetRow           5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetSubMatrice       5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02 1.1e+06 2.5e+01  0  0 15 88  6   0  0 15 88  7     0
> MatZeroEntries         2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatView                2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  1     0
> KSPSetup               1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve              28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01 95  0  0  0  6  95  0  0  0  8     0
> PCSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> PCApply               28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory  Descendants'  
> Mem.
>
> --- Event Stage 0: Main Stage
>
>  Spectral Transform     1              1        536     0
> Eigenproblem Solver     1              1        824     0
>      Inner product     1              1        428     0
>          Index Set    38             38    1796776     0
>  IS L to G Mapping     1              1      58700     0
>                Vec    65             65    5458584     0
>        Vec Scatter     9              9       7092     0
>  Application Order     1              1     155232     0
>             Matrix    17             16   17715680     0
>      Krylov Solver     1              1        832     0
>     Preconditioner     1              1        744     0
>             Viewer     2              2       1088     0
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 5.9557e-05
> Average time for zero size MPI_Send(): 2.97427e-05
> #PETSc Option Table entries:
> -log_summary
> -mat_superlu_dist_parsymbfact
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8  
> sizeof(PetscScalar) 8
> Configure run at: Wed May  6 15:14:39 2009
> Configure options: --download-superlu_dist=1 --download-parmetis=1
>   --with-mpi-dir=/usr/lib/mpich --with-shared=0
> -----------------------------------------
> Libraries compiled on Wed May  6 15:14:49 CEST 2009 on medusa1
> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec  
> 12 05:49:32 UTC 2008 x86_64 GNU/Linux
> Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5
> Using PETSc arch: linux-gnu-c-debug
> -----------------------------------------
> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
>   -Wno-strict-aliasing -g3
> Using Fortran compiler: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
> -----------------------------------------
> Using include paths:
>   -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
>   -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include
>   -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
>   -I/usr/lib/mpich/include
> ------------------------------------------
> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
>   -Wno-strict-aliasing -g3
> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
> Using libraries:
>   -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>   -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>   -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11
>   -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>   -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>   -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm -L/usr/lib/mpich/lib
>   -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64 -L/lib64 -ldl -lmpich
>   -lpthread -lrt -lgcc_s -lg2c -lm -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib
>   -lm -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
> ------------------------------------------
>
> real    9m10.616s
> user    0m23.921s
> sys    0m6.944s
>
> Satish Balay wrote:
> Just a note about scalability: it's a function of the hardware as well.
> For proper scalability studies you'll need a true distributed system with
> a fast network [not SMP nodes].
>
> Satish
>
> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>
>
> Hong,
> Thank you for the suggestions, but I have looked at the EPS and KSP objects
> and I cannot find anything wrong. The problem is that it takes longer to
> solve with 4 CPUs than with 2, so the scalability seems to be absent when
> using superlu_dist. I have stored my mass and stiffness matrices in the
> mpiaij format and just passed them on to SLEPc. When using the PETSc
> iterative Krylov solvers I see 100% workload on all processors, but when I
> switch to superlu_dist only two CPUs seem to do all the work of the LU
> factoring. I don't want to use a Krylov solver for the inner solves,
> though, since it might cause SLEPc not to converge.
> Regards,
> Fredrik
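
For reference, a minimal sketch of the kind of SLEPc/PETSc calls that
produce the configuration shown in the log above (generalized symmetric
problem, shift-and-invert, inner solves by LU with SuperLU_DIST). The
matrix names K and M are placeholders, and the calls follow the
petsc-3.0/slepc-3.0-era API used in this thread; in current releases
PCFactorSetMatSolverPackage() has been renamed PCFactorSetMatSolverType():

    #include "slepceps.h"

    /* Sketch, assuming assembled parallel (MPIAIJ) matrices K and M:
       K x = lambda M x solved with shift-and-invert, the inner linear
       solves done by an LU factorization from SuperLU_DIST. */
    PetscErrorCode SolveModes(Mat K, Mat M, EPS *eps_out)
    {
      EPS            eps;
      ST             st;
      KSP            ksp;
      PC             pc;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = EPSCreate(PETSC_COMM_WORLD, &eps);CHKERRQ(ierr);
      ierr = EPSSetOperators(eps, K, M);CHKERRQ(ierr);
      ierr = EPSSetProblemType(eps, EPS_GHEP);CHKERRQ(ierr);   /* generalized symmetric problem */
      ierr = EPSGetST(eps, &st);CHKERRQ(ierr);
      ierr = STSetType(st, "sinvert");CHKERRQ(ierr);           /* shift-and-invert transform */
      ierr = STGetKSP(st, &ksp);CHKERRQ(ierr);
      ierr = KSPSetType(ksp, "preonly");CHKERRQ(ierr);         /* direct solve, no Krylov iteration */
      ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
      ierr = PCFactorSetMatSolverPackage(pc, "superlu_dist");CHKERRQ(ierr);
      ierr = EPSSetFromOptions(eps);CHKERRQ(ierr);             /* honors -eps_*, -st_*, -mat_superlu_dist_* */
      ierr = EPSSolve(eps);CHKERRQ(ierr);
      *eps_out = eps;
      PetscFunctionReturn(0);
    }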
>
> Hong Zhang wrote:
>
> Run your code with '-eps_view -ksp_view' to check which methods are used,
> and with '-log_summary' to see which operations dominate the computation.
>
> You can turn on parallel symbolic factorization
> with '-mat_superlu_dist_parsymbfact'.
>
> Unless you use a large number of processors, symbolic factorization takes
> negligible execution time. The numeric factorization usually dominates.
>
> Hong
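
A small sketch of the programmatic equivalent of those viewing options,
assuming an already configured EPS object named eps (roughly what
'-eps_view' and '-ksp_view' print at runtime):

    #include "slepceps.h"

    /* Sketch: print the eigensolver, spectral transform, and inner linear
       solver settings from code instead of via command-line options. */
    PetscErrorCode ViewSolvers(EPS eps)
    {
      ST             st;
      KSP            ksp;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = EPSView(eps, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);  /* EPS settings */
      ierr = EPSGetST(eps, &st);CHKERRQ(ierr);                       /* spectral transform used by EPS */
      ierr = STGetKSP(st, &ksp);CHKERRQ(ierr);                       /* KSP doing the inner solves */
      ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);  /* KSP/PC settings */
      PetscFunctionReturn(0);
    }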
>
> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>
>
> Hi PETSc team,
> Sorry for posting questions not really concerning the PETSc core, but when
> I run superlu_dist from within SLEPc I notice that the load balance is
> poor. It is just fine during assembly (I use METIS to partition my finite
> element mesh), but it changes dramatically when the SLEPc solver is
> called. I use superlu_dist as the solver in the eigenvalue iteration. My
> question is: can this have something to do with the fact that the option
> 'Parallel symbolic factorization' is set to false? If so, can I change the
> options passed to superlu_dist, for instance with MatSetOption? Also, does
> this mean that superlu_dist is not using ParMETIS to reorder the matrix?
> Best Regards,
> Fredrik Bengzon
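
As a side note on the options question: SuperLU_DIST-specific settings in
PETSc are read from the options database (the '-mat_superlu_dist_*' family)
rather than through MatSetOption(). A minimal sketch of setting the flag
discussed in this thread from code, using the two-argument petsc-3.0-era
form of PetscOptionsSetValue() (newer releases take a PetscOptions object
as the first argument):

    #include "petsc.h"

    /* Sketch: push a SuperLU_DIST runtime option into the options database
       before the factorization is set up; equivalent to passing
       -mat_superlu_dist_parsymbfact on the command line. */
    PetscErrorCode EnableParallelSymbolicFactorization(void)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact", PETSC_NULL);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }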
>
> -- 
> What most experimenters take for granted before they begin their  
> experiments is infinitely more interesting than any results to which  
> their experiments lead.
> -- Norbert Wiener


