superlu_dist options

Matthew Knepley knepley at gmail.com
Fri May 8 11:03:53 CDT 2009


Look at the timing. The symbolic factorization takes 1e-4 seconds and the
numeric factorization only about 10s, out of 542s total. MatSolve takes 517s,
so if you have a problem, it is likely there. However, the MatSolve times
look balanced across the processes.
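
One way to dig further into the MatSolve time is to rebuild PETSc without
debugging (as the banner in the log below suggests) and let SuperLU_DIST print
its own per-process statistics. A minimal sketch, reusing the configure options
and 4-process run of a.out from the log; the -mat_superlu_dist_statprint option
name and the mpirun syntax are assumptions to verify against -help and your MPI:

  config/configure.py --download-superlu_dist=1 --download-parmetis=1 \
      --with-mpi-dir=/usr/lib/mpich --with-shared=0 --with-debugging=no
  mpirun -np 4 ./a.out -log_summary -mat_superlu_dist_parsymbfact \
      -mat_superlu_dist_statprint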

  Matt

On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <fredrik.bengzon at math.umu.se> wrote:

> Hi,
> Here is the output from the KSP and EPS objects, and the log summary.
> / Fredrik
>
>
> Reading Triangle/Tetgen mesh
> #nodes=19345
> #elements=81895
> #nodes per element=4
> Partitioning mesh with METIS 4.0
> Element distribution (rank | #elements)
> 0 | 19771
> 1 | 20954
> 2 | 20611
> 3 | 20559
> rank 1 has 257 ghost nodes
> rank 0 has 127 ghost nodes
> rank 2 has 143 ghost nodes
> rank 3 has 270 ghost nodes
> Calling 3D Navier-Lame Eigenvalue Solver
> Assembling stiffness and mass matrix
> Solving eigensystem with SLEPc
> KSP Object:(st_)
>  type: preonly
>  maximum iterations=100000, initial guess is zero
>  tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
>  left preconditioning
> PC Object:(st_)
>  type: lu
>   LU: out-of-place factorization
>     matrix ordering: natural
>   LU: tolerance for zero pivot 1e-12
> EPS Object:
>  problem type: generalized symmetric eigenvalue problem
>  method: krylovschur
>  extraction type: Rayleigh-Ritz
>  selected portion of the spectrum: largest eigenvalues in magnitude
>  number of eigenvalues (nev): 4
>  number of column vectors (ncv): 19
>  maximum dimension of projected problem (mpd): 19
>  maximum number of iterations: 6108
>  tolerance: 1e-05
>  dimension of user-provided deflation space: 0
>  IP Object:
>   orthogonalization method:   classical Gram-Schmidt
>   orthogonalization refinement:   if needed (eta: 0.707100)
>  ST Object:
>   type: sinvert
>   shift: 0
>  Matrices A and B have same nonzero pattern
>     Associated KSP object
>     ------------------------------
>     KSP Object:(st_)
>       type: preonly
>       maximum iterations=100000, initial guess is zero
>       tolerances:  relative=1e-08, absolute=1e-50, divergence=10000
>       left preconditioning
>     PC Object:(st_)
>       type: lu
>         LU: out-of-place factorization
>           matrix ordering: natural
>         LU: tolerance for zero pivot 1e-12
>         LU: factor fill ratio needed 0
>              Factored matrix follows
>             Matrix Object:
>               type=mpiaij, rows=58035, cols=58035
>               package used to perform factorization: superlu_dist
>               total: nonzeros=0, allocated nonzeros=116070
>                 SuperLU_DIST run parameters:
>                   Process grid nprow 2 x npcol 2
>                   Equilibrate matrix TRUE
>                   Matrix input mode 1
>                   Replace tiny pivots TRUE
>                   Use iterative refinement FALSE
>                   Processors in row 2 col partition 2
>                   Row permutation LargeDiag
>                   Column permutation PARMETIS
>                   Parallel symbolic factorization TRUE
>                   Repeated factorization SamePattern
>       linear system matrix = precond matrix:
>       Matrix Object:
>         type=mpiaij, rows=58035, cols=58035
>         total: nonzeros=2223621, allocated nonzeros=2233584
>           using I-node (on process 0) routines: found 4695 nodes, limit
> used is 5
>     ------------------------------
> Number of iterations in the eigensolver: 1
> Number of requested eigenvalues: 4
> Stopping condition: tol=1e-05, maxit=6108
> Number of converged eigenpairs: 8
>
> Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4
> processors, by fredrik Fri May  8 17:57:28 2009
> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009
>
>                        Max       Max/Min        Avg      Total
> Time (sec):           5.429e+02      1.00001   5.429e+02
> Objects:              1.380e+02      1.00000   1.380e+02
> Flops:                1.053e+08      1.05695   1.028e+08  4.114e+08
> Flops/sec:            1.939e+05      1.05696   1.894e+05  7.577e+05
> Memory:               5.927e+07      1.03224              2.339e+08
> MPI Messages:         2.880e+02      1.51579   2.535e+02  1.014e+03
> MPI Message Lengths:  4.868e+07      1.08170   1.827e+05  1.853e+08
> MPI Reductions:       1.122e+02      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N -->
> 2N flops
>                           and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---
>  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total   counts   %Total
>   Avg         %Total   counts   %Total
> 0:      Main Stage: 5.4292e+02 100.0%  4.1136e+08 100.0%  1.014e+03 100.0%
>  1.827e+05      100.0%  3.600e+02  80.2%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops: Max - maximum over all processors
>                  Ratio - ratio of maximum to minimum over all processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>     %T - percent time in this phase         %F - percent flops in this
> phase
>     %M - percent messages in this phase     %L - percent message lengths in
> this phase
>     %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>     ##########################################################
>     #                                                        #
>     #                          WARNING!!!                    #
>     #                                                        #
>     #   This code was compiled with a debugging option,      #
>     #   To get timing results run config/configure.py        #
>     #   using --with-debugging=no, the performance will      #
>     #   be generally two or three times faster.              #
>     #                                                        #
>     ##########################################################
>
>
> Event                Count      Time (sec)     Flops
>       --- Global ---  --- Stage ---   Total
>                  Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> STSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> STApply               28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02 4.2e+03
> 2.8e+01 95 30 17  0  6  95 30 17  0  8     0
> EPSSetUp               1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 4.6e+01  2  0  0  0 10   2  0  0  0 13     0
> EPSSolve               1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02 4.2e+03
> 9.7e+01 69 91 35  1 22  69 91 35  1 27     1
> IPOrthogonalize       19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02 4.2e+03
> 7.6e+01  0 64 22  1 17   0 64 22  1 21   767
> IPInnerProduct       153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02 4.2e+03
> 3.9e+01  0 53 23  1  9   0 53 23  1 11   700
> IPApplyMatrix         39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02 4.2e+03
> 0.0e+00  0 42 23  1  0   0 42 23  1  0   687
> UpdateVectors          1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00 0.0e+00
> 0.0e+00  0  4  0  0  0   0  4  0  0  0  4107
> VecDot                 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0   204
> VecNorm                8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00 0.0e+00
> 8.0e+00  0  0  0  0  2   0  0  0  0  2   368
> VecScale              27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2629
> VecCopy               53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY               38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   0  1  0  0  0  1592
> VecMAXPY              38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00 0.0e+00
> 0.0e+00  0 11  0  0  0   0 11  0  0  0  1701
> VecAssemblyBegin       5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01 2.1e+04
> 1.5e+01  0  0  4  0  3   0  0  4  0  4     0
> VecAssemblyEnd         5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin       73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02 8.9e+03
> 0.0e+00  0  0 45  2  0   0  0 45  2  0     0
> VecScatterEnd         73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecReduceArith        76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00 0.0e+00
> 0.0e+00  0 12  0  0  0   0 12  0  0  0  1573
> VecReduceComm         38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.8e+01  0  0  0  0  8   0  0  0  0 11     0
> VecNormalize           8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00 0.0e+00
> 8.0e+00  0  0  0  0  2   0  0  0  0  2   511
> MatMult               67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02 4.2e+03
> 0.0e+00  0 71 40  1  0   0 71 40  1  0   710
> MatSolve              28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
> MatLUFactorSym         1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum         1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatAssemblyBegin       9 1.0 3.3842e-01 46.7 0.00e+00 0.0 5.4e+01 6.0e+04
> 8.0e+00  0  0  5  2  2   0  0  5  2  2     0
> MatAssemblyEnd         9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01 9.4e+02
> 3.1e+01  0  0  4  0  7   0  0  4  0  9     0
> MatGetRow           5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetSubMatrice       5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02 1.1e+06
> 2.5e+01  0  0 15 88  6   0  0 15 88  7     0
> MatZeroEntries         2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatView                2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  1     0
> KSPSetup               1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve              28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.8e+01 95  0  0  0  6  95  0  0  0  8     0
> PCSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> PCApply               28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory  Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>  Spectral Transform     1              1        536     0
> Eigenproblem Solver     1              1        824     0
>      Inner product     1              1        428     0
>          Index Set    38             38    1796776     0
>  IS L to G Mapping     1              1      58700     0
>                Vec    65             65    5458584     0
>        Vec Scatter     9              9       7092     0
>  Application Order     1              1     155232     0
>             Matrix    17             16   17715680     0
>      Krylov Solver     1              1        832     0
>     Preconditioner     1              1        744     0
>             Viewer     2              2       1088     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 5.9557e-05
> Average time for zero size MPI_Send(): 2.97427e-05
> #PETSc Option Table entries:
> -log_summary
> -mat_superlu_dist_parsymbfact
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Wed May  6 15:14:39 2009
> Configure options: --download-superlu_dist=1 --download-parmetis=1
> --with-mpi-dir=/usr/lib/mpich --with-shared=0
> -----------------------------------------
> Libraries compiled on Wed May  6 15:14:49 CEST 2009 on medusa1
> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec 12
> 05:49:32 UTC 2008 x86_64 GNU/Linux
> Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5
> Using PETSc arch: linux-gnu-c-debug
> -----------------------------------------
> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
> -Wno-strict-aliasing -g3  Using Fortran compiler: /usr/lib/mpich/bin/mpif77
> -Wall -Wno-unused-variable -g   -----------------------------------------
> Using include paths:
> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include
> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
> -I/usr/lib/mpich/include  ------------------------------------------
> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
> -Wno-strict-aliasing -g3
> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable
> -g Using libraries:
> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
>    -lX11
> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
> -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm
> -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64
> -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm
> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich -lpthread -lrt
> -lgcc_s -ldl
> ------------------------------------------
>
> real    9m10.616s
> user    0m23.921s
> sys    0m6.944s
>
> Satish Balay wrote:
>
>> Just a note about scalability: it's a function of the hardware as
>> well. For proper scalability studies you'll need a true distributed
>> system with a fast network [not SMP nodes].
>>
>> Satish
>>
>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>
>>
>>
>>> Hong,
>>> Thank you for the suggestions, but I have looked at the EPS and KSP objects
>>> and I cannot find anything wrong. The problem is that it takes longer to
>>> solve with 4 CPUs than with 2, so the scalability seems to be absent when
>>> using superlu_dist. I have stored my mass and stiffness matrices in the
>>> mpiaij format and just passed them on to SLEPc. When using the PETSc
>>> iterative Krylov solvers I see 100% workload on all processors, but when I
>>> switch to superlu_dist only two CPUs seem to do the whole work of LU
>>> factoring. I don't want to use the Krylov solver, though, since it might
>>> cause SLEPc not to converge.
>>> Regards,
>>> Fredrik
>>>
>>> Hong Zhang wrote:
>>>
>>>
>>>> Run your code with '-eps_view -ksp_view' to check which methods are
>>>> used, and with '-log_summary' to see which operations dominate the
>>>> computation.
>>>>
>>>> You can turn on parallel symbolic factorization
>>>> with '-mat_superlu_dist_parsymbfact'.
>>>>
>>>> Unless you use a large number of processors, symbolic factorization
>>>> takes negligible execution time; the numeric factorization usually
>>>> dominates.
>>>>
>>>> Hong
>>>>
>>>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>>>
>>>>
>>>>
>>>>> Hi PETSc team,
>>>>> Sorry for posting questions not really concerning the PETSc core, but
>>>>> when I run superlu_dist from within SLEPc I notice that the load balance
>>>>> is poor. It is fine during assembly (I use METIS to partition my finite
>>>>> element mesh), but it changes dramatically when the SLEPc solver is
>>>>> called. I use superlu_dist as the solver for the eigenvalue iteration.
>>>>> My question is: can this have something to do with the option 'Parallel
>>>>> symbolic factorization' being set to false? If so, can I change the
>>>>> superlu_dist options, using MatSetOption for instance (a sketch of doing
>>>>> this through the PETSc options database appears after the quoted thread
>>>>> below)? Also, does this mean that superlu_dist is not using ParMETIS to
>>>>> reorder the matrix?
>>>>> Best Regards,
>>>>> Fredrik Bengzon
>>>>>
>>>>
>>>
>>
>
>
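
For reference, the SuperLU_DIST settings discussed in the thread can also be
pushed into the PETSc options database from code, rather than via MatSetOption
or the command line. The sketch below is only that: it assumes the PETSc
3.0-era two-argument PetscOptionsSetValue(), and the colperm/statprint option
names are not taken from the thread and should be checked against -help. The
calls must be made before the factorization is set up, i.e. before EPSSolve().

/* Minimal sketch: set SuperLU_DIST runtime options from code.  Only
   -mat_superlu_dist_parsymbfact appears in the thread; the other option
   names are assumptions to verify with -help for this PETSc build. */
#include "petsc.h"

PetscErrorCode SetSuperLUDistOptions(void)
{
  PetscErrorCode ierr;

  /* parallel symbolic factorization, as used in the log above */
  ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact","1");CHKERRQ(ierr);
  /* ParMETIS column ordering (the log already reports PARMETIS) */
  ierr = PetscOptionsSetValue("-mat_superlu_dist_colperm","PARMETIS");CHKERRQ(ierr);
  /* ask SuperLU_DIST to print its own per-process statistics */
  ierr = PetscOptionsSetValue("-mat_superlu_dist_statprint","1");CHKERRQ(ierr);
  return 0;
}

In this PETSc version the factorization package itself is selected with the
-pc_factor_mat_solver_package option (here under the st_ prefix, e.g.
-st_pc_factor_mat_solver_package superlu_dist) when it is not hard-coded.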


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener