superlu_dist options
Fredrik Bengzon
fredrik.bengzon at math.umu.se
Fri May 8 17:26:28 CDT 2009
Hi again,
I have resorted to using MUMPS, which seems to scale very well within SLEPc.
However, I have another question: how do you sort an MPI vector in PETSc,
and can you also get the permutation?
/Fredrik
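
(As far as I know there is no built-in parallel sort for a Vec, so one
workaround is sketched below with PETSc 3.0-era calling conventions: scatter
the Vec onto every process with VecScatterCreateToAll() and build the
permutation locally with PetscSortRealWithPermutation(). The helper name
SortVecWithPermutation, the choice to return freshly allocated arrays, and
the real-valued build are assumptions; gathering the whole vector on every
process only makes sense for moderate sizes.)

    #include "petscvec.h"

    /* Gather a parallel Vec onto every process and compute an index
       permutation such that vals[perm[i]] is in increasing order.
       Sketch only: assumes a real-valued PETSc build. */
    PetscErrorCode SortVecWithPermutation(Vec x, PetscReal **vals, PetscInt **perm)
    {
      VecScatter     ctx;
      Vec            xall;
      PetscScalar   *a;
      PetscInt       n, i;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = VecScatterCreateToAll(x, &ctx, &xall); CHKERRQ(ierr);
      ierr = VecScatterBegin(ctx, x, xall, INSERT_VALUES, SCATTER_FORWARD); CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, x, xall, INSERT_VALUES, SCATTER_FORWARD); CHKERRQ(ierr);

      ierr = VecGetSize(x, &n); CHKERRQ(ierr);
      ierr = PetscMalloc(n*sizeof(PetscReal), vals); CHKERRQ(ierr);
      ierr = PetscMalloc(n*sizeof(PetscInt), perm); CHKERRQ(ierr);

      ierr = VecGetArray(xall, &a); CHKERRQ(ierr);
      for (i = 0; i < n; i++) { (*vals)[i] = PetscRealPart(a[i]); (*perm)[i] = i; }
      ierr = VecRestoreArray(xall, &a); CHKERRQ(ierr);

      /* Leaves vals[] untouched and reorders perm[] so that vals[perm[i]]
         is sorted in increasing order. */
      ierr = PetscSortRealWithPermutation(n, *vals, *perm); CHKERRQ(ierr);

      ierr = VecScatterDestroy(ctx); CHKERRQ(ierr);  /* PETSc 3.0 takes the objects directly */
      ierr = VecDestroy(xall); CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }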
Barry Smith wrote:
>
> On May 8, 2009, at 11:03 AM, Matthew Knepley wrote:
>
>> Look at the timing. The symbolic factorization takes 1e-4 seconds and
>> the numeric factorization only 10 s, out of 542 s total. MatSolve is
>> taking 517 s; if you have a problem, it is likely there. However, the
>> MatSolve looks well balanced across processes.
>
> Something is funky here. The 28 solves should not take so much longer
> than the numeric factorization.
> Perhaps it is worth saving the matrix and reporting this as a
> performance bug to Sherrie.
>
> Barry
>
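
(For reference, a minimal sketch of saving the matrix in PETSc binary format
so it can be attached to such a report; the function name and the file name
are placeholders, and PETSc 3.0-era calls are assumed.)

    #include "petscmat.h"

    /* Write a (parallel) matrix to a PETSc binary file so it can be
       reloaded later with MatLoad() as a standalone test case. */
    PetscErrorCode DumpMatrixBinary(Mat A)
    {
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "stiffness.dat",
                                   FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
      ierr = MatView(A, viewer); CHKERRQ(ierr);
      ierr = PetscViewerDestroy(viewer); CHKERRQ(ierr);  /* PETSc 3.0 takes the viewer directly */
      PetscFunctionReturn(0);
    }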
>>
>>
>> Matt
>>
>> On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon
>> <fredrik.bengzon at math.umu.se> wrote:
>> Hi,
>> Here is the output from the KSP and EPS objects, and the log summary.
>> / Fredrik
>>
>>
>> Reading Triangle/Tetgen mesh
>> #nodes=19345
>> #elements=81895
>> #nodes per element=4
>> Partitioning mesh with METIS 4.0
>> Element distribution (rank | #elements)
>> 0 | 19771
>> 1 | 20954
>> 2 | 20611
>> 3 | 20559
>> rank 1 has 257 ghost nodes
>> rank 0 has 127 ghost nodes
>> rank 2 has 143 ghost nodes
>> rank 3 has 270 ghost nodes
>> Calling 3D Navier-Lame Eigenvalue Solver
>> Assembling stiffness and mass matrix
>> Solving eigensystem with SLEPc
>> KSP Object:(st_)
>> type: preonly
>> maximum iterations=100000, initial guess is zero
>> tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>> left preconditioning
>> PC Object:(st_)
>> type: lu
>> LU: out-of-place factorization
>> matrix ordering: natural
>> LU: tolerance for zero pivot 1e-12
>> EPS Object:
>> problem type: generalized symmetric eigenvalue problem
>> method: krylovschur
>> extraction type: Rayleigh-Ritz
>> selected portion of the spectrum: largest eigenvalues in magnitude
>> number of eigenvalues (nev): 4
>> number of column vectors (ncv): 19
>> maximum dimension of projected problem (mpd): 19
>> maximum number of iterations: 6108
>> tolerance: 1e-05
>> dimension of user-provided deflation space: 0
>> IP Object:
>> orthogonalization method: classical Gram-Schmidt
>> orthogonalization refinement: if needed (eta: 0.707100)
>> ST Object:
>> type: sinvert
>> shift: 0
>> Matrices A and B have same nonzero pattern
>> Associated KSP object
>> ------------------------------
>> KSP Object:(st_)
>> type: preonly
>> maximum iterations=100000, initial guess is zero
>> tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>> left preconditioning
>> PC Object:(st_)
>> type: lu
>> LU: out-of-place factorization
>> matrix ordering: natural
>> LU: tolerance for zero pivot 1e-12
>> LU: factor fill ratio needed 0
>> Factored matrix follows
>> Matrix Object:
>> type=mpiaij, rows=58035, cols=58035
>> package used to perform factorization: superlu_dist
>> total: nonzeros=0, allocated nonzeros=116070
>> SuperLU_DIST run parameters:
>> Process grid nprow 2 x npcol 2
>> Equilibrate matrix TRUE
>> Matrix input mode 1
>> Replace tiny pivots TRUE
>> Use iterative refinement FALSE
>> Processors in row 2 col partition 2
>> Row permutation LargeDiag
>> Column permutation PARMETIS
>> Parallel symbolic factorization TRUE
>> Repeated factorization SamePattern
>> linear system matrix = precond matrix:
>> Matrix Object:
>> type=mpiaij, rows=58035, cols=58035
>> total: nonzeros=2223621, allocated nonzeros=2233584
>> using I-node (on process 0) routines: found 4695 nodes,
>> limit used is 5
>> ------------------------------
>> Number of iterations in the eigensolver: 1
>> Number of requested eigenvalues: 4
>> Stopping condition: tol=1e-05, maxit=6108
>> Number of converged eigenpairs: 8
>>
>> Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu
>> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu
>> ************************************************************************************************************************
>>
>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript
>> -r -fCourier9' to print this document ***
>> ************************************************************************************************************************
>>
>>
>> ---------------------------------------------- PETSc Performance
>> Summary: ----------------------------------------------
>>
>> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4
>> processors, by fredrik Fri May 8 17:57:28 2009
>> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009
>>
>> Max Max/Min Avg Total
>> Time (sec): 5.429e+02 1.00001 5.429e+02
>> Objects: 1.380e+02 1.00000 1.380e+02
>> Flops: 1.053e+08 1.05695 1.028e+08 4.114e+08
>> Flops/sec: 1.939e+05 1.05696 1.894e+05 7.577e+05
>> Memory: 5.927e+07 1.03224 2.339e+08
>> MPI Messages: 2.880e+02 1.51579 2.535e+02 1.014e+03
>> MPI Message Lengths: 4.868e+07 1.08170 1.827e+05 1.853e+08
>> MPI Reductions: 1.122e+02 1.00000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>> e.g., VecAXPY() for real vectors of length
>> N --> 2N flops
>> and VecAXPY() for complex vectors of length
>> N --> 8N flops
>>
>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>> Messages --- -- Message Lengths -- -- Reductions --
>> Avg %Total Avg %Total counts
>> %Total Avg %Total counts %Total
>> 0: Main Stage: 5.4292e+02 100.0% 4.1136e+08 100.0% 1.014e+03
>> 100.0% 1.827e+05 100.0% 3.600e+02 80.2%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> Phase summary info:
>> Count: number of times phase was executed
>> Time and Flops: Max - maximum over all processors
>> Ratio - ratio of maximum to minimum over all processors
>> Mess: number of messages sent
>> Avg. len: average message length
>> Reduct: number of global reductions
>> Global: entire computation
>> Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>> %T - percent time in this phase %F - percent flops in
>> this phase
>> %M - percent messages in this phase %L - percent message
>> lengths in this phase
>> %R - percent reductions in this phase
>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
>> over all processors)
>> ------------------------------------------------------------------------------------------------------------------------
>>
>>
>>
>> ##########################################################
>> # #
>> # WARNING!!! #
>> # #
>> # This code was compiled with a debugging option, #
>> # To get timing results run config/configure.py #
>> # using --with-debugging=no, the performance will #
>> # be generally two or three times faster. #
>> # #
>> ##########################################################
>>
>>
>> Event Count Time (sec)
>> Flops --- Global --- --- Stage --- Total
>> Max Ratio Max Ratio Max Ratio Mess Avg
>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>>
>>
>> --- Event Stage 0: Main Stage
>>
>> STSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0
>> STApply 28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02
>> 4.2e+03 2.8e+01 95 30 17 0 6 95 30 17 0 8 0
>> EPSSetUp 1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 4.6e+01 2 0 0 0 10 2 0 0 0 13 0
>> EPSSolve 1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02
>> 4.2e+03 9.7e+01 69 91 35 1 22 69 91 35 1 27 1
>> IPOrthogonalize 19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02
>> 4.2e+03 7.6e+01 0 64 22 1 17 0 64 22 1 21 767
>> IPInnerProduct 153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02
>> 4.2e+03 3.9e+01 0 53 23 1 9 0 53 23 1 11 700
>> IPApplyMatrix 39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02
>> 4.2e+03 0.0e+00 0 42 23 1 0 0 42 23 1 0 687
>> UpdateVectors 1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00
>> 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 4107
>> VecDot 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00
>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 204
>> VecNorm 8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00
>> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 368
>> VecScale 27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2629
>> VecCopy 53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecSet 77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecAXPY 38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00
>> 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1592
>> VecMAXPY 38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00
>> 0.0e+00 0.0e+00 0 11 0 0 0 0 11 0 0 0 1701
>> VecAssemblyBegin 5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01
>> 2.1e+04 1.5e+01 0 0 4 0 3 0 0 4 0 4 0
>> VecAssemblyEnd 5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecScatterBegin 73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02
>> 8.9e+03 0.0e+00 0 0 45 2 0 0 0 45 2 0 0
>> VecScatterEnd 73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecReduceArith 76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00
>> 0.0e+00 0.0e+00 0 12 0 0 0 0 12 0 0 0 1573
>> VecReduceComm 38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 3.8e+01 0 0 0 0 8 0 0 0 0 11 0
>> VecNormalize 8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00
>> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 511
>> MatMult 67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02
>> 4.2e+03 0.0e+00 0 71 40 1 0 0 71 40 1 0 710
>> MatSolve 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0
>> MatLUFactorSym 1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatLUFactorNum 1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>> MatAssemblyBegin 9 1.0 3.3842e-01 46.7 0.00e+00 0.0 5.4e+01
>> 6.0e+04 8.0e+00 0 0 5 2 2 0 0 5 2 2 0
>> MatAssemblyEnd 9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01
>> 9.4e+02 3.1e+01 0 0 4 0 7 0 0 4 0 9 0
>> MatGetRow 5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatGetSubMatrice 5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02
>> 1.1e+06 2.5e+01 0 0 15 88 6 0 0 15 88 7 0
>> MatZeroEntries 2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatView 2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0
>> KSPSetup 1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> KSPSolve 28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 2.8e+01 95 0 0 0 6 95 0 0 0 8 0
>> PCSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0
>> PCApply 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0
>> ------------------------------------------------------------------------------------------------------------------------
>>
>>
>> Memory usage is given in bytes:
>>
>> Object Type Creations Destructions Memory Descendants'
>> Mem.
>>
>> --- Event Stage 0: Main Stage
>>
>> Spectral Transform 1 1 536 0
>> Eigenproblem Solver 1 1 824 0
>> Inner product 1 1 428 0
>> Index Set 38 38 1796776 0
>> IS L to G Mapping 1 1 58700 0
>> Vec 65 65 5458584 0
>> Vec Scatter 9 9 7092 0
>> Application Order 1 1 155232 0
>> Matrix 17 16 17715680 0
>> Krylov Solver 1 1 832 0
>> Preconditioner 1 1 744 0
>> Viewer 2 2 1088 0
>> ========================================================================================================================
>>
>> Average time to get PetscTime(): 1.90735e-07
>> Average time for MPI_Barrier(): 5.9557e-05
>> Average time for zero size MPI_Send(): 2.97427e-05
>> #PETSc Option Table entries:
>> -log_summary
>> -mat_superlu_dist_parsymbfact
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 8
>> Configure run at: Wed May 6 15:14:39 2009
>> Configure options: --download-superlu_dist=1 --download-parmetis=1
>> --with-mpi-dir=/usr/lib/mpich --with-shared=0
>> -----------------------------------------
>> Libraries compiled on Wed May 6 15:14:49 CEST 2009 on medusa1
>> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec
>> 12 05:49:32 UTC 2008 x86_64 GNU/Linux
>> Using PETSc directory:
>> /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5
>> Using PETSc arch: linux-gnu-c-debug
>> -----------------------------------------
>> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
>> -Wno-strict-aliasing -g3 Using Fortran compiler:
>> /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
>> -----------------------------------------
>> Using include paths:
>> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
>> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include
>> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include
>> -I/usr/lib/mpich/include ------------------------------------------
>> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings
>> -Wno-strict-aliasing -g3
>> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall
>> -Wno-unused-variable -g Using libraries:
>> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec
>> -lpetsc -lX11
>> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib
>> -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm
>> -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2
>> -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm
>> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich
>> -lpthread -lrt -lgcc_s -ldl
>> ------------------------------------------
>>
>> real 9m10.616s
>> user 0m23.921s
>> sys 0m6.944s
>>
>> Satish Balay wrote:
>> Just a note about scalability: it is a function of the hardware as
>> well. For proper scalability studies you will need a true distributed
>> system with a fast network [not SMP nodes].
>>
>> Satish
>>
>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>
>>
>> Hong,
>> Thank you for the suggestions, but I have looked at the EPS and KSP
>> objects and I cannot find anything wrong. The problem is that it takes
>> longer to solve with 4 CPUs than with 2, so the scalability seems to be
>> absent when using superlu_dist. I have stored my mass and stiffness
>> matrices in the mpiaij format and just passed them on to SLEPc. When
>> using the PETSc iterative Krylov solvers I see a 100% workload on all
>> processors, but when I switch to superlu_dist only two CPUs seem to do
>> all the work of the LU factoring. I don't want to use the Krylov
>> solvers, though, since they might cause SLEPc not to converge.
>> Regards,
>> Fredrik
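
(A minimal sketch of the setup described above, with SLEPc 3.0-era calls;
the handles K and M for the assembled mpiaij stiffness and mass matrices,
and the function name, are placeholders.)

    #include "slepceps.h"

    /* Generalized symmetric eigenproblem K x = lambda M x solved with
       shift-and-invert, as in the -eps_view output earlier in the thread. */
    PetscErrorCode SolveGeneralizedEVP(Mat K, Mat M)
    {
      EPS            eps;
      ST             st;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = EPSCreate(PETSC_COMM_WORLD, &eps); CHKERRQ(ierr);
      ierr = EPSSetOperators(eps, K, M); CHKERRQ(ierr);
      ierr = EPSSetProblemType(eps, EPS_GHEP); CHKERRQ(ierr);
      ierr = EPSGetST(eps, &st); CHKERRQ(ierr);
      ierr = STSetType(st, "sinvert"); CHKERRQ(ierr);   /* spectral transform named in the output above */
      ierr = EPSSetFromOptions(eps); CHKERRQ(ierr);     /* picks up -st_ksp_*, -st_pc_*, -mat_superlu_dist_* */
      ierr = EPSSolve(eps); CHKERRQ(ierr);
      ierr = EPSDestroy(eps); CHKERRQ(ierr);            /* SLEPc 3.0 takes the object directly */
      PetscFunctionReturn(0);
    }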
>>
>> Hong Zhang wrote:
>>
>> Run your code with '-eps_view -ksp_view' to check which methods are
>> used, and with '-log_summary' to see which operations dominate the
>> computation.
>>
>> You can turn on parallel symbolic factorization
>> with '-mat_superlu_dist_parsymbfact'.
>>
>> Unless you use a large number of processors, the symbolic factorization
>> takes negligible execution time; the numeric factorization usually
>> dominates.
>>
>> Hong
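
(The same information that '-eps_view -ksp_view' prints can also be obtained
from the code after the solve; a small sketch, assuming eps is the SLEPc
solver object and the usual SLEPc headers are included.)

    /* View the eigensolver and the linear solver hidden inside its
       spectral transform (roughly what -eps_view -ksp_view print). */
    PetscErrorCode ViewSolverSettings(EPS eps)
    {
      ST             st;
      KSP            ksp;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = EPSView(eps, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
      ierr = EPSGetST(eps, &st); CHKERRQ(ierr);
      ierr = STGetKSP(st, &ksp); CHKERRQ(ierr);
      ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }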
>>
>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>
>>
>> Hi PETSc team,
>> Sorry for posting questions not really concerning the PETSc core, but
>> when I run superlu_dist from within SLEPc I notice that the load balance
>> is poor. It is just fine during assembly (I use METIS to partition my
>> finite element mesh), but it changes dramatically when the SLEPc solver
>> is called. I use superlu_dist as the linear solver for the eigenvalue
>> iteration. My question is: can this have something to do with the fact
>> that the option 'Parallel symbolic factorization' is set to false? If
>> so, can I change the superlu_dist options, using MatSetOption() for
>> instance? Also, does this mean that superlu_dist is not using ParMETIS
>> to reorder the matrix?
>> Best Regards,
>> Fredrik Bengzon
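
(A hedged sketch related to the MatSetOption question: in PETSc 3.0 the
SuperLU_DIST parameters are runtime options read from the options database
when the factorization is set up, so they are normally given on the command
line or pushed from the code as below rather than set through MatSetOption().
'-mat_superlu_dist_parsymbfact' appears in the option table earlier in the
thread; the column-permutation option name is an assumption. A declared
PetscErrorCode ierr and an initialized program are assumed.)

    /* Push SuperLU_DIST runtime options from the code, before the
       factorization is set up (i.e. before EPSSolve/KSPSetUp). */
    ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact", "1"); CHKERRQ(ierr);
    ierr = PetscOptionsSetValue("-mat_superlu_dist_colperm", "PARMETIS"); CHKERRQ(ierr);  /* option name assumed */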
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>
>