superlu_dist options
Barry Smith
bsmith at mcs.anl.gov
Fri May 8 17:28:16 CDT 2009
I don't think we have any parallel sorts in PETSc.
Barry
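
Since PETSc provides no parallel sort, one common workaround is to gather the
vector onto a single process and sort it there, keeping the permutation. Below
is a minimal sketch of that approach (not a PETSc routine): the helper name is
made up, error checking is abbreviated, and a real-valued PetscScalar build is
assumed.

    #include "petscvec.h"

    /* Gather a parallel Vec onto rank 0 of its communicator and sort it
       there, keeping the permutation. */
    PetscErrorCode SortVecOnRankZero(Vec x)
    {
      MPI_Comm       comm;
      VecScatter     scatter;
      Vec            xseq;
      PetscScalar    *a;
      PetscInt       n,i,*perm;
      PetscMPIInt    rank;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscObjectGetComm((PetscObject)x,&comm);CHKERRQ(ierr);
      ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
      /* bring every entry of x to process 0 */
      ierr = VecScatterCreateToZero(x,&scatter,&xseq);CHKERRQ(ierr);
      ierr = VecScatterBegin(scatter,x,xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterEnd(scatter,x,xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
      if (!rank) {
        ierr = VecGetLocalSize(xseq,&n);CHKERRQ(ierr);
        ierr = VecGetArray(xseq,&a);CHKERRQ(ierr);
        ierr = PetscMalloc(n*sizeof(PetscInt),&perm);CHKERRQ(ierr);
        for (i=0; i<n; i++) perm[i] = i;
        /* reorders perm so that a[perm[0]] <= a[perm[1]] <= ...; a itself
           is untouched; cast is valid for real-valued PetscScalar only */
        ierr = PetscSortRealWithPermutation(n,(PetscReal*)a,perm);CHKERRQ(ierr);
        /* ... use the values and the permutation here ... */
        ierr = PetscFree(perm);CHKERRQ(ierr);
        ierr = VecRestoreArray(xseq,&a);CHKERRQ(ierr);
      }
      ierr = VecScatterDestroy(scatter);CHKERRQ(ierr);  /* petsc-3.0 calling sequence */
      ierr = VecDestroy(xseq);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
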
On May 8, 2009, at 5:26 PM, Fredrik Bengzon wrote:
> Hi again,
> I resorted to using MUMPS in SLEPc, which seems to scale very well.
> However, I have another question: how do you sort an MPI vector in
> PETSc, and can you get the permutation as well?
> /Fredrik
>
>
> Barry Smith wrote:
>>
>> On May 8, 2009, at 11:03 AM, Matthew Knepley wrote:
>>
>>> Look at the timing. The symbolic factorization takes 1e-4 seconds
>>> and the numeric takes only 10s, out of 542s. MatSolve is taking
>>> 517s. If you have a problem, it is likely there. However, the
>>> MatSolve looks balanced.
>>
>> Something is funky with this. The 28 solves should not take so much
>> longer than the numeric factorization.
>> Perhaps it is worth saving the matrix and reporting this as a
>> performance bug to Sherrie.
>>
>> Barry
>>
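
One way to save the matrix for such a report is to dump it in PETSc binary
format so it can be read back later with MatLoad(). A minimal sketch (the
helper name and file-name argument are illustrative, not part of the thread):

    #include "petscmat.h"

    /* Dump a matrix in PETSc binary format (file name is arbitrary) so it
       can be attached to a report and read back later with MatLoad(). */
    PetscErrorCode DumpMatrix(Mat A,const char filename[])
    {
      MPI_Comm       comm;
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscObjectGetComm((PetscObject)A,&comm);CHKERRQ(ierr);
      ierr = PetscViewerBinaryOpen(comm,filename,FILE_MODE_WRITE,&viewer);CHKERRQ(ierr);
      ierr = MatView(A,viewer);CHKERRQ(ierr);
      ierr = PetscViewerDestroy(viewer);CHKERRQ(ierr);  /* petsc-3.0 calling sequence */
      PetscFunctionReturn(0);
    }
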
>>>
>>>
>>> Matt
>>>
>>> On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <fredrik.bengzon at math.umu.se> wrote:
>>> Hi,
>>> Here is the output from the KSP and EPS objects, and the log
>>> summary.
>>> / Fredrik
>>>
>>>
>>> Reading Triangle/Tetgen mesh
>>> #nodes=19345
>>> #elements=81895
>>> #nodes per element=4
>>> Partitioning mesh with METIS 4.0
>>> Element distribution (rank | #elements)
>>> 0 | 19771
>>> 1 | 20954
>>> 2 | 20611
>>> 3 | 20559
>>> rank 1 has 257 ghost nodes
>>> rank 0 has 127 ghost nodes
>>> rank 2 has 143 ghost nodes
>>> rank 3 has 270 ghost nodes
>>> Calling 3D Navier-Lame Eigenvalue Solver
>>> Assembling stiffness and mass matrix
>>> Solving eigensystem with SLEPc
>>> KSP Object:(st_)
>>> type: preonly
>>> maximum iterations=100000, initial guess is zero
>>> tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>>> left preconditioning
>>> PC Object:(st_)
>>> type: lu
>>> LU: out-of-place factorization
>>> matrix ordering: natural
>>> LU: tolerance for zero pivot 1e-12
>>> EPS Object:
>>> problem type: generalized symmetric eigenvalue problem
>>> method: krylovschur
>>> extraction type: Rayleigh-Ritz
>>> selected portion of the spectrum: largest eigenvalues in magnitude
>>> number of eigenvalues (nev): 4
>>> number of column vectors (ncv): 19
>>> maximum dimension of projected problem (mpd): 19
>>> maximum number of iterations: 6108
>>> tolerance: 1e-05
>>> dimension of user-provided deflation space: 0
>>> IP Object:
>>> orthogonalization method: classical Gram-Schmidt
>>> orthogonalization refinement: if needed (eta: 0.707100)
>>> ST Object:
>>> type: sinvert
>>> shift: 0
>>> Matrices A and B have same nonzero pattern
>>> Associated KSP object
>>> ------------------------------
>>> KSP Object:(st_)
>>> type: preonly
>>> maximum iterations=100000, initial guess is zero
>>> tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>>> left preconditioning
>>> PC Object:(st_)
>>> type: lu
>>> LU: out-of-place factorization
>>> matrix ordering: natural
>>> LU: tolerance for zero pivot 1e-12
>>> LU: factor fill ratio needed 0
>>> Factored matrix follows
>>> Matrix Object:
>>> type=mpiaij, rows=58035, cols=58035
>>> package used to perform factorization: superlu_dist
>>> total: nonzeros=0, allocated nonzeros=116070
>>> SuperLU_DIST run parameters:
>>> Process grid nprow 2 x npcol 2
>>> Equilibrate matrix TRUE
>>> Matrix input mode 1
>>> Replace tiny pivots TRUE
>>> Use iterative refinement FALSE
>>> Processors in row 2 col partition 2
>>> Row permutation LargeDiag
>>> Column permutation PARMETIS
>>> Parallel symbolic factorization TRUE
>>> Repeated factorization SamePattern
>>> linear system matrix = precond matrix:
>>> Matrix Object:
>>> type=mpiaij, rows=58035, cols=58035
>>> total: nonzeros=2223621, allocated nonzeros=2233584
>>> using I-node (on process 0) routines: found 4695 nodes,
>>> limit used is 5
>>> ------------------------------
>>> Number of iterations in the eigensolver: 1
>>> Number of requested eigenvalues: 4
>>> Stopping condition: tol=1e-05, maxit=6108
>>> Number of converged eigenpairs: 8
>>>
>>> Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu
>>> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu
>>> ************************************************************************************************************************
>>> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
>>> ************************************************************************************************************************
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4 processors, by fredrik Fri May 8 17:57:28 2009
>>> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 5.429e+02 1.00001 5.429e+02
>>> Objects: 1.380e+02 1.00000 1.380e+02
>>> Flops: 1.053e+08 1.05695 1.028e+08 4.114e+08
>>> Flops/sec: 1.939e+05 1.05696 1.894e+05 7.577e+05
>>> Memory: 5.927e+07 1.03224 2.339e+08
>>> MPI Messages: 2.880e+02 1.51579 2.535e+02 1.014e+03
>>> MPI Message Lengths: 4.868e+07 1.08170 1.827e+05 1.853e+08
>>> MPI Reductions: 1.122e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of
>>> length N --> 2N flops
>>> and VecAXPY() for complex vectors of
>>> length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>>> Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 5.4292e+02 100.0% 4.1136e+08 100.0% 1.014e
>>> +03 100.0% 1.827e+05 100.0% 3.600e+02 80.2%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with
>>> PetscLogStagePush() and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flops in
>>> this phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
>>> time over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was compiled with a debugging option, #
>>> # To get timing results run config/configure.py #
>>> # using --with-debugging=no, the performance will #
>>> # be generally two or three times faster. #
>>> # #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec)
>>> Flops --- Global --- --- Stage ---
>>> Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg
>>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> STSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0
>>> STApply 28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02
>>> 4.2e+03 2.8e+01 95 30 17 0 6 95 30 17 0 8 0
>>> EPSSetUp 1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 4.6e+01 2 0 0 0 10 2 0 0 0 13 0
>>> EPSSolve 1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02
>>> 4.2e+03 9.7e+01 69 91 35 1 22 69 91 35 1 27 1
>>> IPOrthogonalize 19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02
>>> 4.2e+03 7.6e+01 0 64 22 1 17 0 64 22 1 21 767
>>> IPInnerProduct 153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02
>>> 4.2e+03 3.9e+01 0 53 23 1 9 0 53 23 1 11 700
>>> IPApplyMatrix 39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02
>>> 4.2e+03 0.0e+00 0 42 23 1 0 0 42 23 1 0 687
>>> UpdateVectors 1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 4107
>>> VecDot 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00
>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 204
>>> VecNorm 8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00
>>> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 368
>>> VecScale 27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2629
>>> VecCopy 53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecAXPY 38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1592
>>> VecMAXPY 38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 11 0 0 0 0 11 0 0 0 1701
>>> VecAssemblyBegin 5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01
>>> 2.1e+04 1.5e+01 0 0 4 0 3 0 0 4 0 4 0
>>> VecAssemblyEnd 5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecScatterBegin 73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02
>>> 8.9e+03 0.0e+00 0 0 45 2 0 0 0 45 2 0 0
>>> VecScatterEnd 73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecReduceArith 76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 12 0 0 0 0 12 0 0 0 1573
>>> VecReduceComm 38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 3.8e+01 0 0 0 0 8 0 0 0 0 11 0
>>> VecNormalize 8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00
>>> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 511
>>> MatMult 67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02
>>> 4.2e+03 0.0e+00 0 71 40 1 0 0 71 40 1 0 710
>>> MatSolve 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0
>>> MatLUFactorSym 1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatLUFactorNum 1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> MatAssemblyBegin 9 1.0 3.3842e-0146.7 0.00e+00 0.0 5.4e+01
>>> 6.0e+04 8.0e+00 0 0 5 2 2 0 0 5 2 2 0
>>> MatAssemblyEnd 9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01
>>> 9.4e+02 3.1e+01 0 0 4 0 7 0 0 4 0 9 0
>>> MatGetRow 5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatGetSubMatrice 5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02
>>> 1.1e+06 2.5e+01 0 0 15 88 6 0 0 15 88 7 0
>>> MatZeroEntries 2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatView 2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0
>>> KSPSetup 1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPSolve 28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 2.8e+01 95 0 0 0 6 95 0 0 0 8 0
>>> PCSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0
>>> PCApply 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory
>>> Descendants' Mem.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Spectral Transform 1 1 536 0
>>> Eigenproblem Solver 1 1 824 0
>>> Inner product 1 1 428 0
>>> Index Set 38 38 1796776 0
>>> IS L to G Mapping 1 1 58700 0
>>> Vec 65 65 5458584 0
>>> Vec Scatter 9 9 7092 0
>>> Application Order 1 1 155232 0
>>> Matrix 17 16 17715680 0
>>> Krylov Solver 1 1 832 0
>>> Preconditioner 1 1 744 0
>>> Viewer 2 2 1088 0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 1.90735e-07
>>> Average time for MPI_Barrier(): 5.9557e-05
>>> Average time for zero size MPI_Send(): 2.97427e-05
>>> #PETSc Option Table entries:
>>> -log_summary
>>> -mat_superlu_dist_parsymbfact
>>> #End o PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8
>>> Configure run at: Wed May 6 15:14:39 2009
>>> Configure options: --download-superlu_dist=1 --download-parmetis=1
>>> --with-mpi-dir=/usr/lib/mpich --with-shared=0
>>> -----------------------------------------
>>> Libraries compiled on Wed May 6 15:14:49 CEST 2009 on medusa1
>>> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux
>>> Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5
>>> Using PETSc arch: linux-gnu-c-debug
>>> -----------------------------------------
>>> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
>>> Using Fortran compiler: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
>>> -----------------------------------------
>>> Using include paths: -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/usr/lib/mpich/include
>>> ------------------------------------------
>>> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
>>> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
>>> Using libraries: -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
>>> ------------------------------------------
>>>
>>> real 9m10.616s
>>> user 0m23.921s
>>> sys 0m6.944s
>>>
>>> Satish Balay wrote:
>>> Just a note about scalability: it's a function of the hardware as
>>> well. For proper scalability studies you'll need a true distributed
>>> system with a fast network [not SMP nodes].
>>>
>>> Satish
>>>
>>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>>
>>>
>>> Hong,
>>> Thank you for the suggestions, but I have looked at the EPS and KSP
>>> objects and I cannot find anything wrong. The problem is that it
>>> takes longer to solve with 4 CPUs than with 2, so the scalability
>>> seems to be absent when using superlu_dist. I have stored my mass
>>> and stiffness matrices in the mpiaij format and just passed them on
>>> to SLEPc. When using the PETSc iterative Krylov solvers I see 100%
>>> workload on all processors, but when I switch to superlu_dist only
>>> two CPUs seem to do the whole work of the LU factoring. I don't
>>> want to use a Krylov solver, though, since it might cause SLEPc not
>>> to converge.
>>> Regards,
>>> Fredrik
>>>
>>> Hong Zhang wrote:
>>>
>>> Run your code with '-eps_view -ksp_view' to check which methods are
>>> used, and with '-log_summary' to see which operations dominate the
>>> computation.
>>>
>>> You can turn on parallel symbolic factorization with
>>> '-mat_superlu_dist_parsymbfact'.
>>>
>>> Unless you use a large number of processors, symbolic factorization
>>> takes negligible execution time. The numeric factorization usually
>>> dominates.
>>>
>>> Hong
>>>
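
Put together, a run with all of the flags Hong mentions might look like the
line below (the launcher, process count, and executable name are only
examples):

    mpirun -np 4 ./a.out -eps_view -ksp_view -log_summary -mat_superlu_dist_parsymbfact
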
>>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>>
>>>
>>> Hi PETSc team,
>>> Sorry for posting questions not really concerning the PETSc core,
>>> but when I run superlu_dist from within SLEPc I notice that the
>>> load balance is poor. It is just fine during assembly (I use METIS
>>> to partition my finite element mesh), but it changes dramatically
>>> when calling the SLEPc solver. I use superlu_dist as the solver for
>>> the eigenvalue iteration. My question is: can this have something
>>> to do with the fact that the option 'Parallel symbolic
>>> factorization' is set to false? If so, can I change the options to
>>> superlu_dist using MatSetOption, for instance? Also, does this mean
>>> that superlu_dist is not using ParMETIS to reorder the matrix?
>>> Best Regards,
>>> Fredrik Bengzon
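
As the later replies make clear, the SuperLU_DIST settings shown in the PC
view are driven through the PETSc options database (on the command line or
set programmatically), not through MatSetOption(); running with -help lists
the exact option names the interface accepts. A hedged sketch of setting them
from code; the second flag value is only an example:

    /* Push SuperLU_DIST options into the options database early, e.g. right
       after SlepcInitialize() and before the preconditioner is set up. */
    PetscErrorCode ierr;
    ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact",PETSC_NULL);CHKERRQ(ierr);
    ierr = PetscOptionsSetValue("-mat_superlu_dist_colperm","PARMETIS");CHKERRQ(ierr);
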
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to
>>> which their experiments lead.
>>> -- Norbert Wiener
>>
>>
>