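Look at the timing. The symbolic factorization takes about 3.6e-4 seconds and the numeric factorization takes<br>only about 10s, out of 543s. MatSolve is taking about 518s over 28 calls, which is 95% of the run time. If you have a problem, it is likely there.<br>However, the MatSolve looks balanced (the max/min time ratio across processes is 1.0).<br>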
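<br>
As a rough check on those numbers, here is a small Python sketch that recomputes the per-call cost and the share of run time for the three factorization and solve events. The values are copied by hand from the -log_summary output quoted below; the helper name summarize is only illustrative and is not part of PETSc or SLEPc.<br>

```python
# Timings copied from the -log_summary output quoted below
# (maximum over the 4 processes, in seconds).
total_time = 5.429e+02

# event name -> (number of calls, total time in seconds)
events = {
    "MatLUFactorSym": (1, 3.6097e-04),
    "MatLUFactorNum": (1, 1.0464e+01),
    "MatSolve": (28, 5.1757e+02),
}


def summarize(events, total_time):
    """Return {event: (seconds per call, percent of total run time)}."""
    return {
        name: (seconds / count, 100.0 * seconds / total_time)
        for name, (count, seconds) in events.items()
    }


summary = summarize(events, total_time)
for name, (per_call, percent) in summary.items():
    print(f"{name:15s} {per_call:12.4e} s/call {percent:6.2f} % of run")
```

Each of the 28 triangular solves costs more than the entire numeric factorization, which is the opposite of what one would normally expect from a direct solver.<br>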
<br> Matt<br><br><div class="gmail_quote">On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <span dir="ltr"><<a href="mailto:fredrik.bengzon@math.umu.se">fredrik.bengzon@math.umu.se</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi,<br>
Here is the output from the KSP and EPS objects, and the log summary.<br>
/ Fredrik<br>
<br>
<br>
Reading Triangle/Tetgen mesh<br>
#nodes=19345<br>
#elements=81895<br>
#nodes per element=4<br>
Partitioning mesh with METIS 4.0<br>
Element distribution (rank | #elements)<br>
0 | 19771<br>
1 | 20954<br>
2 | 20611<br>
3 | 20559<br>
rank 1 has 257 ghost nodes<br>
rank 0 has 127 ghost nodes<br>
rank 2 has 143 ghost nodes<br>
rank 3 has 270 ghost nodes<br>
Calling 3D Navier-Lame Eigenvalue Solver<br>
Assembling stiffness and mass matrix<br>
Solving eigensystem with SLEPc<br>
KSP Object:(st_)<br>
type: preonly<br>
maximum iterations=100000, initial guess is zero<br>
tolerances: relative=1e-08, absolute=1e-50, divergence=10000<br>
left preconditioning<br>
PC Object:(st_)<br>
type: lu<br>
LU: out-of-place factorization<br>
matrix ordering: natural<br>
LU: tolerance for zero pivot 1e-12<br>
EPS Object:<br>
problem type: generalized symmetric eigenvalue problem<br>
method: krylovschur<br>
extraction type: Rayleigh-Ritz<br>
selected portion of the spectrum: largest eigenvalues in magnitude<br>
number of eigenvalues (nev): 4<br>
number of column vectors (ncv): 19<br>
maximum dimension of projected problem (mpd): 19<br>
maximum number of iterations: 6108<br>
tolerance: 1e-05<br>
dimension of user-provided deflation space: 0<br>
IP Object:<br>
orthogonalization method: classical Gram-Schmidt<br>
orthogonalization refinement: if needed (eta: 0.707100)<br>
ST Object:<br>
type: sinvert<br>
shift: 0<br>
Matrices A and B have same nonzero pattern<br>
Associated KSP object<br>
------------------------------<br>
KSP Object:(st_)<br>
type: preonly<br>
maximum iterations=100000, initial guess is zero<br>
tolerances: relative=1e-08, absolute=1e-50, divergence=10000<br>
left preconditioning<br>
PC Object:(st_)<br>
type: lu<br>
LU: out-of-place factorization<br>
matrix ordering: natural<br>
LU: tolerance for zero pivot 1e-12<br>
LU: factor fill ratio needed 0<br>
Factored matrix follows<br>
Matrix Object:<br>
type=mpiaij, rows=58035, cols=58035<br>
package used to perform factorization: superlu_dist<br>
total: nonzeros=0, allocated nonzeros=116070<br>
SuperLU_DIST run parameters:<br>
Process grid nprow 2 x npcol 2<br>
Equilibrate matrix TRUE<br>
Matrix input mode 1<br>
Replace tiny pivots TRUE<br>
Use iterative refinement FALSE<br>
Processors in row 2 col partition 2<br>
Row permutation LargeDiag<br>
Column permutation PARMETIS<br>
Parallel symbolic factorization TRUE<br>
Repeated factorization SamePattern<br>
linear system matrix = precond matrix:<br>
Matrix Object:<br>
type=mpiaij, rows=58035, cols=58035<br>
total: nonzeros=2223621, allocated nonzeros=2233584<br>
using I-node (on process 0) routines: found 4695 nodes, limit used is 5<br>
------------------------------<br>
Number of iterations in the eigensolver: 1<br>
Number of requested eigenvalues: 4<br>
Stopping condition: tol=1e-05, maxit=6108<br>
Number of converged eigenpairs: 8<br>
<br>
Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu<br>
Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu<br>
************************************************************************************************************************<br>
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***<br>
************************************************************************************************************************<br>
<br>
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------<br>
<br>
/home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4 processors, by fredrik Fri May 8 17:57:28 2009<br>
Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009<br>
<br>
Max Max/Min Avg Total<br>
Time (sec): 5.429e+02 1.00001 5.429e+02<br>
Objects: 1.380e+02 1.00000 1.380e+02<br>
Flops: 1.053e+08 1.05695 1.028e+08 4.114e+08<br>
Flops/sec: 1.939e+05 1.05696 1.894e+05 7.577e+05<br>
Memory: 5.927e+07 1.03224 2.339e+08<br>
MPI Messages: 2.880e+02 1.51579 2.535e+02 1.014e+03<br>
MPI Message Lengths: 4.868e+07 1.08170 1.827e+05 1.853e+08<br>
MPI Reductions: 1.122e+02 1.00000<br>
<br>
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br>
e.g., VecAXPY() for real vectors of length N --> 2N flops<br>
and VecAXPY() for complex vectors of length N --> 8N flops<br>
<br>
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --<br>
Avg %Total Avg %Total counts %Total Avg %Total counts %Total<br>
0: Main Stage: 5.4292e+02 100.0% 4.1136e+08 100.0% 1.014e+03 100.0% 1.827e+05 100.0% 3.600e+02 80.2%<br>
<br>
------------------------------------------------------------------------------------------------------------------------<br>
See the 'Profiling' chapter of the users' manual for details on interpreting output.<br>
Phase summary info:<br>
Count: number of times phase was executed<br>
Time and Flops: Max - maximum over all processors<br>
Ratio - ratio of maximum to minimum over all processors<br>
Mess: number of messages sent<br>
Avg. len: average message length<br>
Reduct: number of global reductions<br>
Global: entire computation<br>
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br>
%T - percent time in this phase %F - percent flops in this phase<br>
%M - percent messages in this phase %L - percent message lengths in this phase<br>
%R - percent reductions in this phase<br>
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
<br>
##########################################################<br>
# #<br>
# WARNING!!! #<br>
# #<br>
# This code was compiled with a debugging option, #<br>
# To get timing results run config/configure.py #<br>
# using --with-debugging=no, the performance will #<br>
# be generally two or three times faster. #<br>
# #<br>
##########################################################<br>
<br>
<br>
Event Count Time (sec) Flops --- Global --- --- Stage --- Total<br>
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
STSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0<br>
STApply 28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02 4.2e+03 2.8e+01 95 30 17 0 6 95 30 17 0 8 0<br>
EPSSetUp 1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.6e+01 2 0 0 0 10 2 0 0 0 13 0<br>
EPSSolve 1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02 4.2e+03 9.7e+01 69 91 35 1 22 69 91 35 1 27 1<br>
IPOrthogonalize 19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02 4.2e+03 7.6e+01 0 64 22 1 17 0 64 22 1 21 767<br>
IPInnerProduct 153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02 4.2e+03 3.9e+01 0 53 23 1 9 0 53 23 1 11 700<br>
IPApplyMatrix 39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02 4.2e+03 0.0e+00 0 42 23 1 0 0 42 23 1 0 687<br>
UpdateVectors 1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 4107<br>
VecDot 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 204<br>
VecNorm 8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 368<br>
VecScale 27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2629<br>
VecCopy 53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecSet 77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecAXPY 38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1592<br>
VecMAXPY 38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 11 0 0 0 0 11 0 0 0 1701<br>
VecAssemblyBegin 5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01 2.1e+04 1.5e+01 0 0 4 0 3 0 0 4 0 4 0<br>
VecAssemblyEnd 5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecScatterBegin 73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02 8.9e+03 0.0e+00 0 0 45 2 0 0 0 45 2 0 0<br>
VecScatterEnd 73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecReduceArith 76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 12 0 0 0 0 12 0 0 0 1573<br>
VecReduceComm 38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.8e+01 0 0 0 0 8 0 0 0 0 11 0<br>
VecNormalize 8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 511<br>
MatMult 67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02 4.2e+03 0.0e+00 0 71 40 1 0 0 71 40 1 0 710<br>
MatSolve 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0<br>
MatLUFactorSym 1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatLUFactorNum 1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0<br>
MatAssemblyBegin 9 1.0 3.3842e-01 46.7 0.00e+00 0.0 5.4e+01 6.0e+04 8.0e+00 0 0 5 2 2 0 0 5 2 2 0<br>
MatAssemblyEnd 9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01 9.4e+02 3.1e+01 0 0 4 0 7 0 0 4 0 9 0<br>
MatGetRow 5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatGetSubMatrice 5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02 1.1e+06 2.5e+01 0 0 15 88 6 0 0 15 88 7 0<br>
MatZeroEntries 2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatView 2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0<br>
KSPSetup 1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
KSPSolve 28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01 95 0 0 0 6 95 0 0 0 8 0<br>
PCSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0<br>
PCApply 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
Memory usage is given in bytes:<br>
<br>
Object Type Creations Destructions Memory Descendants' Mem.<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
Spectral Transform 1 1 536 0<br>
Eigenproblem Solver 1 1 824 0<br>
Inner product 1 1 428 0<br>
Index Set 38 38 1796776 0<br>
IS L to G Mapping 1 1 58700 0<br>
Vec 65 65 5458584 0<br>
Vec Scatter 9 9 7092 0<br>
Application Order 1 1 155232 0<br>
Matrix 17 16 17715680 0<br>
Krylov Solver 1 1 832 0<br>
Preconditioner 1 1 744 0<br>
Viewer 2 2 1088 0<br>
========================================================================================================================<br>
Average time to get PetscTime(): 1.90735e-07<br>
Average time for MPI_Barrier(): 5.9557e-05<br>
Average time for zero size MPI_Send(): 2.97427e-05<br>
#PETSc Option Table entries:<br>
-log_summary<br>
-mat_superlu_dist_parsymbfact<br>
#End o PETSc Option Table entries<br>
Compiled without FORTRAN kernels<br>
Compiled with full precision matrices (default)<br>
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8<br>
Configure run at: Wed May 6 15:14:39 2009<br>
Configure options: --download-superlu_dist=1 --download-parmetis=1 --with-mpi-dir=/usr/lib/mpich --with-shared=0<br>
-----------------------------------------<br>
Libraries compiled on Wed May 6 15:14:49 CEST 2009 on medusa1<br>
Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux<br>
Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5<br>
Using PETSc arch: linux-gnu-c-debug<br>
-----------------------------------------<br>
Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3 Using Fortran compiler: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g -----------------------------------------<br>
Using include paths: -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/usr/lib/mpich/include ------------------------------------------<br>
Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3<br>
Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g Using libraries: -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich -lpthread -lrt -lgcc_s -ldl<br>
------------------------------------------<br>
<br>
real 9m10.616s<br>
user 0m23.921s<br>
sys 0m6.944s<br>
<br>
Satish Balay wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Just a note about scalability: it's a function of the hardware as<br>
well. For proper scalability studies, you'll need a true distributed<br>
system with a fast network (not SMP nodes).<br>
<br>
Satish<br>
<br>
On Fri, 8 May 2009, Fredrik Bengzon wrote:<br>
<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hong,<br>
Thank you for the suggestions, but I have looked at the EPS and KSP objects<br>
and I cannot find anything wrong. The problem is that it takes longer to<br>
solve with 4 CPUs than with 2, so the scalability seems to be absent when using<br>
superlu_dist. I have stored my mass and stiffness matrices in the mpiaij format<br>
and just passed them on to SLEPc. When using the PETSc iterative Krylov<br>
solvers I see 100% workload on all processors, but when I switch to<br>
superlu_dist only two CPUs seem to do the whole work of the LU factorization. I don't<br>
want to use the Krylov solver, though, since it might cause SLEPc not to<br>
converge.<br>
Regards,<br>
Fredrik<br>
<br>
Hong Zhang wrote:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Run your code with '-eps_view -ksp_view' for checking<br>
which methods are used<br>
and '-log_summary' to see which operations dominate<br>
the computation.<br>
<br>
You can turn on parallel symbolic factorization<br>
with '-mat_superlu_dist_parsymbfact'.<br>
<br>
Unless you use a large number of processors, symbolic factorization<br>
takes negligible execution time. The numeric<br>
factorization usually dominates.<br>
<br>
Hong<br>
<br>
On Fri, 8 May 2009, Fredrik Bengzon wrote:<br>
<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi PETSc team,<br>
Sorry for posting questions not really concerning the PETSc core, but when<br>
I run superlu_dist from within SLEPc I notice that the load balance is<br>
poor. It is just fine during assembly (I use METIS to partition my finite<br>
element mesh), but it changes dramatically when calling the SLEPc solver. I<br>
use superlu_dist as the solver for the eigenvalue iteration. My question is:<br>
can this have something to do with the fact that the option 'Parallel<br>
symbolic factorization' is set to false? If so, can I change the options<br>
to superlu_dist using MatSetOption, for instance? Also, does this mean that<br>
superlu_dist is not using ParMETIS to reorder the matrix?<br>
Best Regards,<br>
Fredrik Bengzon<br>
<br>
<br>
<br>
</blockquote></blockquote>
<br>
</blockquote>
<br>
<br>
<br>
</blockquote>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener<br>