[petsc-users] zero pattern of result of matmatmult
Frederik Treue
frtr at fysik.dtu.dk
Tue Sep 17 08:57:22 CDT 2013
On Tue, 2013-09-17 at 08:15 -0500, Barry Smith wrote:
> When the linear system is fairly easy it is possible for the construction of the matrix to take a significant portion of the total time. Run with -log_summary and send the output, from this we may be able to make suggestions as to what further improvements can be made.
Attached.
>
> Is any part of the matrix constant? If so you can use MatStoreValues() and MatRetrieveValues() so that part does not have to be recomputed each time.
Not in terms of entries in the matrix. Mathematically, part of the
operator is constant, but all entries have changing parts added to them.
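Just so I'm sure I understand the pattern you mean - something like the
sketch below, where AssembleConstantPart() and AddTimeDependentPart() are
made-up placeholders for my own assembly code?

#include <petscmat.h>

/* Hypothetical placeholders for the application's own assembly routines. */
extern PetscErrorCode AssembleConstantPart(Mat);
extern PetscErrorCode AddTimeDependentPart(Mat,PetscReal);

/* Sketch of the MatStoreValues()/MatRetrieveValues() pattern: assemble the
   constant part once, snapshot it, then each step restore the snapshot and
   ADD_VALUES the time-dependent terms into the existing nonzero pattern. */
PetscErrorCode BuildOperator(Mat A,PetscReal t,PetscBool firstStep)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (firstStep) {
    ierr = AssembleConstantPart(A);CHKERRQ(ierr);  /* must end with MatAssemblyBegin/End */
    ierr = MatSetOption(A,MAT_NEW_NONZERO_LOCATIONS,PETSC_FALSE);CHKERRQ(ierr);
    ierr = MatStoreValues(A);CHKERRQ(ierr);        /* keep a copy of the constant entries */
  } else {
    ierr = MatRetrieveValues(A);CHKERRQ(ierr);     /* reset to the stored constant entries */
  }
  ierr = AddTimeDependentPart(A,t);CHKERRQ(ierr);  /* insert with ADD_VALUES only */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}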
>
> Are you calling MatSetValuesStencil() for each single entry or once per row or block row? Row or block row will be faster than for each entry.
Once per entry - that I should definitely improve. What's the logic
behind rows or block rows being faster? Is it simply that the fewer
calls to MatSetValuesStencil() I make, the better?
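For my own reference, here is roughly what I take "once per row" to mean -
a sketch assuming a 2D DA with one DOF and a 9-point box stencil, where
ComputeRowValues() is a made-up placeholder for whatever fills the
coefficients. Is that the kind of thing you mean?

#include <petscdmda.h>

/* Hypothetical helper: fills v[k] with the coefficient for column cols[k]
   in the row belonging to grid point (i,j). */
extern void ComputeRowValues(PetscInt i,PetscInt j,PetscInt nc,
                             const MatStencil cols[],PetscScalar v[]);

/* Sketch: one MatSetValuesStencil() call per row instead of nine, for a 2D
   DMDA with dof = 1 and a 9-point box stencil.  A is assumed to come from
   DMCreateMatrix() so the stencil mapping is attached.  Columns outside the
   physical domain are skipped; rows that need special boundary treatment
   (e.g. Dirichlet) would still be handled separately. */
PetscErrorCode AssembleByRow(DM da,Mat A)
{
  PetscErrorCode ierr;
  PetscInt       i,j,di,dj,nc,xs,ys,xm,ym,Mx,My;
  MatStencil     row,cols[9];
  PetscScalar    v[9];

  PetscFunctionBegin;
  ierr = DMDAGetInfo(da,NULL,&Mx,&My,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL);CHKERRQ(ierr);
  ierr = DMDAGetCorners(da,&xs,&ys,NULL,&xm,&ym,NULL);CHKERRQ(ierr);
  for (j=ys; j<ys+ym; j++) {
    for (i=xs; i<xs+xm; i++) {
      row.i = i; row.j = j;
      nc    = 0;
      for (dj=-1; dj<=1; dj++) {
        for (di=-1; di<=1; di++) {
          if (i+di < 0 || i+di >= Mx || j+dj < 0 || j+dj >= My) continue;
          cols[nc].i = i+di; cols[nc].j = j+dj; nc++;
        }
      }
      ComputeRowValues(i,j,nc,cols,v);
      ierr = MatSetValuesStencil(A,1,&row,nc,cols,v,INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}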
>
> Since you are using MatSetValuesStencil() I assume you are using a DA?
I am, with one DOF per point. Are there better alternatives?
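For reference, my setup is roughly the following sketch (3.4-era names;
Mx, My and the boundary types are placeholders for what the run actually
uses):

#include <petscdmda.h>

/* Sketch of the DMDA setup assumed above: one DOF per grid point and a box
   stencil of width 1, which is what a 9-point stencil needs. */
PetscErrorCode CreateGridAndOperator(PetscInt Mx,PetscInt My,DM *da,Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DMDACreate2d(PETSC_COMM_WORLD,DMDA_BOUNDARY_NONE,DMDA_BOUNDARY_NONE,
                      DMDA_STENCIL_BOX,Mx,My,PETSC_DECIDE,PETSC_DECIDE,
                      1,1,NULL,NULL,da);CHKERRQ(ierr);
  ierr = DMCreateMatrix(*da,MATAIJ,A);CHKERRQ(ierr); /* preallocated to the box-stencil pattern */
  PetscFunctionReturn(0);
}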
PS. FYI: my stencil is essentially a standard 2D 9-point box stencil,
but the values differ from row to row (due, among other things, to the
geometric tensor).
Thanks for the help...
/Frederik
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./petsclapl on a arch-linux2-c-debug named frtr-laptop with 2 processors, by frtr Tue Sep 17 14:24:35 2013
Using Petsc Release Version 3.4.1, Jun, 10, 2013
Max Max/Min Avg Total
Time (sec): 4.045e+02 1.00001 4.045e+02
Objects: 1.336e+03 1.00000 1.336e+03
Flops: 8.196e+08 1.00000 8.196e+08 1.639e+09
Flops/sec: 2.026e+06 1.00001 2.026e+06 4.052e+06
Memory: 2.755e+07 1.00000 5.510e+07
MPI Messages: 3.902e+03 1.00000 3.902e+03 7.804e+03
MPI Message Lengths: 6.317e+06 1.00000 1.619e+03 1.263e+07
MPI Reductions: 3.193e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 4.0449e+02 100.0% 1.6392e+09 100.0% 7.804e+03 100.0% 1.619e+03 100.0% 3.193e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %f - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
ThreadCommRunKer 1 1.0 9.8039e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ThreadCommBarrie 1 1.0 6.7013e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 250 1.0 9.3867e-01 1.1 4.59e+06 1.0 0.0e+00 0.0e+00 2.5e+02 0 1 0 0 1 0 1 0 0 1 10
VecTDot 5782 1.0 5.8286e+01 1.0 5.90e+07 1.0 0.0e+00 0.0e+00 5.8e+03 14 7 0 0 18 14 7 0 0 18 2
VecNorm 3341 1.0 1.1062e+01 1.1 3.41e+07 1.0 0.0e+00 0.0e+00 3.3e+03 3 4 0 0 10 3 4 0 0 10 6
VecScale 452 1.0 3.6935e+00 1.0 2.31e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 1
VecCopy 308 1.0 1.0530e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 3756 1.0 7.7467e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 6885 1.0 4.3978e+01 1.0 6.97e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 9 0 0 0 11 9 0 0 0 3
VecAYPX 2891 1.0 1.2989e+00 1.0 2.92e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 45
VecMAXPY 350 1.0 1.7310e-01 1.2 7.14e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 82
VecAssemblyBegin 36 1.0 2.2704e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 36 1.0 5.6790e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecPointwiseMult 491 1.0 1.9725e-01 1.1 2.50e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 25
VecScatterBegin 3900 1.0 1.3175e+00 1.0 0.00e+00 0.0 7.6e+03 1.6e+03 0.0e+00 0 0 97 99 0 0 0 97 99 0 0
VecScatterEnd 3900 1.0 3.3521e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 350 1.0 3.4656e+00 1.0 5.36e+06 1.0 0.0e+00 0.0e+00 3.5e+02 1 1 0 0 1 1 1 0 0 1 3
MatMult 3600 1.0 2.0704e+01 1.0 3.10e+08 1.0 7.2e+03 1.6e+03 0.0e+00 5 38 92 93 0 5 38 92 93 0 30
MatSolve 3341 1.0 1.2289e+01 1.0 2.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 3 35 0 0 0 3 35 0 0 0 46
MatLUFactorNum 56 1.0 3.4288e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 5
MatILUFactorSym 56 1.0 9.8914e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.6e+01 0 0 0 0 0 0 0 0 0 0 0
MatCopy 212 1.0 1.9453e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.2e+01 5 0 0 0 0 5 0 0 0 0 0
MatScale 176 1.0 5.4831e-01 1.0 8.07e+06 1.0 0.0e+00 0.0e+00 2.2e+01 0 1 0 0 0 0 1 0 0 0 29
MatAssemblyBegin 85 1.0 1.5119e+00 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.7e+02 0 0 0 0 1 0 0 0 0 1 0
MatAssemblyEnd 85 1.0 1.5816e+00 1.0 0.00e+00 0.0 1.9e+02 4.9e+02 1.1e+03 0 0 2 1 4 0 0 2 1 4 0
MatGetRow 285600 1.0 2.5095e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 6 0 0 0 0 0
MatGetRowIJ 56 1.0 4.7718e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 56 1.0 6.4522e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 18 1.0 1.8103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 174 1.0 7.3996e+01 1.0 0.00e+00 0.0 8.8e+01 4.1e+02 1.0e+03 18 0 1 0 3 18 0 1 0 3 0
MatMatMult 1 1.0 9.7952e-01 1.0 8.17e+05 1.0 1.2e+01 6.5e+03 6.2e+01 0 0 0 1 0 0 0 0 1 0 2
MatMatMultSym 1 1.0 8.2768e-01 1.0 0.00e+00 0.0 1.0e+01 4.9e+03 5.6e+01 0 0 0 0 0 0 0 0 0 0 0
MatMatMultNum 1 1.0 1.5126e-01 1.0 8.17e+05 1.0 2.0e+00 1.5e+04 6.0e+00 0 0 0 0 0 0 0 0 0 0 11
MatGetLocalMat 2 1.0 3.4862e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 2 1.0 1.4985e-02 1.0 0.00e+00 0.0 8.0e+00 9.3e+03 6.0e+00 0 0 0 1 0 0 0 0 1 0 0
KSPGMRESOrthog 250 1.0 1.2155e+00 1.1 9.18e+06 1.0 0.0e+00 0.0e+00 7.0e+02 0 1 0 0 2 0 1 0 0 2 15
KSPSetUp 112 1.0 6.0548e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.4e+02 0 0 0 0 2 0 0 0 0 2 0
KSPSolve 150 1.0 2.3976e+02 1.0 7.62e+08 1.0 6.4e+03 1.6e+03 2.7e+04 59 93 82 82 85 59 93 82 82 85 6
PCSetUp 112 1.0 5.5250e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 7.8e+02 1 1 0 0 2 1 1 0 0 2 3
PCSetUpOnBlocks 150 1.0 5.5561e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 3.4e+02 1 1 0 0 1 1 1 0 0 1 3
PCApply 3341 1.0 4.7031e+01 1.0 2.84e+08 1.0 0.0e+00 0.0e+00 6.7e+03 12 35 0 0 21 12 35 0 0 21 12
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 561 547 14123724 0
Vector Scatter 55 50 31272 0
Matrix 199 216 68839696 0
Distributed Mesh 4 2 5572 0
Bipartite Graph 8 4 1808 0
Index Set 280 280 1354392 0
IS L to G Mapping 4 2 1076 0
Krylov Solver 112 112 181256 0
Preconditioner 112 112 60704 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 6.2055e-06
Average time for MPI_Barrier(): 0.000123
Average time for zero size MPI_Send(): 5e-05
#PETSc Option Table entries:
-DIFFUSION 0.1
-DT 0.01
-ET 0.5
-G 2.0
-LOGGAUSSIAN 1,5,15,10.0,2.0
-NODRAW
-OUTSTEP 1
-SQ 30.0,30.0
-SX 100,100,10
-SY 100,100,10
-T 19
-TIMEDEP /home/frtr/tstruns/param.ini
-VISCOSITY 0.1
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Mon Jul 8 14:50:33 2013
Configure options: --prefix=/home/frtr --download-fftw --download-blacs --download-blas --download-lapack --with-mpi-dir=/home/frtr --download-scalapack --with-debugging=1 COPTFLAGS="-O3 -march=native"
-----------------------------------------
Libraries compiled on Mon Jul 8 14:50:33 2013 on frtr-laptop
Machine characteristics: Linux-3.2.0-49-generic-pae-i686-with-Ubuntu-12.04-precise
Using PETSc directory: /home/frtr/work_lib/petsc-3.4.1
Using PETSc arch: arch-linux2-c-debug
-----------------------------------------
Using C compiler: /home/frtr/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O3 -march=native ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/frtr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/include -I/home/frtr/work_lib/petsc-3.4.1/include -I/home/frtr/work_lib/petsc-3.4.1/include -I/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/include -I/home/frtr/include
-----------------------------------------
Using C linker: /home/frtr/bin/mpicc
Using Fortran linker: /home/frtr/bin/mpif90
Using libraries: -Wl,-rpath,/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -L/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -lpetsc -Wl,-rpath,/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -L/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -lscalapack -llapack -lblas -lX11 -lpthread -lfftw3_mpi -lfftw3 -lm -Wl,-rpath,/home/frtr/lib -L/home/frtr/lib -Wl,-rpath,/usr/lib/gcc/i686-linux-gnu/4.6 -L/usr/lib/gcc/i686-linux-gnu/4.6 -Wl,-rpath,/usr/lib/i386-linux-gnu -L/usr/lib/i386-linux-gnu -Wl,-rpath,/lib/i386-linux-gnu -L/lib/i386-linux-gnu -lmpichf90 -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lmpl -lrt -lpthread -lgcc_s -ldl
-----------------------------------------