[petsc-users] LU Performance
Jared Crean
jcrean01 at gmail.com
Fri Jul 5 09:26:58 CDT 2019
This is in reply to both David's and Barry's emails.
I am using the Umfpack that Petsc built (--download-suitesparse=yes
was passed to configure), so all the compiler flags and Blas/Lapack
libraries are the same. I used OpenBlas for Blas and Lapack, with
multi-threading disabled. When calling Umfpack directly, the
factorization takes about 4 seconds, compared to 135 seconds spent in
MatLUFactorNum when using Umfpack via Petsc.
I added a call to umfpack_di_report_control() (which prints the
Umfpack parameters) to my code, and also added -mat_umfpack_prl 2 to the
Petsc options, which should cause Petsc to call the same function just
before doing the symbolic factorization (umfpack.c line 245 in Petsc
3.7.6). The output is attached (also with the -ksp_view option). My
code did print the parameters, but Petsc did not, which makes me think
MatLUFactorSymbolic_UMFPACK never got called. For reference, here is
how I am invoking the program:
./test -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type umfpack
-log_view -ksp_view -mat_umfpack_prl 2 > fout_umfpacklu
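For illustration, here is a minimal sketch of the kind of direct call I mean (not the exact code in test.c, just the stock UMFPACK C API with the print level raised):

#include <umfpack.h>

int main(void)
{
    double Control[UMFPACK_CONTROL];

    /* load UMFPACK's defaults, then raise the print level so that
       umfpack_di_report_control() actually prints the parameter table */
    umfpack_di_defaults(Control);
    Control[UMFPACK_PRL] = 2;

    /* prints the same parameter listing shown in the attached output;
       with -mat_umfpack_prl 2 I would expect Petsc to produce the same
       listing from MatLUFactorSymbolic_UMFPACK */
    umfpack_di_report_control(Control);
    return 0;
}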
Jared Crean
On 7/5/19 4:02 AM, Smith, Barry F. wrote:
> When you use Umfpack standalone do you use OpenMP threads? When you use Umfpack alone do you use a thread-enabled BLAS/LAPACK? Perhaps OpenBLAS or MKL?
>
> You can run both cases with -ksp_view and it will print more details indicating the solver used.
>
> Do you use the same compiler and same options when compiling PETSc and Umfpack standalone? Is the Umfpack standalone time in the numerical factorization much smaller? Perhaps Umfpack is using a much better ordering than when used with PETSc (perhaps the default orderings are different).
>
> Does Umfpack have a routine that triggers output of the parameters etc. it is using? If you can trigger it you might see differences between standalone and not.
>
> Barry
>
>
>> On Jul 4, 2019, at 4:05 PM, Jared Crean via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>
>> Hello,
>>
>> I am getting very bad performance from the Umfpack LU solver when I use it via Petsc compared to calling Umfpack directly. It takes about 5.5 seconds to factor and solve the matrix with Umfpack, but 140 seconds when I use Petsc with -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type umfpack.
>>
>> I have attached a minimal example (test.c) that reads a matrix from a file, solves with Umfpack, and then solves with Petsc. The matrix data files are not included because they are about 250 megabytes. I also attached the output of the program with -log_view for -pc_factor_mat_solver_type umfpack (fout_umfpacklu) and -pc_factor_mat_solver_type petsc (fout_petsclu). Both results show nearly all of the time is spent in MatLUFactorNum. The times are very similar, so I am wondering if Petsc is really calling Umfpack or if the Petsc LU solver is getting called in both cases.
>>
>>
>> Jared Crean
>>
>> <test_files.tar.gz>
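For completeness, the Petsc side of test.c boils down to the usual KSP setup. A rough sketch (not the exact code; it assumes the SeqAIJ matrix A and vectors b, x are already assembled, and it selects Umfpack in code via PCFactorSetMatSolverPackage() rather than relying only on the command-line option):

#include <petscksp.h>

/* Sketch of the solve path in test.c: factor A with LU via Umfpack and
 * solve A x = b.  A, b, x are assumed to be created and assembled. */
static PetscErrorCode solve_with_umfpack_lu(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);        /* -ksp_type preonly */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);                /* -pc_type lu */
  ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERUMFPACK);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);             /* still honors -ksp_view etc. */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}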
-------------- next part --------------
First UMFPack solve
reading matrix...finished
UMFPACK V5.7.1 (Oct 10, 2014), Control:
Matrix entry defined as: double
Int (generic integer) defined as: int
0: print level: 2
1: dense row parameter: 0.2
"dense" rows have > max (16, (0.2)*16*sqrt(n_col) entries)
2: dense column parameter: 0.2
"dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
3: pivot tolerance: 0.1
4: block size for dense matrix kernels: 32
5: strategy: 0 (auto)
10: ordering: 1 AMD/COLAMD
11: singleton filter: enabled
6: initial allocation ratio: 0.7
7: max iterative refinement steps: 2
13: Q fixed during numerical factorization: 0 (auto)
14: AMD dense row/col parameter: 10
"dense" rows/columns have > max (16, (10)*sqrt(n)) entries
Only used if the AMD ordering is used.
15: diagonal pivot tolerance: 0.001
Only used if diagonal pivoting is attempted.
16: scaling: 1 (divide each row by sum of abs. values in each row)
17: frontal matrix allocation ratio: 0.5
18: drop tolerance: 0
19: AMD and COLAMD aggressive absorption: 1 (yes)
The following options can only be changed at compile-time:
8: BLAS library used: Fortran BLAS. size of BLAS integer: 4
compiled for ANSI C
POSIX C clock_getttime.
computer/operating system: Linux
size of int: 4 SuiteSparse_long: 8 Int: 4 pointer: 8 double: 8 Entry: 8 (in bytes)
symbolic factorization...finished (7.538149e-01 seconds)
numeric factorization...finished (3.967724e+00 seconds)
backsolve...finished (6.808259e-01 seconds)
total elapsed time: 5.456142e+00 seconds
First Petsc solve
reading matrix...finished
preallocating matrix...finished (3.473043e-03 seconds)
copying values...finished (7.642679e-01 seconds)
KSP solve...KSP Object: 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: 1 MPI processes
type: lu
LU: out-of-place factorization
tolerance for zero pivot 2.22045e-14
matrix ordering: nd
factor fill ratio given 5., needed 12.0718
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=455672, cols=455672
package used to perform factorization: petsc
total: nonzeros=182035800, allocated nonzeros=182035800
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 113918 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=455672, cols=455672
total: nonzeros=15079424, allocated nonzeros=15079424
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 113918 nodes, limit used is 5
finished (1.395323e+02 seconds)
Second UMFPack solve
reading matrix...finished
UMFPACK V5.7.1 (Oct 10, 2014), Control:
Matrix entry defined as: double
Int (generic integer) defined as: int
0: print level: 2
1: dense row parameter: 0.2
"dense" rows have > max (16, (0.2)*16*sqrt(n_col) entries)
2: dense column parameter: 0.2
"dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
3: pivot tolerance: 0.1
4: block size for dense matrix kernels: 32
5: strategy: 0 (auto)
10: ordering: 1 AMD/COLAMD
11: singleton filter: enabled
6: initial allocation ratio: 0.7
7: max iterative refinement steps: 2
13: Q fixed during numerical factorization: 0 (auto)
14: AMD dense row/col parameter: 10
"dense" rows/columns have > max (16, (10)*sqrt(n)) entries
Only used if the AMD ordering is used.
15: diagonal pivot tolerance: 0.001
Only used if diagonal pivoting is attempted.
16: scaling: 1 (divide each row by sum of abs. values in each row)
17: frontal matrix allocation ratio: 0.5
18: drop tolerance: 0
19: AMD and COLAMD aggressive absorption: 1 (yes)
The following options can only be changed at compile-time:
8: BLAS library used: Fortran BLAS. size of BLAS integer: 4
compiled for ANSI C
POSIX C clock_getttime.
computer/operating system: Linux
size of int: 4 SuiteSparse_long: 8 Int: 4 pointer: 8 double: 8 Entry: 8 (in bytes)
symbolic factorization...finished (7.242250e-01 seconds)
numeric factorization...finished (3.962819e+00 seconds)
backsolve...finished (6.805091e-01 seconds)
total elapsed time: 5.421341e+00 seconds
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./test on a arch-linux2-c-opt named baduk.scorec.rpi.edu with 1 processor, by creanj Fri Jul 5 10:16:56 2019
Using Petsc Release Version 3.7.6, Apr, 24, 2017
Max Max/Min Avg Total
Time (sec): 1.519e+02 1.00000 1.519e+02
Objects: 1.200e+01 1.00000 1.200e+01
Flops: 3.612e+11 1.00000 3.612e+11 3.612e+11
Flops/sec: 2.378e+09 1.00000 2.378e+09 2.378e+09
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 1.5189e+02 100.0% 3.6120e+11 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatSolve 1 1.0 1.8359e-01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1981
MatLUFactorSym 1 1.0 3.4562e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatLUFactorNum 1 1.0 1.3577e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 89100 0 0 0 89100 0 0 0 2658
MatAssemblyBegin 1 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 1.7591e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.5704e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.2642e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatView 2 1.0 9.4652e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 3 1.0 1.8690e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSetUp 1 1.0 7.1526e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.3953e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 92100 0 0 0 92100 0 0 0 2589
PCSetUp 1 1.0 1.3935e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 92100 0 0 0 92100 0 0 0 2589
PCApply 1 1.0 1.8360e-01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1980
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 2 2 2372679668 0.
Vector 2 2 7293808 0.
Krylov Solver 1 1 1160 0.
Preconditioner 1 1 992 0.
Index Set 5 5 4560600 0.
Viewer 1 0 0 0.
========================================================================================================================
Average time to get PetscTime(): 2.38419e-08
#PETSc Option Table entries:
-ksp_type preonly
-ksp_view
-log_view
-mat_umfpack_prl 2
-pc_factor_mat_solver_type umfpack
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-blas-lib=/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib/libopenblas.so --with-lapack-lib=/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib/libopenblas.so --prefix=/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt --download-suitesparse=yes
-----------------------------------------
Libraries compiled on Fri Jul 5 08:48:14 2019 on baduk.scorec.rpi.edu
Machine characteristics: Linux-3.10.0-957.12.2.el7.x86_64-x86_64-with-redhat-7.6-Maipo
Using PETSc directory: /lore/creanj/build/petsc-3.7.6
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fvisibility=hidden -O3 -march=native -mtune=native ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: mpif90 -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -O3 -march=native -mtune=native ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/include -I/lore/creanj/build/petsc-3.7.6/include -I/lore/creanj/build/petsc-3.7.6/include -I/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/include -I/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/lib -L/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/lib -L/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/lib -Wl,-rpath,/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib -L/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-7.3.0/mpich-3.3-diz4f6ieln25ouifyc7ndtqlfksom6nb/lib -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-7.3.0/mpich-3.3-diz4f6ieln25ouifyc7ndtqlfksom6nb/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib/gcc/x86_64-pc-linux-gnu/7.3.0 -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib/gcc/x86_64-pc-linux-gnu/7.3.0 -Wl,-rpath,/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2/lib -L/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib:/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lopenblas -lX11 -lpthread -lm -lmpifort -lgfortran -lm -lgfortran -lm -lquadmath -lmpicxx -lstdc++ -lm -ldl -lmpi -lgcc_s -ldl
-----------------------------------------