[petsc-users] LU Performance

Jared Crean jcrean01 at gmail.com
Fri Jul 5 09:26:58 CDT 2019


     This is in reply to both David's and Barry's emails.

     I am using the Umfpack that Petsc built (--download-suitesparse=yes 
was passed to configure), so the compiler flags and Blas/Lapack 
libraries are the same in both cases.  I used OpenBlas for Blas and 
Lapack, with multi-threading disabled.  When calling Umfpack directly, 
the factorization takes about 4 seconds, compared to 135 seconds spent 
in MatLUFactorNum when using Umfpack via Petsc.
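
     For reference, the direct Umfpack path in test.c boils down to the 
usual umfpack_di_* call sequence; the sketch below is illustrative (the 
compressed-column arrays and right-hand side are assumed to be loaded 
from the matrix file already), not the exact attached code:

#include <umfpack.h>

/* Minimal sketch of the direct Umfpack path: factor the n x n matrix
   given in compressed-column form (Ap, Ai, Ax) and solve A x = b.
   Returns the status of the numeric factorization. */
static int solve_with_umfpack(int n, const int *Ap, const int *Ai,
                              const double *Ax, const double *b, double *x)
{
  double Control[UMFPACK_CONTROL], Info[UMFPACK_INFO];
  void   *Symbolic, *Numeric;
  int    status;

  umfpack_di_defaults(Control);                      /* default parameters */
  umfpack_di_symbolic(n, n, Ap, Ai, Ax, &Symbolic, Control, Info);
  status = umfpack_di_numeric(Ap, Ai, Ax, Symbolic, &Numeric, Control, Info);
  umfpack_di_solve(UMFPACK_A, Ap, Ai, Ax, x, b, Numeric, Control, Info);

  umfpack_di_free_symbolic(&Symbolic);
  umfpack_di_free_numeric(&Numeric);
  return status;
}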

     I added a call to umfpack_di_report_control() (which prints the 
Umfpack parameters) to my code, and also added -mat_umfpack_prl 2 to the 
Petsc options, which should cause Petsc to call the same function just 
before doing the symbolic factorization (umfpack.c line 245 in Petsc 
3.7.6). The output is attached (also with the -ksp_view option).  My 
code did print the parameters, but Petsc did not, which makes me think 
MatLUFactorSymbolic_UMFPACK never got called.  For reference, here is 
how I am invoking the program (a sketch of the added call follows the 
command line):

./test -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type umfpack 
-log_view -ksp_view -mat_umfpack_prl 2 > fout_umfpacklu
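
The parameter dump at the top of the attached output comes from a 
snippet along these lines (a sketch of what I added to my own code, not 
a Petsc call; UMFPACK_PRL is the Control index for the print level):

#include <umfpack.h>

double Control[UMFPACK_CONTROL];

umfpack_di_defaults(Control);        /* start from Umfpack's defaults              */
Control[UMFPACK_PRL] = 2;            /* print level 2, matching -mat_umfpack_prl 2 */
umfpack_di_report_control(Control);  /* prints the parameter listing shown below   */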

     Jared Crean

On 7/5/19 4:02 AM, Smith, Barry F. wrote:
>     When you use Umfpack standalone do you use OpenMP threads? When you use Umfpack alone do you use thread-enabled BLAS/LAPACK? Perhaps OpenBLAS or MKL?
>
>     You can run both cases with -ksp_view and it will print more details indicating the solver used.
>
>      Do you use the same compiler and same options when compiling PETSc and Umfpack standalone? Is the Umfpack standalone time in the numerical factorization much smaller? Perhaps Umfpack is using a much better ordering than when used with PETSc (perhaps the default orderings are different).
>
>     Does Umfpack have a routine that triggers output of the parameters etc. it is using? If you can trigger it you might see differences between standalone and not.
>
>     Barry
>
>
>> On Jul 4, 2019, at 4:05 PM, Jared Crean via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>
>> Hello,
>>
>>      I am getting very bad performance from the Umfpack LU solver when I use it via Petsc compared to calling Umfpack directly. It takes about 5.5 seconds to factor and solve the matrix with Umfpack, but 140 seconds when I use Petsc with -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type umfpack.
>>
>>      I have attached a minimal example (test.c) that reads a matrix from a file, solves with Umfpack, and then solves with Petsc.  The matrix data files are not included because they are about 250 megabytes.  I also attached the output of the program with -log_view for -pc_factor_mat_solver_type umfpack (fout_umfpacklu) and -pc_factor_mat_solver_type petsc (fout_petsclu).  Both results show nearly all of the time is spent in MatLUFactorNum.  The times are very similar, so I am wondering if Petsc is really calling Umfpack or if the Petsc LU solver is getting called in both cases.
>>
>>
>>      Jared Crean
>>
>> <test_files.tar.gz>
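
For completeness, the Petsc side of test.c is essentially the standard 
KSP sequence sketched below (illustrative names, not the exact attached 
code); the LU / Umfpack choice is made entirely through the run-time 
options shown above:

#include <petscksp.h>

/* Minimal sketch of the Petsc solve path: A is an assembled SeqAIJ Mat
   and b an assembled Vec; -ksp_type preonly -pc_type lu
   -pc_factor_mat_solver_type umfpack are meant to select the direct solver. */
static PetscErrorCode solve_with_petsc(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_SELF, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* reads -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}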


-------------- next part --------------
First UMFPack solve
reading matrix...finished
UMFPACK V5.7.1 (Oct 10, 2014), Control:
    Matrix entry defined as: double
    Int (generic integer) defined as: int

    0: print level: 2
    1: dense row parameter:    0.2
        "dense" rows have    > max (16, (0.2)*16*sqrt(n_col) entries)
    2: dense column parameter: 0.2
        "dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
    3: pivot tolerance: 0.1
    4: block size for dense matrix kernels: 32
    5: strategy: 0 (auto)
    10: ordering: 1 AMD/COLAMD
    11: singleton filter: enabled
    6: initial allocation ratio: 0.7
    7: max iterative refinement steps: 2
    13: Q fixed during numerical factorization: 0 (auto)
    14: AMD dense row/col parameter:    10
       "dense" rows/columns have > max (16, (10)*sqrt(n)) entries
        Only used if the AMD ordering is used.
    15: diagonal pivot tolerance: 0.001
        Only used if diagonal pivoting is attempted.
    16: scaling: 1 (divide each row by sum of abs. values in each row)
    17: frontal matrix allocation ratio: 0.5
    18: drop tolerance: 0
    19: AMD and COLAMD aggressive absorption: 1 (yes)

    The following options can only be changed at compile-time:
    8: BLAS library used:  Fortran BLAS.  size of BLAS integer: 4
    compiled for ANSI C
    POSIX C clock_getttime.
    computer/operating system: Linux
    size of int: 4 SuiteSparse_long: 8 Int: 4 pointer: 8 double: 8 Entry: 8 (in bytes)

symbolic factorization...finished (7.538149e-01 seconds)
numeric factorization...finished (3.967724e+00 seconds)
backsolve...finished (6.808259e-01 seconds)
total elapsed time: 5.456142e+00 seconds


First Petsc solve
reading matrix...finished
preallocating matrix...finished (3.473043e-03 seconds)
copying values...finished (7.642679e-01 seconds)
KSP solve...KSP Object: 1 MPI processes
  type: preonly
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using NONE norm type for convergence test
PC Object: 1 MPI processes
  type: lu
    LU: out-of-place factorization
    tolerance for zero pivot 2.22045e-14
    matrix ordering: nd
    factor fill ratio given 5., needed 12.0718
      Factored matrix follows:
        Mat Object:         1 MPI processes
          type: seqaij
          rows=455672, cols=455672
          package used to perform factorization: petsc
          total: nonzeros=182035800, allocated nonzeros=182035800
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 113918 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqaij
    rows=455672, cols=455672
    total: nonzeros=15079424, allocated nonzeros=15079424
    total number of mallocs used during MatSetValues calls =0
      using I-node routines: found 113918 nodes, limit used is 5
finished (1.395323e+02 seconds)


Second UMFPack solve
reading matrix...finished
UMFPACK V5.7.1 (Oct 10, 2014), Control:
    Matrix entry defined as: double
    Int (generic integer) defined as: int

    0: print level: 2
    1: dense row parameter:    0.2
        "dense" rows have    > max (16, (0.2)*16*sqrt(n_col) entries)
    2: dense column parameter: 0.2
        "dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
    3: pivot tolerance: 0.1
    4: block size for dense matrix kernels: 32
    5: strategy: 0 (auto)
    10: ordering: 1 AMD/COLAMD
    11: singleton filter: enabled
    6: initial allocation ratio: 0.7
    7: max iterative refinement steps: 2
    13: Q fixed during numerical factorization: 0 (auto)
    14: AMD dense row/col parameter:    10
       "dense" rows/columns have > max (16, (10)*sqrt(n)) entries
        Only used if the AMD ordering is used.
    15: diagonal pivot tolerance: 0.001
        Only used if diagonal pivoting is attempted.
    16: scaling: 1 (divide each row by sum of abs. values in each row)
    17: frontal matrix allocation ratio: 0.5
    18: drop tolerance: 0
    19: AMD and COLAMD aggressive absorption: 1 (yes)

    The following options can only be changed at compile-time:
    8: BLAS library used:  Fortran BLAS.  size of BLAS integer: 4
    compiled for ANSI C
    POSIX C clock_getttime.
    computer/operating system: Linux
    size of int: 4 SuiteSparse_long: 8 Int: 4 pointer: 8 double: 8 Entry: 8 (in bytes)

symbolic factorization...finished (7.242250e-01 seconds)
numeric factorization...finished (3.962819e+00 seconds)
backsolve...finished (6.805091e-01 seconds)
total elapsed time: 5.421341e+00 seconds
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./test on a arch-linux2-c-opt named baduk.scorec.rpi.edu with 1 processor, by creanj Fri Jul  5 10:16:56 2019
Using Petsc Release Version 3.7.6, Apr, 24, 2017 

                         Max       Max/Min        Avg      Total 
Time (sec):           1.519e+02      1.00000   1.519e+02
Objects:              1.200e+01      1.00000   1.200e+01
Flops:                3.612e+11      1.00000   3.612e+11  3.612e+11
Flops/sec:            2.378e+09      1.00000   2.378e+09  2.378e+09
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.5189e+02 100.0%  3.6120e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatSolve               1 1.0 1.8359e-01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1981
MatLUFactorSym         1 1.0 3.4562e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatLUFactorNum         1 1.0 1.3577e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 89100  0  0  0  89100  0  0  0  2658
MatAssemblyBegin       1 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 1.7591e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.5704e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 1.2642e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                2 1.0 9.4652e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                 3 1.0 1.8690e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSetUp               1 1.0 7.1526e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 1.3953e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 92100  0  0  0  92100  0  0  0  2589
PCSetUp                1 1.0 1.3935e+02 1.0 3.61e+11 1.0 0.0e+00 0.0e+00 0.0e+00 92100  0  0  0  92100  0  0  0  2589
PCApply                1 1.0 1.8360e-01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1980
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2   2372679668     0.
              Vector     2              2      7293808     0.
       Krylov Solver     1              1         1160     0.
      Preconditioner     1              1          992     0.
           Index Set     5              5      4560600     0.
              Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 2.38419e-08
#PETSc Option Table entries:
-ksp_type preonly
-ksp_view
-log_view
-mat_umfpack_prl 2
-pc_factor_mat_solver_type umfpack
-pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-blas-lib=/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib/libopenblas.so --with-lapack-lib=/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib/libopenblas.so --prefix=/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt --download-suitesparse=yes
-----------------------------------------
Libraries compiled on Fri Jul  5 08:48:14 2019 on baduk.scorec.rpi.edu 
Machine characteristics: Linux-3.10.0-957.12.2.el7.x86_64-x86_64-with-redhat-7.6-Maipo
Using PETSc directory: /lore/creanj/build/petsc-3.7.6
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fvisibility=hidden -O3 -march=native -mtune=native  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -O3 -march=native -mtune=native   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/include -I/lore/creanj/build/petsc-3.7.6/include -I/lore/creanj/build/petsc-3.7.6/include -I/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/include -I/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/include
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/lib -L/lore/creanj/build/petsc-3.7.6/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/lib -L/opt/scorec/ODL_common/blockade/petsc/3.7.6_opt/lib -Wl,-rpath,/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib -L/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2//lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-7.3.0/mpich-3.3-diz4f6ieln25ouifyc7ndtqlfksom6nb/lib -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-7.3.0/mpich-3.3-diz4f6ieln25ouifyc7ndtqlfksom6nb/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib/gcc/x86_64-pc-linux-gnu/7.3.0 -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib/gcc/x86_64-pc-linux-gnu/7.3.0 -Wl,-rpath,/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2/lib -L/opt/scorec/ODL_common/baduk/OpenBlas/0.3.2/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib -L/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib -Wl,-rpath,/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib:/opt/scorec/spack/install/linux-rhel7-x86_64/gcc-rhel7_4.8.5/gcc-7.3.0-bt47fwrzijla4xzdx4o4au45yljqptsk/lib64 -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lopenblas -lX11 -lpthread -lm -lmpifort -lgfortran -lm -lgfortran -lm -lquadmath -lmpicxx -lstdc++ -lm -ldl -lmpi -lgcc_s -ldl
-----------------------------------------

