[petsc-users] Reducing cost of MatSetValues
Barry Smith
bsmith at mcs.anl.gov
Wed Jun 1 16:15:41 CDT 2011
If you run with -info and grep the output for "Stash has", do you see many stashed values? These are values computed on one process that must be shipped to another process. Likely you have done a good job of partitioning and there are not many of them relative to the number of values set locally. If the number of mallocs reported there is high, you can run with a large enough -matstash_initial_size so that no additional mallocs are needed.
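If the stash mallocs do turn out to matter, the stash size can also be set from code instead of the options database. A minimal sketch, assuming MatStashSetInitialSize() is available in your version; the size of 10000 entries is only a placeholder:

    #include <petscmat.h>

    /* Reserve stash space up front so off-process values set during assembly
       do not trigger additional mallocs.  The size (number of stashed entries)
       is problem dependent; 10000 here is only a placeholder. */
    PetscErrorCode PresetStashSize(Mat A)
    {
      PetscErrorCode ierr;

      ierr = MatStashSetInitialSize(A,10000,0);CHKERRQ(ierr);
      return 0;
    }
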
Also if you are hardwired to the MPIAIJ format you can call MatSetValues_MPIAIJ() directly instead of MatSetValues() (just add a prototype for this function to your source code). This could shave off a couple percent.
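A minimal sketch of that suggestion, assuming MatSetValues_MPIAIJ() has the same argument list as MatSetValues() (check src/mat/impls/aij/mpi/mpiaij.c in your PETSc tree to be sure):

    #include <petscmat.h>

    /* The MPIAIJ implementation is not declared in the public headers,
       so declare the prototype yourself, as suggested above. */
    extern PetscErrorCode MatSetValues_MPIAIJ(Mat,PetscInt,const PetscInt[],
                                              PetscInt,const PetscInt[],
                                              const PetscScalar[],InsertMode);

    /* Insert one local element matrix, bypassing the generic MatSetValues()
       dispatch.  Only safe when A really has type MPIAIJ. */
    PetscErrorCode SetElementValuesDirect(Mat A,PetscInt nen,const PetscInt dof[],
                                          const PetscScalar emat[])
    {
      PetscErrorCode ierr;

      ierr = MatSetValues_MPIAIJ(A,nen,dof,nen,dof,emat,ADD_VALUES);CHKERRQ(ierr);
      return 0;
    }

With roughly 89.5 million MatSetValues() calls in the log below, even a small per-call saving adds up.
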
I am hard pressed to see any way to shrink the code in MatSetValues_MPIAIJ() any more than it already is. Since your linear solves are pretty cheap, it is tough to make setting the values a small percentage of the total time.
Using -snes_lag_jacobian 2 or 3 (or even bigger) might help.
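If you would rather set the lag from code than from the options database, something like the following should work, assuming SNESSetLagJacobian() is available in your PETSc version:

    #include <petscsnes.h>

    /* Rebuild the Jacobian only every other Newton step; the lag of 2 mirrors
       -snes_lag_jacobian 2 and is just an example value. */
    PetscErrorCode LagTheJacobian(SNES snes)
    {
      PetscErrorCode ierr;

      ierr = SNESSetLagJacobian(snes,2);CHKERRQ(ierr);
      return 0;
    }
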
Barry
On Jun 1, 2011, at 2:53 PM, John Fettig wrote:
> What are the recommendations for reducing the cost of inserting values
> into an AIJ matrix? In my application (transient finite element
> solution of flow and heat, linear elements), this is accounting for up
> to 20% of overall runtime. Is this expected?
>
> I have double checked that the matrices are preallocated correctly,
> and I have set MAT_NEW_NONZERO_ALLOCATION_ERR and it runs without
> error.
>
> The matrices periodically change size/nonzero pattern; between those changes
> the values are zeroed out and MatSetOption(mat, MAT_KEEP_NONZERO_PATTERN,
> PETSC_TRUE) is called.
>
> MatSetValues is called once per element, passing the local element matrix.
> I re-activated the MAT_SetValues event and have included a -log_summary
> from a short run.
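For concreteness, a bare-bones sketch of the assembly pass described above, with hypothetical element connectivity (edof) and element matrix (emat) arrays:

    #include <petscmat.h>

    /* One assembly pass over a matrix whose nonzero pattern is being reused:
       keep the pattern, zero the old values, then add each local element
       matrix with a single MatSetValues() call per element. */
    PetscErrorCode AssembleSystem(Mat A,PetscInt nelem,PetscInt nen,
                                  const PetscInt edof[],const PetscScalar emat[])
    {
      PetscErrorCode ierr;
      PetscInt       e;

      ierr = MatSetOption(A,MAT_KEEP_NONZERO_PATTERN,PETSC_TRUE);CHKERRQ(ierr);
      ierr = MatSetOption(A,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_TRUE);CHKERRQ(ierr);
      ierr = MatZeroEntries(A);CHKERRQ(ierr);
      for (e=0; e<nelem; e++) {
        ierr = MatSetValues(A,nen,&edof[e*nen],nen,&edof[e*nen],
                            &emat[e*nen*nen],ADD_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      return 0;
    }
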
>
> Thanks,
> John
>
> ************************************************************************************************************************
> ***    WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document    ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./test on a intel-opt named lagrange with 1 processor, by jfe Wed Jun 1 15:49:20 2011
> Using Petsc Release Version 3.1.0, Patch 8, Thu Mar 17 13:37:48 CDT 2011
>
> Max Max/Min Avg Total
> Time (sec): 2.242e+02 1.00000 2.242e+02
> Objects: 2.718e+03 1.00000 2.718e+03
> Flops: 2.239e+10 1.00000 2.239e+10 2.239e+10
> Flops/sec: 9.986e+07 1.00000 9.986e+07 9.986e+07
> MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Reductions: 2.062e+04 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 2.2423e+02 100.0%  2.2391e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.796e+04  87.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase      %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> KSPSetup              532 1.0 2.6963e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+02  0  0  0  0  1   0  0  0  0  1      0
> KSPSolve              277 1.0 3.2930e+01 1.0 2.17e+10 1.0 0.0e+00 0.0e+00 9.9e+03 15 97  0  0 48  15 97  0  0 55    660
> PCSetUp               277 1.0 8.3035e+00 1.0 7.52e+08 1.0 0.0e+00 0.0e+00 1.4e+03  4  3  0  0  7   4  3  0  0  8     91
> PCApply              4907 1.0 9.1353e+00 1.0 7.08e+09 1.0 0.0e+00 0.0e+00 2.0e+00  4 32  0  0  0   4 32  0  0  0    775
> MatMult             12821 1.0 1.5110e+01 1.0 1.34e+10 1.0 0.0e+00 0.0e+00 0.0e+00  7 60  0  0  0   7 60  0  0  0    888
> MatMultAdd           3668 1.0 6.9999e-01 1.0 2.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0    404
> MatSolve              917 1.0 5.3644e-04 1.0 9.17e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      2
> MatSOR               7336 1.0 5.7982e+00 1.0 4.62e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 21  0  0  0   3 21  0  0  0    797
> MatLUFactorSym         51 1.0 3.1686e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.1e+01  0  0  0  0  0   0  0  0  0  0      0
> MatLUFactorNum         51 1.0 1.2851e-04 1.0 5.10e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatAssemblyBegin     1068 1.0 1.4639e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatAssemblyEnd       1068 1.0 7.9752e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatSetValues     89511956 1.0 4.6221e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 21  0  0  0  0  21  0  0  0  0      0
> MatGetRowIJ            51 1.0 7.7248e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatGetOrdering         51 1.0 5.0688e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+02  0  0  0  0  0   0  0  0  0  1      0
> MatZeroEntries        133 1.0 1.7668e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecMax                 85 1.0 6.9754e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.5e+01  0  0  0  0  0   0  0  0  0  0      0
> VecDot               5479 1.0 3.5107e-01 1.0 8.84e+08 1.0 0.0e+00 0.0e+00 5.5e+03  0  4  0  0 27   0  4  0  0 31   2517
> VecNorm              2937 1.0 1.5535e+00 1.0 4.81e+08 1.0 0.0e+00 0.0e+00 2.9e+03  1  2  0  0 14   1  2  0  0 16    310
> VecCopy               955 1.0 8.6689e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecSet               6718 1.0 9.4952e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecAXPY              6937 1.0 6.1062e-01 1.0 1.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0   1891
> VecAYPX              4483 1.0 1.4344e-01 1.0 1.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   1073
> VecWAXPY             6624 1.0 9.5558e-01 1.0 1.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0   1089
> VecAssemblyBegin     1286 1.0 5.1773e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.9e+03  0  0  0  0 19   0  0  0  0 21      0
> VecAssemblyEnd       1286 1.0 1.2426e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecPointwiseMult     3990 1.0 6.7400e-01 1.0 3.57e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0    529
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Krylov Solver 258 258 218416 0
> Preconditioner 258 258 193856 0
> Matrix 666 666 140921396 0
> Vec 1383 1382 244199352 0
> Index Set 153 153 81396 0
> ========================================================================================================================
> Average time to get PetscTime(): 0
> #PETSc Option Table entries:
> -log_summary
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Sun May 22 22:30:16 2011
> Configure options: --with-x=0 --download-f-blas-lapack=0
> --with-blas-lapack-dir=/opt/intel/Compiler/11.1/072/mkl/lib/em64t
> --with-mpi=1 --with-mpi-shared=1
> --with-mpi-include=/usr/local/encap/hpmpi-8.01/include
> --with-mpi-lib="[/usr/local/encap/hpmpi-8.01/lib/linux_amd64/libmpi.so,/usr/local/encap/hpmpi-8.01/lib/linux_amd64/libfmpi.so]"
> --with-mpi=1 --download-mpich=no --with-debugging=0
> --with-gnu-compilers=no --with-vendor-compilers=intel --with-cc=icc
> --with-cxx=icpc --with-fc=ifort --with-shared=1 --with-c++-support
> --with-clanguage=C --COPTFLAGS="-fPIC -O3 -xSSE4.2 -g -debug
> inline_debug_info" --CXXOPTFLAGS="-fPIC -O3 -xSSE4.2 -g -debug
> inline_debug_info" --FOPTFLAGS="-fPIC -O3 -xSSE4.2 -g -debug
> inline_debug_info" --download-scalapack=no --download-blacs=no
> --with-blacs=1 --with-blacs-lib=/opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_blacs_lp64.a
> --with-blacs-include=/opt/intel/Compiler/11.1/072/mkl/include
> --with-scalapack=1
> --with-scalapack-lib="[/opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.a,/opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_intel_lp64.a,/opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_intel_thread.a,/opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_core.a]"
> --with-scalapack-include=/opt/intel/Compiler/11.1/072/mkl/include
> --download-umfpack=1 --download-parmetis=1 --download-superlu_dist=1
> --download-mumps=1 --download-ml=1 --with-hypre=1 --download-hypre=yes
> -----------------------------------------
> Libraries compiled on Thu May 26 16:49:30 EDT 2011 on lagrange
> Machine characteristics: Linux lagrange 2.6.39-0.el5.elrepo #1 SMP
> PREEMPT Sat May 21 04:48:38 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /home/jfe/local/centos/petsc-3.1-p8
> Using PETSc arch: intel-opt
> -----------------------------------------
> Using C compiler: icc -fPIC -fPIC -O3 -xSSE4.2 -g -debug inline_debug_info
> Using Fortran compiler: ifort -fPIC -fPIC -O3 -xSSE4.2 -g -debug
> inline_debug_info
> -----------------------------------------
> Using include paths:
> -I/home/jfe/local/centos/petsc-3.1-p8/intel-opt/include
> -I/home/jfe/local/centos/petsc-3.1-p8/include
> -I/home/jfe/local/centos/petsc-3.1-p8/intel-opt/include
> -I/usr/local/encap/hpmpi-8.01/include
> ------------------------------------------
> Using C linker: icc -fPIC -fPIC -O3 -xSSE4.2 -g -debug inline_debug_info
> Using Fortran linker: ifort -fPIC -fPIC -O3 -xSSE4.2 -g -debug
> inline_debug_info
> Using libraries:
> -Wl,-rpath,/home/jfe/local/centos/petsc-3.1-p8/intel-opt/lib
> -L/home/jfe/local/centos/petsc-3.1-p8/intel-opt/lib -lpetsc
> -Wl,-rpath,/home/jfe/local/centos/petsc-3.1-p8/intel-opt/lib
> -L/home/jfe/local/centos/petsc-3.1-p8/intel-opt/lib -lcmumps -ldmumps
> -lsmumps -lzmumps -lmumps_common -lpord
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/mkl/lib/em64t
> -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -lmkl_scalapack_lp64
> -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_lp64
> -lHYPRE -Wl,-rpath,/opt/intel/Compiler/11.1/072/lib/intel64
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/ipp/em64t/lib
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -lstdc++ -lml
> -lstdc++ -lsuperlu_dist_2.4 -lparmetis -lmetis -lumfpack -lamd
> -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
> -Wl,-rpath,/usr/local/encap/hpmpi-8.01/lib/linux_amd64
> -L/usr/local/encap/hpmpi-8.01/lib/linux_amd64 -lmpi -lfmpi -ldl
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/lib/intel64
> -L/opt/intel/Compiler/11.1/072/lib/intel64
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/ipp/em64t/lib
> -L/opt/intel/Compiler/11.1/072/ipp/em64t/lib
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/mkl/lib/em64t
> -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t
> -Wl,-rpath,/opt/intel/Compiler/11.1/072/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> -L/opt/intel/Compiler/11.1/072/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.1.2
> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -limf -lsvml -lipgo -ldecimal
> -lgcc_s -lirc -lirc_s -lifport -lifcore -lm -lpthread -lm -lstdc++
> -lstdc++ -ldl -limf -lsvml -lipgo -ldecimal -lgcc_s -lirc -lirc_s -ldl
> ------------------------------------------