[petsc-users] zero pattern of result of matmatmult
Frederik Treue
frtr at fysik.dtu.dk
Tue Sep 17 08:57:22 CDT 2013
On Tue, 2013-09-17 at 08:15 -0500, Barry Smith wrote:
> When the linear system is fairly easy it is possible for the construction of the matrix to take a significant portion of the total time. Run with -log_summary and send the output, from this we may be able to make suggestions as to what further improvements can be made.
Attached.
>
> Is any part of the matrix constant? If so you can use MatStoreValues() and MatRetrieveValues() so that part does not have to be recomputed each time.
Not in terms of entries in the matrix. Mathematically, part of the
operator is constant, but all entries have changing parts added to them.
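Just so I'm sure I understand the pattern you mean - something like the
sketch below, where AssembleConstantPart() and AddTimeDependentPart() are
made-up placeholders for my own assembly code?

#include <petscmat.h>

/* Hypothetical placeholders for the application's own assembly routines. */
extern PetscErrorCode AssembleConstantPart(Mat);
extern PetscErrorCode AddTimeDependentPart(Mat,PetscReal);

/* Sketch of the MatStoreValues()/MatRetrieveValues() pattern: assemble the
   constant part once, snapshot it, then each step restore the snapshot and
   ADD_VALUES the time-dependent terms into the existing nonzero pattern. */
PetscErrorCode BuildOperator(Mat A,PetscReal t,PetscBool firstStep)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (firstStep) {
    ierr = AssembleConstantPart(A);CHKERRQ(ierr);  /* must end with MatAssemblyBegin/End */
    ierr = MatSetOption(A,MAT_NEW_NONZERO_LOCATIONS,PETSC_FALSE);CHKERRQ(ierr);
    ierr = MatStoreValues(A);CHKERRQ(ierr);        /* keep a copy of the constant entries */
  } else {
    ierr = MatRetrieveValues(A);CHKERRQ(ierr);     /* reset to the stored constant entries */
  }
  ierr = AddTimeDependentPart(A,t);CHKERRQ(ierr);  /* insert with ADD_VALUES only */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}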
>
> Are you calling MatSetValuesStencil() for each single entry or once per row or block row? Row or block row will be faster than for each entry.
Once per entry - that I should definitely improve. What's the logic
behind rows or block rows being faster? Is it simply that the fewer
calls to MatSetValuesStencil() I make, the better?
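For my own reference, here is roughly what I take "once per row" to mean -
a sketch assuming a 2D DA with one DOF and a 9-point box stencil, where
ComputeRowValues() is a made-up placeholder for whatever fills the
coefficients. Is that the kind of thing you mean?

#include <petscdmda.h>

/* Hypothetical helper: fills v[k] with the coefficient for column cols[k]
   in the row belonging to grid point (i,j). */
extern void ComputeRowValues(PetscInt i,PetscInt j,PetscInt nc,
                             const MatStencil cols[],PetscScalar v[]);

/* Sketch: one MatSetValuesStencil() call per row instead of nine, for a 2D
   DMDA with dof = 1 and a 9-point box stencil.  A is assumed to come from
   DMCreateMatrix() so the stencil mapping is attached.  Columns outside the
   physical domain are skipped; rows that need special boundary treatment
   (e.g. Dirichlet) would still be handled separately. */
PetscErrorCode AssembleByRow(DM da,Mat A)
{
  PetscErrorCode ierr;
  PetscInt       i,j,di,dj,nc,xs,ys,xm,ym,Mx,My;
  MatStencil     row,cols[9];
  PetscScalar    v[9];

  PetscFunctionBegin;
  ierr = DMDAGetInfo(da,NULL,&Mx,&My,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL);CHKERRQ(ierr);
  ierr = DMDAGetCorners(da,&xs,&ys,NULL,&xm,&ym,NULL);CHKERRQ(ierr);
  for (j=ys; j<ys+ym; j++) {
    for (i=xs; i<xs+xm; i++) {
      row.i = i; row.j = j;
      nc    = 0;
      for (dj=-1; dj<=1; dj++) {
        for (di=-1; di<=1; di++) {
          if (i+di < 0 || i+di >= Mx || j+dj < 0 || j+dj >= My) continue;
          cols[nc].i = i+di; cols[nc].j = j+dj; nc++;
        }
      }
      ComputeRowValues(i,j,nc,cols,v);
      ierr = MatSetValuesStencil(A,1,&row,nc,cols,v,INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}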
>
> Since you are using MatSetValuesStencil() I assume you are using a DA?
I am, with one DOF per point. Are there better alternatives?
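For reference, my setup is roughly the following sketch (3.4-era names;
Mx, My and the boundary types are placeholders for what the run actually
uses):

#include <petscdmda.h>

/* Sketch of the DMDA setup assumed above: one DOF per grid point and a box
   stencil of width 1, which is what a 9-point stencil needs. */
PetscErrorCode CreateGridAndOperator(PetscInt Mx,PetscInt My,DM *da,Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DMDACreate2d(PETSC_COMM_WORLD,DMDA_BOUNDARY_NONE,DMDA_BOUNDARY_NONE,
                      DMDA_STENCIL_BOX,Mx,My,PETSC_DECIDE,PETSC_DECIDE,
                      1,1,NULL,NULL,da);CHKERRQ(ierr);
  ierr = DMCreateMatrix(*da,MATAIJ,A);CHKERRQ(ierr); /* preallocated to the box-stencil pattern */
  PetscFunctionReturn(0);
}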
PS. FYI: my stencil is essentially a standard 2D 9-point box stencil,
but the values differ from row to row (due, among other things, to the
geometric tensor).
Thanks for the help...
/Frederik
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./petsclapl on a arch-linux2-c-debug named frtr-laptop with 2 processors, by frtr Tue Sep 17 14:24:35 2013
Using Petsc Release Version 3.4.1, Jun, 10, 2013
Max Max/Min Avg Total
Time (sec): 4.045e+02 1.00001 4.045e+02
Objects: 1.336e+03 1.00000 1.336e+03
Flops: 8.196e+08 1.00000 8.196e+08 1.639e+09
Flops/sec: 2.026e+06 1.00001 2.026e+06 4.052e+06
Memory: 2.755e+07 1.00000 5.510e+07
MPI Messages: 3.902e+03 1.00000 3.902e+03 7.804e+03
MPI Message Lengths: 6.317e+06 1.00000 1.619e+03 1.263e+07
MPI Reductions: 3.193e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 4.0449e+02 100.0% 1.6392e+09 100.0% 7.804e+03 100.0% 1.619e+03 100.0% 3.193e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %f - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
ThreadCommRunKer 1 1.0 9.8039e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ThreadCommBarrie 1 1.0 6.7013e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 250 1.0 9.3867e-01 1.1 4.59e+06 1.0 0.0e+00 0.0e+00 2.5e+02 0 1 0 0 1 0 1 0 0 1 10
VecTDot 5782 1.0 5.8286e+01 1.0 5.90e+07 1.0 0.0e+00 0.0e+00 5.8e+03 14 7 0 0 18 14 7 0 0 18 2
VecNorm 3341 1.0 1.1062e+01 1.1 3.41e+07 1.0 0.0e+00 0.0e+00 3.3e+03 3 4 0 0 10 3 4 0 0 10 6
VecScale 452 1.0 3.6935e+00 1.0 2.31e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 1
VecCopy 308 1.0 1.0530e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 3756 1.0 7.7467e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 6885 1.0 4.3978e+01 1.0 6.97e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 9 0 0 0 11 9 0 0 0 3
VecAYPX 2891 1.0 1.2989e+00 1.0 2.92e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 45
VecMAXPY 350 1.0 1.7310e-01 1.2 7.14e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 82
VecAssemblyBegin 36 1.0 2.2704e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 36 1.0 5.6790e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecPointwiseMult 491 1.0 1.9725e-01 1.1 2.50e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 25
VecScatterBegin 3900 1.0 1.3175e+00 1.0 0.00e+00 0.0 7.6e+03 1.6e+03 0.0e+00 0 0 97 99 0 0 0 97 99 0 0
VecScatterEnd 3900 1.0 3.3521e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 350 1.0 3.4656e+00 1.0 5.36e+06 1.0 0.0e+00 0.0e+00 3.5e+02 1 1 0 0 1 1 1 0 0 1 3
MatMult 3600 1.0 2.0704e+01 1.0 3.10e+08 1.0 7.2e+03 1.6e+03 0.0e+00 5 38 92 93 0 5 38 92 93 0 30
MatSolve 3341 1.0 1.2289e+01 1.0 2.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 3 35 0 0 0 3 35 0 0 0 46
MatLUFactorNum 56 1.0 3.4288e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 5
MatILUFactorSym 56 1.0 9.8914e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.6e+01 0 0 0 0 0 0 0 0 0 0 0
MatCopy 212 1.0 1.9453e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.2e+01 5 0 0 0 0 5 0 0 0 0 0
MatScale 176 1.0 5.4831e-01 1.0 8.07e+06 1.0 0.0e+00 0.0e+00 2.2e+01 0 1 0 0 0 0 1 0 0 0 29
MatAssemblyBegin 85 1.0 1.5119e+00 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.7e+02 0 0 0 0 1 0 0 0 0 1 0
MatAssemblyEnd 85 1.0 1.5816e+00 1.0 0.00e+00 0.0 1.9e+02 4.9e+02 1.1e+03 0 0 2 1 4 0 0 2 1 4 0
MatGetRow 285600 1.0 2.5095e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 6 0 0 0 0 0
MatGetRowIJ 56 1.0 4.7718e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 56 1.0 6.4522e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 18 1.0 1.8103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 174 1.0 7.3996e+01 1.0 0.00e+00 0.0 8.8e+01 4.1e+02 1.0e+03 18 0 1 0 3 18 0 1 0 3 0
MatMatMult 1 1.0 9.7952e-01 1.0 8.17e+05 1.0 1.2e+01 6.5e+03 6.2e+01 0 0 0 1 0 0 0 0 1 0 2
MatMatMultSym 1 1.0 8.2768e-01 1.0 0.00e+00 0.0 1.0e+01 4.9e+03 5.6e+01 0 0 0 0 0 0 0 0 0 0 0
MatMatMultNum 1 1.0 1.5126e-01 1.0 8.17e+05 1.0 2.0e+00 1.5e+04 6.0e+00 0 0 0 0 0 0 0 0 0 0 11
MatGetLocalMat 2 1.0 3.4862e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 2 1.0 1.4985e-02 1.0 0.00e+00 0.0 8.0e+00 9.3e+03 6.0e+00 0 0 0 1 0 0 0 0 1 0 0
KSPGMRESOrthog 250 1.0 1.2155e+00 1.1 9.18e+06 1.0 0.0e+00 0.0e+00 7.0e+02 0 1 0 0 2 0 1 0 0 2 15
KSPSetUp 112 1.0 6.0548e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.4e+02 0 0 0 0 2 0 0 0 0 2 0
KSPSolve 150 1.0 2.3976e+02 1.0 7.62e+08 1.0 6.4e+03 1.6e+03 2.7e+04 59 93 82 82 85 59 93 82 82 85 6
PCSetUp 112 1.0 5.5250e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 7.8e+02 1 1 0 0 2 1 1 0 0 2 3
PCSetUpOnBlocks 150 1.0 5.5561e+00 1.0 7.72e+06 1.0 0.0e+00 0.0e+00 3.4e+02 1 1 0 0 1 1 1 0 0 1 3
PCApply 3341 1.0 4.7031e+01 1.0 2.84e+08 1.0 0.0e+00 0.0e+00 6.7e+03 12 35 0 0 21 12 35 0 0 21 12
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 561 547 14123724 0
Vector Scatter 55 50 31272 0
Matrix 199 216 68839696 0
Distributed Mesh 4 2 5572 0
Bipartite Graph 8 4 1808 0
Index Set 280 280 1354392 0
IS L to G Mapping 4 2 1076 0
Krylov Solver 112 112 181256 0
Preconditioner 112 112 60704 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 6.2055e-06
Average time for MPI_Barrier(): 0.000123
Average time for zero size MPI_Send(): 5e-05
#PETSc Option Table entries:
-DIFFUSION 0.1
-DT 0.01
-ET 0.5
-G 2.0
-LOGGAUSSIAN 1,5,15,10.0,2.0
-NODRAW
-OUTSTEP 1
-SQ 30.0,30.0
-SX 100,100,10
-SY 100,100,10
-T 19
-TIMEDEP /home/frtr/tstruns/param.ini
-VISCOSITY 0.1
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Mon Jul 8 14:50:33 2013
Configure options: --prefix=/home/frtr --download-fftw --download-blacs --download-blas --download-lapack --with-mpi-dir=/home/frtr --download-scalapack --with-debugging=1 COPTFLAGS="-O3 -march=native"
-----------------------------------------
Libraries compiled on Mon Jul 8 14:50:33 2013 on frtr-laptop
Machine characteristics: Linux-3.2.0-49-generic-pae-i686-with-Ubuntu-12.04-precise
Using PETSc directory: /home/frtr/work_lib/petsc-3.4.1
Using PETSc arch: arch-linux2-c-debug
-----------------------------------------
Using C compiler: /home/frtr/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O3 -march=native ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /home/frtr/bin/mpif90 -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/include -I/home/frtr/work_lib/petsc-3.4.1/include -I/home/frtr/work_lib/petsc-3.4.1/include -I/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/include -I/home/frtr/include
-----------------------------------------
Using C linker: /home/frtr/bin/mpicc
Using Fortran linker: /home/frtr/bin/mpif90
Using libraries: -Wl,-rpath,/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -L/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -lpetsc -Wl,-rpath,/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -L/home/frtr/work_lib/petsc-3.4.1/arch-linux2-c-debug/lib -lscalapack -llapack -lblas -lX11 -lpthread -lfftw3_mpi -lfftw3 -lm -Wl,-rpath,/home/frtr/lib -L/home/frtr/lib -Wl,-rpath,/usr/lib/gcc/i686-linux-gnu/4.6 -L/usr/lib/gcc/i686-linux-gnu/4.6 -Wl,-rpath,/usr/lib/i386-linux-gnu -L/usr/lib/i386-linux-gnu -Wl,-rpath,/lib/i386-linux-gnu -L/lib/i386-linux-gnu -lmpichf90 -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lmpl -lrt -lpthread -lgcc_s -ldl
-----------------------------------------