[petsc-users] Petsc Performance Help

Nan.Jia at Dartmouth.edu
Fri Mar 16 14:52:00 CDT 2012


Dear PETSc Group,
I have been tuning the efficiency of my PETSc code for a while, but have made very
little progress. Could anyone help me analyze the attached log? Any
suggestions will be appreciated.

My problem is time dependent. At every time step, two sparse linear systems
of size roughly 6000 by 6000, arising from a Poisson equation, need to be
solved. I have stored the matrices in both the sequential and the parallel
AIJ formats, but the performance is not very good in either case.
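For context, each time step does roughly the following with PETSc (a minimal
C sketch with placeholder matrix entries, right-hand side, and solver
settings, not my actual code):

#include <petscksp.h>

/* Minimal sketch: one sequential AIJ system solved once per time step. */
int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    n = 6000, i, step, nsteps = 10;
  PetscScalar one = 1.0;

  PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);

  /* Sequential AIJ matrix; about 7 nonzeros per row for a 3-D Poisson stencil */
  MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 7, PETSC_NULL, &A);
  for (i = 0; i < n; i++) {
    MatSetValues(A, 1, &i, 1, &i, &one, INSERT_VALUES); /* placeholder: real stencil entries go here */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreateSeq(PETSC_COMM_SELF, n, &x);
  VecDuplicate(x, &b);

  /* One KSP, created once and reused for every time step */
  KSPCreate(PETSC_COMM_SELF, &ksp);
  KSPSetOperators(ksp, A, A, SAME_PRECONDITIONER); /* PETSc 3.1-style call */
  KSPSetFromOptions(ksp);

  for (step = 0; step < nsteps; step++) {
    VecSet(b, one);      /* placeholder: the real right-hand side changes every step */
    KSPSolve(ksp, b, x); /* solve A x = b */
  }

  KSPDestroy(ksp);
  VecDestroy(x);
  VecDestroy(b);
  MatDestroy(A);
  PetscFinalize();
  return 0;
}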

Please let me know if you need more information about the code or the problem.

Thanks in advance!

Best,
Nan
-------------- next part --------------
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./3dtest5 on a linux-gnu named babylon3 with 1 processor, by nan_jia Fri Mar 16 15:13:12 2012
Using Petsc Release Version 3.1.0, Patch 8, Thu Mar 17 13:37:48 CDT 2011

                         Max       Max/Min        Avg      Total 
Time (sec):           8.810e+02      1.00000   8.810e+02
Objects:              9.000e+01      1.00000   9.000e+01
Flops:                4.963e+09      1.00000   4.963e+09  4.963e+09
Flops/sec:            5.634e+06      1.00000   5.634e+06  5.634e+06
Memory:               7.493e+06      1.00000              7.493e+06
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.732e+03      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 8.8098e+02 100.0%  4.9632e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.645e+03  96.8% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------


      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run config/configure.py        #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################


Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

KSPGMRESOrthog      3134 1.0 2.4914e-01 1.0 1.82e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   729
KSPSetup           20001 1.0 1.0103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0   0  0  0  0  0     0
KSPSolve           20001 1.0 1.8803e+01 1.0 4.96e+09 1.0 0.0e+00 0.0e+00 2.6e+03  2100  0  0 96   2100  0  0100   264
VecMDot             3134 1.0 8.5895e-02 1.0 9.08e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  1057
VecNorm            43141 1.0 3.4036e-01 1.0 5.47e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 11  0  0  0   0 11  0  0  0  1606
VecScale           23143 1.0 1.1922e-01 1.0 1.47e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  1230
VecCopy            20009 1.0 4.4645e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              2553 1.0 1.6527e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY            22556 1.0 2.2632e-01 1.0 2.86e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0  1263
VecMAXPY            5684 1.0 2.3512e-01 1.0 1.31e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0   555
VecAssemblyBegin   40002 1.0 1.8107e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd     40002 1.0 1.3301e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize       23143 1.0 3.9494e-01 1.0 4.40e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  9  0  0  0   0  9  0  0  0  1114
MatMult            23140 1.0 5.4058e+00 1.0 1.31e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 26  0  0  0   1 26  0  0  0   243
MatSolve           43141 1.0 1.1149e+01 1.0 2.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 49  0  0  0   1 49  0  0  0   220
MatLUFactorNum         2 1.0 2.5282e-03 1.0 1.14e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    45
MatILUFactorSym        2 1.0 3.0980e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin   20001 1.0 1.8443e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd     20001 1.0 4.5861e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatGetRowIJ            2 1.0 2.8610e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         2 1.0 2.2390e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                2 1.0 8.2610e-03 1.0 1.14e+05 1.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0   0  0  0  0  0    14
PCApply            43141 1.0 1.1242e+01 1.0 2.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 49  0  0  0   1 49  0  0  0   218
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

       Krylov Solver     2              2        36112     0
                 Vec    73             73      3794248     0
              Matrix     7              5      2412812     0
      Preconditioner     2              2         1520     0
           Index Set     6              6       155232     0
========================================================================================================================
Average time to get PetscTime(): 0
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Thu Feb  2 15:41:42 2012
Configure options: FOPTFLAGS=-O3
-----------------------------------------
Libraries compiled on Thu Feb  2 15:43:28 EST 2012 on babylon1 
Machine characteristics: Linux babylon1 2.6.32-37-generic #81-Ubuntu SMP Fri Dec 2 20:32:42 UTC 2011 x86_64 GNU/Linux 
Using PETSc directory: /thayerfs/research/anatoly/NAN/Petsc/petsc-3.1-p8
Using PETSc arch: linux-gnu-c-debug
-----------------------------------------
Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3   
Using Fortran compiler: mpif90 -Wall -Wno-unused-variable -O3    
-----------------------------------------
Using include paths: -I/thayerfs/research/anatoly/NAN/Petsc/petsc-3.1-p8/linux-gnu-c-debug/include -I/thayerfs/research/anatoly/NAN/Petsc/petsc-3.1-p8/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/lib  
------------------------------------------
Using C linker: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3 
Using Fortran linker: mpif90 -Wall -Wno-unused-variable -O3  
Using libraries: -Wl,-rpath,/thayerfs/research/anatoly/NAN/Petsc/petsc-3.1-p8/linux-gnu-c-debug/lib -L/thayerfs/research/anatoly/NAN/Petsc/petsc-3.1-p8/linux-gnu-c-debug/lib -lpetsc       -lX11 -llapack -lblas -L/usr/lib/openmpi/lib -L/thayerfs/apps/intel/Compiler/11.0/081/ipp/em64t/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.4.3 -L/thayerfs/apps/intel/Compiler/11.0/081/mkl/lib/em64t -L/mnt/thayerfs/anatoly/NAN/Petsc/petsc-3.1-p8 -L/usr/lib/x86_64-linux-gnu -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -lmpi_f90 -lmpi_f77 -lgfortran -lm -lm -lm -lm -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -ldl  
------------------------------------------

