Problems with multiplication scaling

Christian Klettner christian.klettner at ucl.ac.uk
Sun Jun 14 11:23:48 CDT 2009


Dear PETSc Team,
I have used Hypre's BoomerAMG to cut the iteration count in solving a
Poisson-type equation (i.e. Ax=b). The sparse matrix arises from a finite
element discretization of the Navier-Stokes equations. However, the
performance was very poor, so I checked the matrix-vector multiplication
routine in my code. Below are the results for 1000 products with a
250,000x250,000 matrix. The time for the multiplications only goes from
15.8 seconds to ~11 seconds when changing from 4 to 8 cores. The ratios
indicate good load balancing, so I was wondering whether this is to do
with how I configure PETSc, or whether it is a limitation of my machine.
I am using a 2x quad-core 2.3 GHz Opteron (Shanghai).
Best regards,
Christian Klettner
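
For reference, the timed operation is essentially the loop below (a minimal
sketch, not the actual code from this message; it assumes the assembled
250,000x250,000 AIJ matrix A and conforming parallel vectors x and y already
exist):

  #include <petscmat.h>

  /* Minimal sketch: repeat y = A*x 1000 times so that MatMult dominates
     the -log_summary profile. A, x and y are assumed to be created and
     assembled elsewhere. */
  static PetscErrorCode TimeMatMult(Mat A, Vec x, Vec y)
  {
    PetscErrorCode ierr;
    PetscInt       i;

    for (i = 0; i < 1000; i++) {
      ierr = MatMult(A, x, y);CHKERRQ(ierr);   /* sparse matrix-vector product */
    }
    return 0;
  }

Running such a program with -log_summary produces output like the two
summaries below.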

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex4 on a linux-gnu named christian-desktop with 4 processors, by christian Sun Jun 14 16:48:24 2009
Using Petsc Release Version 3.0.0, Patch 4, Fri Mar  6 14:46:08 CST 2009

                         Max       Max/Min        Avg      Total
Time (sec):           1.974e+01      1.00119   1.973e+01
Objects:              1.080e+02      1.00000   1.080e+02
Flops:                8.078e+08      1.00163   8.070e+08  3.228e+09
Flops/sec:            4.095e+07      1.00232   4.090e+07  1.636e+08
Memory:               1.090e+08      1.00942              4.345e+08
MPI Messages:         2.071e+03      2.00000   1.553e+03  6.213e+03
MPI Message Lengths:  2.237e+06      2.00000   1.080e+03  6.712e+06
MPI Reductions:       7.250e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops
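(As a worked example of this convention: a sparse matrix-vector product costs
roughly 2 flops per stored nonzero, so the 3.228e+09 total flops reported
below for 1000 products correspond to about 1.6e+06 nonzeros, i.e. roughly
6-7 nonzeros per row of the 250,000-row matrix, which is consistent with a
linear finite element discretization.)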

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.9730e+01 100.0%  3.2281e+09 100.0%  6.213e+03 100.0%  1.080e+03      100.0%  2.120e+02  73.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
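
The logging stages mentioned above can be used to separate the multiplication
loop from setup and assembly in this output. A minimal sketch (the stage name
is illustrative, and the PetscLogStageRegister() argument order shown is that
of current PETSc releases):

  PetscLogStage  stage;
  PetscErrorCode ierr;

  ierr = PetscLogStageRegister("MatMult benchmark", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  /* ... the 1000 MatMult calls shown earlier ... */
  ierr = PetscLogStagePop();CHKERRQ(ierr);

With that in place, the multiplication loop would be reported under its own
stage rather than under the Main Stage.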


      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run config/configure.py        #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################


Event                Count      Time (sec)     Flops                            --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 5 1.0 1.2703e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       3 1.0 2.9233e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+00  0  0  0  0  3   0  0  0  0  4     0
VecAssemblyEnd         3 1.0 2.2650e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     1003 1.0 1.8717e-01 4.1 0.00e+00 0.0 6.0e+03 1.1e+03 0.0e+00  1  0 97 95  0   1  0 97 95  0     0
VecScatterEnd       1003 1.0 5.3403e+00 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 20  0  0  0  0  20  0  0  0  0     0
MatMult             1000 1.0 1.5877e+01 1.0 8.08e+08 1.0 6.0e+03 1.1e+03 0.0e+00 80100 97 95  0  80100 97 95  0   203
MatAssemblyBegin       7 1.0 3.6728e-01 1.9 0.00e+00 0.0 6.3e+01 5.0e+03 1.4e+01  1  0  1  5  5   1  0  1  5  7     0
MatAssemblyEnd         7 1.0 8.6817e-01 1.2 0.00e+00 0.0 8.4e+01 2.7e+02 7.0e+01  4  0  1  0 24   4  0  1  0 33     0
MatZeroEntries         7 1.0 5.7693e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

   Application Order     2              0          0     0
           Index Set    30             30      18476     0
   IS L to G Mapping    10              0          0     0
                 Vec    30              7       9128     0
         Vec Scatter    15              0          0     0
              Matrix    21              0          0     0
========================================================================================================================
Average time to get PetscTime(): 2.14577e-07
Average time for MPI_Barrier(): 5.89848e-05
Average time for zero size MPI_Send(): 6.80089e-05
#PETSc Option Table entries:
-log_summary output1
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Fri Jun 12 16:59:30 2009
Configure options: --with-cc="gcc -fPIC" --download-mpich=1
--download-f-blas-lapack --download-triangle --download-parmetis
--with-hypre=1 --download-hypre=1 --with-shared=0
-----------------------------------------
Libraries compiled on Fri Jun 12 17:11:54 BST 2009 on christian-desktop
Machine characteristics: Linux christian-desktop 2.6.27-7-generic #1 SMP
Fri Oct 24 06:40:41 UTC 2008 x86_64 GNU/Linux
Using PETSc directory: /home/christian/Desktop/petsc-3.0.0-p4
Using PETSc arch: linux-gnu-c-debug
-----------------------------------------
Using C compiler:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall
-Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran compiler:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall
-Wno-unused-variable -g
-----------------------------------------
Using include paths:
-I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
-I/home/christian/Desktop/petsc-3.0.0-p4/include
-I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
------------------------------------------
Using C linker:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall
-Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran linker:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall
-Wno-unused-variable -g
Using libraries:
-Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -lpetscts
-lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc       
-Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -ltriangle
-lparmetis -lmetis -lHYPRE -lmpichcxx -lstdc++ -lflapack -lfblas -lnsl
-lrt -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/usr/lib/gcc/x86_64-linux-gnu/4.3.2 -L/lib -ldl -lmpich -lpthread -lrt
-lgcc_s -lmpichf90 -lgfortranbegin -lgfortran -lm
-L/usr/lib/gcc/x86_64-linux-gnu -lm -lmpichcxx -lstdc++ -lmpichcxx
-lstdc++ -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
------------------------------------------

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex4 on a linux-gnu named christian-desktop with 8 processors, by christian Sun Jun 14 17:13:40 2009
Using Petsc Release Version 3.0.0, Patch 4, Fri Mar  6 14:46:08 CST 2009

                         Max       Max/Min        Avg      Total
Time (sec):           1.452e+01      1.01190   1.443e+01
Objects:              1.080e+02      1.00000   1.080e+02
Flops:                3.739e+08      1.00373   3.731e+08  2.985e+09
Flops/sec:            2.599e+07      1.01190   2.585e+07  2.068e+08
Memory:               5.157e+07      1.01231              4.117e+08
MPI Messages:         2.071e+03      2.00000   1.812e+03  1.450e+04
MPI Message Lengths:  2.388e+06      2.00000   1.153e+03  1.672e+07
MPI Reductions:       3.625e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.4431e+01 100.0%  2.9847e+09 100.0%  1.450e+04 100.0%  1.153e+03      100.0%  2.120e+02  73.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------


      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run config/configure.py        #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################


Event                Count      Time (sec)     Flops                            --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 5 1.0 6.1178e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       3 1.0 7.7400e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+00  0  0  0  0  3   0  0  0  0  4     0
VecAssemblyEnd         3 1.0 4.1008e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     1003 1.0 1.0858e-01 2.9 0.00e+00 0.0 1.4e+04 1.1e+03 0.0e+00  1  0 97 95  0   1  0 97 95  0     0
VecScatterEnd       1003 1.0 5.3962e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 33  0  0  0  0  33  0  0  0  0     0
MatMult             1000 1.0 1.1430e+01 1.0 3.74e+08 1.0 1.4e+04 1.1e+03 0.0e+00 79100 97 95  0  79100 97 95  0   261
MatAssemblyBegin       7 1.0 4.6307e-01 1.8 0.00e+00 0.0 1.5e+02 5.3e+03 1.4e+01  3  0  1  5  5   3  0  1  5  7     0
MatAssemblyEnd         7 1.0 6.9013e-01 1.3 0.00e+00 0.0 2.0e+02 2.8e+02 7.0e+01  4  0  1  0 24   4  0  1  0 33     0
MatZeroEntries         7 1.0 2.7971e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

   Application Order     2              0          0     0
           Index Set    30             30      18476     0
   IS L to G Mapping    10              0          0     0
                 Vec    30              7       9128     0
         Vec Scatter    15              0          0     0
              Matrix    21              0          0     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 0.000419807
Average time for zero size MPI_Send(): 0.000115991
#PETSc Option Table entries:
-log_summary output18
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Fri Jun 12 16:59:30 2009
Configure options: --with-cc="gcc -fPIC" --download-mpich=1
--download-f-blas-lapack --download-triangle --download-parmetis
--with-hypre=1 --download-hypre=1 --with-shared=0
-----------------------------------------
Libraries compiled on Fri Jun 12 17:11:54 BST 2009 on christian-desktop
Machine characteristics: Linux christian-desktop 2.6.27-7-generic #1 SMP
Fri Oct 24 06:40:41 UTC 2008 x86_64 GNU/Linux
Using PETSc directory: /home/christian/Desktop/petsc-3.0.0-p4
Using PETSc arch: linux-gnu-c-debug
-----------------------------------------
Using C compiler:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall
-Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran compiler:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall
-Wno-unused-variable -g
-----------------------------------------
Using include paths:
-I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
-I/home/christian/Desktop/petsc-3.0.0-p4/include
-I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
------------------------------------------
Using C linker:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall
-Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran linker:
/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall
-Wno-unused-variable -g
Using libraries:
-Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -lpetscts
-lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc       
-Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -ltriangle
-lparmetis -lmetis -lHYPRE -lmpichcxx -lstdc++ -lflapack -lfblas -lnsl
-lrt -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib
-L/usr/lib/gcc/x86_64-linux-gnu/4.3.2 -L/lib -ldl -lmpich -lpthread -lrt
-lgcc_s -lmpichf90 -lgfortranbegin -lgfortran -lm
-L/usr/lib/gcc/x86_64-linux-gnu -lm -lmpichcxx -lstdc++ -lmpichcxx
-lstdc++ -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
------------------------------------------





