Matvec is a bandwidth-limited operation, so adding more compute power will not usually make it go much faster. Hardware manufacturers don't tell you this stuff.
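
To put a rough number on it: an AIJ matvec has to stream the whole matrix through memory once per multiply, so memory bandwidth, not core count, sets the speed. Here is a back-of-the-envelope sketch; the nonzeros per row and the sustained bandwidth are assumed values, not measurements from your machine.

/* Back-of-the-envelope bound on one AIJ matvec from memory bandwidth alone.
 * rows matches the matrix size described below; nnz_per_row and sustained_bw
 * are assumptions, not measured values.  Vector traffic and cache effects are
 * ignored, so this is only a lower bound. */
#include <stdio.h>

int main(void)
{
  double rows         = 2.5e5;  /* matrix rows (250,000)                           */
  double nnz_per_row  = 7.0;    /* assumed average nonzeros per row (FEM Poisson)  */
  double bytes_per_nz = 12.0;   /* 8-byte double value + 4-byte column index (AIJ) */
  double sustained_bw = 8.0e9;  /* assumed sustained memory bandwidth, bytes/s     */

  double traffic = rows * nnz_per_row * bytes_per_nz;  /* matrix bytes streamed per matvec */
  double t_min   = traffic / sustained_bw;             /* best-case time for one matvec    */

  printf("~%.0f MB per matvec -> at least %.1f ms each, i.e. >= %.1f s for 1000 matvecs,\n"
         "no matter how many cores share that bandwidth.\n",
         traffic / 1.0e6, 1.0e3 * t_min, 1000.0 * t_min);
  return 0;
}

Once the memory bus is saturated, extra cores add flops but no additional bandwidth, which is why going from 4 to 8 processes buys so little here.
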
  Matt

On Sun, Jun 14, 2009 at 11:23 AM, Christian Klettner <christian.klettner@ucl.ac.uk> wrote:

Dear PETSc Team,

I have used Hypre's BoomerAMG to cut the iteration count in solving a
Poisson-type equation (i.e. Ax=b). The sparse matrix arises from a finite
element discretization of the Navier-Stokes equations. However, the
performance was very poor, so I checked the multiplication routine in my
code. Below are the results for 1000 matrix-vector products with a
250,000x250,000 matrix. The time for the multiplications goes from 15.8
seconds to ~11 seconds when changing from 4 to 8 cores. The ratios indicate
that there is good load balancing, so I was wondering whether this has to do
with how I configured PETSc, or whether it is down to my machine.
I am using a 2x quad-core 2.3 GHz Opteron (Shanghai).

Best regards,
Christian Klettner

************************************************************************************************************************
***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document                ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex4 on a linux-gnu named christian-desktop with 4 processors, by christian Sun Jun 14 16:48:24 2009
Using Petsc Release Version 3.0.0, Patch 4, Fri Mar 6 14:46:08 CST 2009

                         Max       Max/Min        Avg      Total
Time (sec):           1.974e+01      1.00119   1.973e+01
Objects:              1.080e+02      1.00000   1.080e+02
Flops:                8.078e+08      1.00163   8.070e+08  3.228e+09
Flops/sec:            4.095e+07      1.00232   4.090e+07  1.636e+08
Memory:               1.090e+08      1.00942              4.345e+08
MPI Messages:         2.071e+03      2.00000   1.553e+03  6.213e+03
MPI Message Lengths:  2.237e+06      2.00000   1.080e+03  6.712e+06
MPI Reductions:       7.250e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.9730e+01 100.0%  3.2281e+09 100.0%  6.213e+03 100.0%  1.080e+03      100.0%  2.120e+02  73.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
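
A minimal, self-contained sketch of the log stages mentioned above, used to put a MatMult loop in its own section of this report; the tridiagonal stand-in matrix, its size, and the stage name are illustrative and not taken from the actual FEM code, and the call style follows the PETSc 3.0.x API:

/* Sketch: bracket a MatMult loop in its own log stage so that -log_summary
 * reports it separately from assembly and setup.  The tridiagonal stand-in
 * matrix, the size n, and the stage name are illustrative only; error
 * checking (ierr/CHKERRQ) is omitted for brevity.  Check the man pages of
 * your installed PETSc version for exact calling sequences. */
#include "petscmat.h"

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, y;
  PetscInt      i, n = 250000, Istart, Iend;
  PetscLogStage stage;

  PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);

  /* Assemble a simple sparse stand-in matrix (1D Laplacian, <= 3 nonzeros per row). */
  MatCreateMPIAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n,
                  3, PETSC_NULL, 1, PETSC_NULL, &A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    PetscScalar v[3];
    PetscInt    col[3];
    v[0] = -1.0; v[1] = 2.0; v[2] = -1.0;
    col[0] = i - 1;                        /* negative column indices are ignored */
    col[1] = i;
    col[2] = (i + 1 < n) ? i + 1 : -1;
    MatSetValues(A, 1, &i, 3, col, v, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  MatGetVecs(A, &x, &y);                   /* x: right-hand vector, y = A*x */
  VecSet(x, 1.0);

  /* Everything between Push and Pop is logged under the named stage. */
  PetscLogStageRegister("MatMult loop", &stage);
  PetscLogStagePush(stage);
  for (i = 0; i < 1000; i++) MatMult(A, x, y);
  PetscLogStagePop();

  VecDestroy(x);                           /* 3.0.x-style destroy calls */
  VecDestroy(y);
  MatDestroy(A);
  PetscFinalize();
  return 0;
}

With the loop bracketed like this, -log_summary reports the multiplication phase on its own stage line, separate from assembly and setup.
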

##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run config/configure.py        #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################

Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 5 1.0 1.2703e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       3 1.0 2.9233e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+00  0  0  0  0  3   0  0  0  0  4     0
VecAssemblyEnd         3 1.0 2.2650e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     1003 1.0 1.8717e-01 4.1 0.00e+00 0.0 6.0e+03 1.1e+03 0.0e+00  1  0 97 95  0   1  0 97 95  0     0
VecScatterEnd       1003 1.0 5.3403e+00 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 20  0  0  0  0  20  0  0  0  0     0
MatMult             1000 1.0 1.5877e+01 1.0 8.08e+08 1.0 6.0e+03 1.1e+03 0.0e+00 80 100 97 95 0  80 100 97 95 0    203
MatAssemblyBegin       7 1.0 3.6728e-01 1.9 0.00e+00 0.0 6.3e+01 5.0e+03 1.4e+01  1  0  1  5  5   1  0  1  5  7     0
MatAssemblyEnd         7 1.0 8.6817e-01 1.2 0.00e+00 0.0 8.4e+01 2.7e+02 7.0e+01  4  0  1  0 24   4  0  1  0 33     0
MatZeroEntries         7 1.0 5.7693e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory   Descendants' Mem.

--- Event Stage 0: Main Stage

Application Order            2              0        0        0
Index Set                   30             30    18476        0
IS L to G Mapping           10              0        0        0
Vec                         30              7     9128        0
Vec Scatter                 15              0        0        0
Matrix                      21              0        0        0
========================================================================================================================
Average time to get PetscTime(): 2.14577e-07
Average time for MPI_Barrier(): 5.89848e-05
Average time for zero size MPI_Send(): 6.80089e-05
#PETSc Option Table entries:
-log_summary output1
#End o PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Fri Jun 12 16:59:30 2009
Configure options: --with-cc="gcc -fPIC" --download-mpich=1 --download-f-blas-lapack --download-triangle --download-parmetis --with-hypre=1 --download-hypre=1 --with-shared=0
-----------------------------------------
Libraries compiled on Fri Jun 12 17:11:54 BST 2009 on christian-desktop
Machine characteristics: Linux christian-desktop 2.6.27-7-generic #1 SMP Fri Oct 24 06:40:41 UTC 2008 x86_64 GNU/Linux
Using PETSc directory: /home/christian/Desktop/petsc-3.0.0-p4
Using PETSc arch: linux-gnu-c-debug
-----------------------------------------
Using C compiler: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran compiler: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall -Wno-unused-variable -g
-----------------------------------------
Using include paths: -I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include -I/home/christian/Desktop/petsc-3.0.0-p4/include -I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
------------------------------------------
Using C linker: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran linker: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall -Wno-unused-variable -g
Using libraries: -Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -ltriangle -lparmetis -lmetis -lHYPRE -lmpichcxx -lstdc++ -lflapack -lfblas -lnsl -lrt -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.3.2 -L/lib -ldl -lmpich -lpthread -lrt -lgcc_s -lmpichf90 -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/x86_64-linux-gnu -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
------------------------------------------

************************************************************************************************************************
***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document                ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex4 on a linux-gnu named christian-desktop with 8 processors, by christian Sun Jun 14 17:13:40 2009
Using Petsc Release Version 3.0.0, Patch 4, Fri Mar 6 14:46:08 CST 2009

                         Max       Max/Min        Avg      Total
Time (sec):           1.452e+01      1.01190   1.443e+01
Objects:              1.080e+02      1.00000   1.080e+02
Flops:                3.739e+08      1.00373   3.731e+08  2.985e+09
Flops/sec:            2.599e+07      1.01190   2.585e+07  2.068e+08
Memory:               5.157e+07      1.01231              4.117e+08
MPI Messages:         2.071e+03      2.00000   1.812e+03  1.450e+04
MPI Message Lengths:  2.388e+06      2.00000   1.153e+03  1.672e+07
MPI Reductions:       3.625e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.4431e+01 100.0%  2.9847e+09 100.0%  1.450e+04 100.0%  1.153e+03      100.0%  2.120e+02  73.1%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------

##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run config/configure.py        #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################

Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 5 1.0 6.1178e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       3 1.0 7.7400e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+00  0  0  0  0  3   0  0  0  0  4     0
VecAssemblyEnd         3 1.0 4.1008e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin     1003 1.0 1.0858e-01 2.9 0.00e+00 0.0 1.4e+04 1.1e+03 0.0e+00  1  0 97 95  0   1  0 97 95  0     0
VecScatterEnd       1003 1.0 5.3962e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 33  0  0  0  0  33  0  0  0  0     0
MatMult             1000 1.0 1.1430e+01 1.0 3.74e+08 1.0 1.4e+04 1.1e+03 0.0e+00 79 100 97 95 0  79 100 97 95 0    261
MatAssemblyBegin       7 1.0 4.6307e-01 1.8 0.00e+00 0.0 1.5e+02 5.3e+03 1.4e+01  3  0  1  5  5   3  0  1  5  7     0
MatAssemblyEnd         7 1.0 6.9013e-01 1.3 0.00e+00 0.0 2.0e+02 2.8e+02 7.0e+01  4  0  1  0 24   4  0  1  0 33     0
MatZeroEntries         7 1.0 2.7971e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory   Descendants' Mem.

--- Event Stage 0: Main Stage

Application Order            2              0        0        0
Index Set                   30             30    18476        0
IS L to G Mapping           10              0        0        0
Vec                         30              7     9128        0
Vec Scatter                 15              0        0        0
Matrix                      21              0        0        0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 0.000419807
Average time for zero size MPI_Send(): 0.000115991
#PETSc Option Table entries:
-log_summary output18
#End o PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Fri Jun 12 16:59:30 2009
Configure options: --with-cc="gcc -fPIC" --download-mpich=1 --download-f-blas-lapack --download-triangle --download-parmetis --with-hypre=1 --download-hypre=1 --with-shared=0
-----------------------------------------
Libraries compiled on Fri Jun 12 17:11:54 BST 2009 on christian-desktop
Machine characteristics: Linux christian-desktop 2.6.27-7-generic #1 SMP Fri Oct 24 06:40:41 UTC 2008 x86_64 GNU/Linux
Using PETSc directory: /home/christian/Desktop/petsc-3.0.0-p4
Using PETSc arch: linux-gnu-c-debug
-----------------------------------------
Using C compiler: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran compiler: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall -Wno-unused-variable -g
-----------------------------------------
Using include paths: -I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include -I/home/christian/Desktop/petsc-3.0.0-p4/include -I/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/include
------------------------------------------
Using C linker: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
Using Fortran linker: /home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/bin/mpif90 -Wall -Wno-unused-variable -g
Using libraries: -Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -ltriangle -lparmetis -lmetis -lHYPRE -lmpichcxx -lstdc++ -lflapack -lfblas -lnsl -lrt -L/home/christian/Desktop/petsc-3.0.0-p4/linux-gnu-c-debug/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.3.2 -L/lib -ldl -lmpich -lpthread -lrt -lgcc_s -lmpichf90 -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/x86_64-linux-gnu -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
------------------------------------------

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener