[petsc-users] [petsc-maint] Speedup problem when using OpenMP?
Danyang Su
danyang.su at gmail.com
Mon Nov 4 13:42:45 CST 2013
Hi Karli,
This does not make any difference. I have scaled up the matrix, but the
performance does not change. When I run with OpenMP, the iteration count
is always the same no matter how many threads are used. This seems quite
strange, since the iteration count usually increases with the number of
processes when running with MPI. I think I should move to the Ubuntu
system for further testing, to see whether this is a Windows problem.
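In case it helps, the iteration counts I am comparing can be checked
directly by adding -ksp_converged_reason (or -ksp_monitor) to the runs;
a rough sketch, with the executable and problem size otherwise as in the
commands quoted below:

mpiexec -n 4 Petsc-windows-ex2f.exe -m 1000 -n 1000 -ksp_converged_reason
Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 1000 -n 1000 -ksp_converged_reason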
Thanks,
Danyang
On 04/11/2013 6:51 AM, Karl Rupp wrote:
> Hi,
>
>> I have a question on the speedup of PETSc when using OpenMP. I can get
>> good speedup when using MPI, but no speedup when using OpenMP.
>> The example is ex2f with m=100 and n=100. The number of available
>> processors is 16 (32 threads) and the OS is Windows Server 2012. The log
>> files for 4 and 8 processors are attached.
>>
>> The commands I used to run with 4 processors are as follows:
>> Run using MPI
>> mpiexec -n 4 Petsc-windows-ex2f.exe -m 100 -n 100 -log_summary log_100x100_mpi_p4.log
>>
>> Run using OpenMP
>> Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 100 -n 100 -log_summary log_100x100_openmp_p4.log
>>
>> The PETSc used for this test is PETSc for Windows
>> (http://www.mic-tc.ch/downloads/PETScForWindows.zip), but I guess this
>> is not the problem, because the same problem exists when I use
>> PETSc-dev in Cygwin. I don't know whether this problem exists on Linux;
>> would anybody help to test?
>
> For the 100x100 case considered, the execution times per call are
> somewhere in the millisecond to sub-millisecond range (e.g. 1.3 ms for
> 68 calls to VecScale with 4 processors). I'd say this is too small to
> see any reasonable performance gain when running multiple threads;
> consider problem sizes of about 1000x1000 instead.
>
> Moreover, keep in mind that you typically won't get perfectly linear
> scaling with the number of processor cores, because ultimately memory
> bandwidth is the limiting factor for standard vector operations.
>
> Best regards,
> Karli
>
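The logs attached below are for the scaled-up 1000x1000 case. Judging
from the option tables at the end of each log, the runs were launched
along these lines:

Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 1000 -n 1000 -log_summary log_1000x1000_openmp_p4.log
Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 8 -m 1000 -n 1000 -log_summary log_1000x1000_openmp_p8.log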
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
Petsc-windows-ex2f.exe on a arch-mswin-c-opt named STARGAZER2012 with 1 processor, by danyang Mon Nov 04 09:56:14 2013
With 4 threads per MPI_Comm
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.851e+002 1.00000 1.851e+002
Objects: 4.500e+001 1.00000 4.500e+001
Flops: 2.203e+011 1.00000 2.203e+011 2.203e+011
Flops/sec: 1.190e+009 1.00000 1.190e+009 1.190e+009
MPI Messages: 0.000e+000 0.00000 0.000e+000 0.000e+000
MPI Message Lengths: 0.000e+000 0.00000 0.000e+000 0.000e+000
MPI Reductions: 5.236e+003 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 1.8506e+002 100.0% 2.2028e+011 100.0% 0.000e+000 0.0% 0.000e+000 0.0% 5.235e+003 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %f - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 2657 1.0 3.8941e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 21 11 0 0 0 21 11 0 0 0 614
MatSolve 2657 1.0 5.3882e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11 0 0 0 29 11 0 0 0 443
MatLUFactorNum 1 1.0 1.0694e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 103
MatILUFactorSym 1 1.0 6.7413e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 1.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 5.6889e-007 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 6.2112e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 2.2756e-006 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 9.1984e-003 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 2.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecMDot 2571 1.0 3.6956e+001 1.0 7.95e+010 1.0 0.0e+000 0.0e+000 2.6e+003 20 36 0 0 49 20 36 0 0 49 2152
VecNorm 2658 1.0 9.5952e-001 1.0 5.32e+009 1.0 0.0e+000 0.0e+000 2.7e+003 1 2 0 0 51 1 2 0 0 51 5540
VecScale 2657 1.0 3.6170e+000 1.0 2.66e+009 1.0 0.0e+000 0.0e+000 0.0e+000 2 1 0 0 0 2 1 0 0 0 735
VecCopy 86 1.0 1.5124e-001 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecSet 88 1.0 9.7820e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 172 1.0 9.2293e-002 1.0 3.44e+008 1.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 3727
VecMAXPY 2657 1.0 4.9379e+001 1.0 8.47e+010 1.0 0.0e+000 0.0e+000 0.0e+000 27 38 0 0 0 27 38 0 0 0 1714
VecNormalize 2657 1.0 4.5800e+000 1.0 7.97e+009 1.0 0.0e+000 0.0e+000 2.7e+003 2 4 0 0 51 2 4 0 0 51 1740
KSPGMRESOrthog 2571 1.0 8.3362e+001 1.0 1.59e+011 1.0 0.0e+000 0.0e+000 2.6e+003 45 72 0 0 49 45 72 0 0 49 1908
KSPSetUp 1 1.0 2.0327e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.8441e+002 1.0 2.20e+011 1.0 0.0e+000 0.0e+000 5.2e+003100100 0 0100 100100 0 0100 1194
PCSetUp 1 1.0 1.8362e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 3.0e+000 0 0 0 0 0 0 0 0 0 0 60
PCApply 2657 1.0 5.3887e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11 0 0 0 29 11 0 0 0 443
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 2 2 151957324 0
Vector 37 37 296057128 0
Krylov Solver 1 1 18360 0
Preconditioner 1 1 976 0
Index Set 3 3 4002280 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 5.68889e-008
#PETSc Option Table entries:
-log_summary log_1000x1000_openmp_p4.log
-m 1000
-n 1000
-threadcomm_nthreads 4
-threadcomm_type openmp
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Wed Oct 2 16:35:54 2013
Configure options: --with-cc="win32fe icl" --with-cxx="win32fe icl" --with-fc="win32fe ifort" --with-blas-lapack-dir=/cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64 --with-mpi-include=/cygdrive/c/MSMPI/Inc -with-mpi-lib="[/cygdrive/C/MSMPI/Lib/amd64/msmpi.lib,/cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib]" --with-openmp --with-shared-libraries --with-debugging=no --useThreads=0
-----------------------------------------
Libraries compiled on Wed Oct 2 16:35:54 2013 on NB-TT-113812
Machine characteristics: CYGWIN_NT-6.1-WOW64-1.7.25-0.270-5-3-i686-32bit
Using PETSc directory: /cygdrive/d/WorkDir/petsc-3.4.2
Using PETSc arch: arch-mswin-c-opt
-----------------------------------------
Using C compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl -MT -O3 -QxW -Qopenmp ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort -MT -O3 -QxW -fpp -Qopenmp ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/c/MSMPI/Inc
-----------------------------------------
Using C linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl
Using Fortran linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort
Using libraries: -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -lpetsc /cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64/mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib /cygdrive/C/MSMPI/Lib/amd64/msmpi.lib /cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib
-----------------------------------------
-------------- next part --------------
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
Petsc-windows-ex2f.exe on a arch-mswin-c-opt named STARGAZER2012 with 1 processor, by danyang Mon Nov 04 10:04:07 2013
With 8 threads per MPI_Comm
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.717e+002 1.00000 1.717e+002
Objects: 4.500e+001 1.00000 4.500e+001
Flops: 2.203e+011 1.00000 2.203e+011 2.203e+011
Flops/sec: 1.283e+009 1.00000 1.283e+009 1.283e+009
MPI Messages: 0.000e+000 0.00000 0.000e+000 0.000e+000
MPI Message Lengths: 0.000e+000 0.00000 0.000e+000 0.000e+000
MPI Reductions: 5.236e+003 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 1.7167e+002 100.0% 2.2028e+011 100.0% 0.000e+000 0.0% 0.000e+000 0.0% 5.235e+003 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %f - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 2657 1.0 3.5848e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 21 11 0 0 0 21 11 0 0 0 666
MatSolve 2657 1.0 4.9704e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11 0 0 0 29 11 0 0 0 481
MatLUFactorNum 1 1.0 1.0581e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 104
MatILUFactorSym 1 1.0 6.6883e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 1.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 5.6889e-007 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 5.9349e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.7067e-006 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 8.4725e-003 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 2.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecMDot 2571 1.0 3.4147e+001 1.0 7.95e+010 1.0 0.0e+000 0.0e+000 2.6e+003 20 36 0 0 49 20 36 0 0 49 2328
VecNorm 2658 1.0 1.1897e+000 1.0 5.32e+009 1.0 0.0e+000 0.0e+000 2.7e+003 1 2 0 0 51 1 2 0 0 51 4468
VecScale 2657 1.0 3.6916e+000 1.0 2.66e+009 1.0 0.0e+000 0.0e+000 0.0e+000 2 1 0 0 0 2 1 0 0 0 720
VecCopy 86 1.0 1.4144e-001 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecSet 88 1.0 9.3055e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 172 1.0 1.5342e-001 1.0 3.44e+008 1.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 2242
VecMAXPY 2657 1.0 4.5747e+001 1.0 8.47e+010 1.0 0.0e+000 0.0e+000 0.0e+000 27 38 0 0 0 27 38 0 0 0 1851
VecNormalize 2657 1.0 4.8846e+000 1.0 7.97e+009 1.0 0.0e+000 0.0e+000 2.7e+003 3 4 0 0 51 3 4 0 0 51 1632
KSPGMRESOrthog 2571 1.0 7.7191e+001 1.0 1.59e+011 1.0 0.0e+000 0.0e+000 2.6e+003 45 72 0 0 49 45 72 0 0 49 2060
KSPSetUp 1 1.0 1.7942e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.7101e+002 1.0 2.20e+011 1.0 0.0e+000 0.0e+000 5.2e+003100100 0 0100 100100 0 0100 1288
PCSetUp 1 1.0 1.8123e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 3.0e+000 0 0 0 0 0 0 0 0 0 0 61
PCApply 2657 1.0 4.9709e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11 0 0 0 29 11 0 0 0 481
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Matrix 2 2 151957324 0
Vector 37 37 296057128 0
Krylov Solver 1 1 18360 0
Preconditioner 1 1 976 0
Index Set 3 3 4002280 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 5.68889e-008
#PETSc Option Table entries:
-log_summary log_1000x1000_openmp_p8.log
-m 1000
-n 1000
-threadcomm_nthreads 8
-threadcomm_type openmp
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Wed Oct 2 16:35:54 2013
Configure options: --with-cc="win32fe icl" --with-cxx="win32fe icl" --with-fc="win32fe ifort" --with-blas-lapack-dir=/cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64 --with-mpi-include=/cygdrive/c/MSMPI/Inc -with-mpi-lib="[/cygdrive/C/MSMPI/Lib/amd64/msmpi.lib,/cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib]" --with-openmp --with-shared-libraries --with-debugging=no --useThreads=0
-----------------------------------------
Libraries compiled on Wed Oct 2 16:35:54 2013 on NB-TT-113812
Machine characteristics: CYGWIN_NT-6.1-WOW64-1.7.25-0.270-5-3-i686-32bit
Using PETSc directory: /cygdrive/d/WorkDir/petsc-3.4.2
Using PETSc arch: arch-mswin-c-opt
-----------------------------------------
Using C compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl -MT -O3 -QxW -Qopenmp ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort -MT -O3 -QxW -fpp -Qopenmp ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/c/MSMPI/Inc
-----------------------------------------
Using C linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl
Using Fortran linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort
Using libraries: -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -lpetsc /cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64/mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib /cygdrive/C/MSMPI/Lib/amd64/msmpi.lib /cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib
-----------------------------------------