[petsc-users] [petsc-maint] Speedup problem when using OpenMP?

Danyang Su danyang.su at gmail.com
Mon Nov 4 13:42:45 CST 2013


Hi Karli,

This does not make any difference. I have scaled up the matrix, but the
performance does not change. When I run with OpenMP, the iteration count
is always the same no matter how many threads are used. This seems quite
strange, because the iteration count usually increases with the number of
processors when running with MPI. I think I should move to the Ubuntu
system for further tests, to see whether this is a Windows problem.
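
Before switching systems, one quick way to rule out a Windows-side OpenMP
issue is to check that the runtime actually spawns the requested number of
threads. The following standalone C program is only a minimal sanity-check
sketch (it is not part of ex2f or PETSc); the printed counts should match
the value passed to -threadcomm_nthreads (or OMP_NUM_THREADS):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* What the OpenMP runtime reports as available */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    /* How many threads actually execute a parallel region */
    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in parallel region = %d\n", omp_get_num_threads());
    }
    return 0;
}

Compiled with, e.g., "win32fe icl -Qopenmp" on Windows or "gcc -fopenmp"
elsewhere, it should report the expected thread count; if it reports 1, the
problem is in the OpenMP setup rather than in PETSc.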

Thanks,

Danyang

On 04/11/2013 6:51 AM, Karl Rupp wrote:
> Hi,
>
>> I have a question on the speedup of PETSc when using OpenMP. I can get
>> good speedup when using MPI, but no speedup when using OpenMP.
>> The example is ex2f with m=100 and n=100. The number of available
>> processors is 16 (32 threads) and the OS is Windows Server 2012. The log
>> files for 4 and 8 processors are attached.
>>
>> The commands I used to run with 4 processors are as follows:
>> Run using MPI
>> mpiexec -n 4 Petsc-windows-ex2f.exe -m 100 -n 100 -log_summary
>> log_100x100_mpi_p4.log
>>
>> Run using OpenMP
>> Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m
>> 100 -n 100 -log_summary log_100x100_openmp_p4.log
>>
>> The PETSc used for this test is PETSc for Windows
>> http://www.mic-tc.ch/downloads/PETScForWindows.zip, but I guess this is
>> not the problem, because the same problem exists when I use PETSc-dev in
>> Cygwin. I don't know whether this problem exists in Linux; would anybody
>> help to test?
>
> For the 100x100 case considered, the execution times per call are
> somewhere in the millisecond to sub-millisecond range (e.g. 1.3 ms for
> 68 calls to VecScale with 4 processors). I'd say this is too small to
> see any reasonable performance gain when running multiple threads;
> consider problem sizes of about 1000x1000 instead.
>
> Moreover, keep in mind that typically you won't get a perfectly linear 
> scaling with the number of processor cores, because ultimately the 
> memory bandwidth is the limiting factor for standard vector operations.
>
> Best regards,
> Karli
>
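
A rough back-of-the-envelope check on this bandwidth argument: VecAXPY on
vectors of length N streams about 3*8*N bytes while performing 2N flops,
i.e. roughly 1/12 flop per byte, so a node with, say, 50 GB/s of memory
bandwidth (an assumed figure, not a measurement from the logs below) tops
out near 4 GFlop/s for that operation no matter how many threads run. The
STREAM-style triad loop below is a minimal sketch for estimating the
bandwidth actually reachable on the test machine; the array size and the
use of omp_get_wtime() are assumptions, not anything taken from ex2f:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000   /* 3 arrays x 160 MB, large enough to defeat the caches */

int main(void)
{
    double *a = malloc((size_t)N * sizeof(double));
    double *b = malloc((size_t)N * sizeof(double));
    double *c = malloc((size_t)N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* Parallel initialization so first-touch places pages near the threads */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* triad: 2 flops, ~24 bytes per element */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1.0e9;   /* bytes streamed, in GB */
    printf("threads = %d, triad bandwidth ~ %.1f GB/s\n",
           omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}

If the measured bandwidth stops improving after a few threads, the flat
Mflop/s numbers for the vector kernels in the attached logs are what one
would expect.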

-------------- next part --------------
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

Petsc-windows-ex2f.exe on a arch-mswin-c-opt named STARGAZER2012 with 1 processor, by danyang Mon Nov 04 09:56:14 2013
With 4 threads per MPI_Comm
Using Petsc Release Version 3.4.2, Jul, 02, 2013 

                         Max       Max/Min        Avg      Total 
Time (sec):           1.851e+002      1.00000   1.851e+002
Objects:              4.500e+001      1.00000   4.500e+001
Flops:                2.203e+011      1.00000   2.203e+011  2.203e+011
Flops/sec:            1.190e+009      1.00000   1.190e+009  1.190e+009
MPI Messages:         0.000e+000      0.00000   0.000e+000  0.000e+000
MPI Message Lengths:  0.000e+000      0.00000   0.000e+000  0.000e+000
MPI Reductions:       5.236e+003      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.8506e+002 100.0%  2.2028e+011 100.0%  0.000e+000   0.0%  0.000e+000        0.0%  5.235e+003 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %f - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatMult             2657 1.0 3.8941e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 21 11  0  0  0  21 11  0  0  0   614
MatSolve            2657 1.0 5.3882e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11  0  0  0  29 11  0  0  0   443
MatLUFactorNum         1 1.0 1.0694e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0   103
MatILUFactorSym        1 1.0 6.7413e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 1.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       1 1.0 5.6889e-007 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 6.2112e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 2.2756e-006 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 9.1984e-003 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 2.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecMDot             2571 1.0 3.6956e+001 1.0 7.95e+010 1.0 0.0e+000 0.0e+000 2.6e+003 20 36  0  0 49  20 36  0  0 49  2152
VecNorm             2658 1.0 9.5952e-001 1.0 5.32e+009 1.0 0.0e+000 0.0e+000 2.7e+003  1  2  0  0 51   1  2  0  0 51  5540
VecScale            2657 1.0 3.6170e+000 1.0 2.66e+009 1.0 0.0e+000 0.0e+000 0.0e+000  2  1  0  0  0   2  1  0  0  0   735
VecCopy               86 1.0 1.5124e-001 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecSet                88 1.0 9.7820e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              172 1.0 9.2293e-002 1.0 3.44e+008 1.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0  3727
VecMAXPY            2657 1.0 4.9379e+001 1.0 8.47e+010 1.0 0.0e+000 0.0e+000 0.0e+000 27 38  0  0  0  27 38  0  0  0  1714
VecNormalize        2657 1.0 4.5800e+000 1.0 7.97e+009 1.0 0.0e+000 0.0e+000 2.7e+003  2  4  0  0 51   2  4  0  0 51  1740
KSPGMRESOrthog      2571 1.0 8.3362e+001 1.0 1.59e+011 1.0 0.0e+000 0.0e+000 2.6e+003 45 72  0  0 49  45 72  0  0 49  1908
KSPSetUp               1 1.0 2.0327e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 1.8441e+002 1.0 2.20e+011 1.0 0.0e+000 0.0e+000 5.2e+003100100  0  0100 100100  0  0100  1194
PCSetUp                1 1.0 1.8362e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 3.0e+000  0  0  0  0  0   0  0  0  0  0    60
PCApply             2657 1.0 5.3887e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11  0  0  0  29 11  0  0  0   443
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2    151957324     0
              Vector    37             37    296057128     0
       Krylov Solver     1              1        18360     0
      Preconditioner     1              1          976     0
           Index Set     3              3      4002280     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 5.68889e-008
#PETSc Option Table entries:
-log_summary log_1000x1000_openmp_p4.log
-m 1000
-n 1000
-threadcomm_nthreads 4
-threadcomm_type openmp
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Wed Oct  2 16:35:54 2013
Configure options: --with-cc="win32fe icl" --with-cxx="win32fe icl" --with-fc="win32fe ifort" --with-blas-lapack-dir=/cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64 --with-mpi-include=/cygdrive/c/MSMPI/Inc -with-mpi-lib="[/cygdrive/C/MSMPI/Lib/amd64/msmpi.lib,/cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib]" --with-openmp --with-shared-libraries --with-debugging=no --useThreads=0
-----------------------------------------
Libraries compiled on Wed Oct  2 16:35:54 2013 on NB-TT-113812 
Machine characteristics: CYGWIN_NT-6.1-WOW64-1.7.25-0.270-5-3-i686-32bit
Using PETSc directory: /cygdrive/d/WorkDir/petsc-3.4.2
Using PETSc arch: arch-mswin-c-opt
-----------------------------------------

Using C compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl  -MT -O3 -QxW -Qopenmp  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort  -MT -O3 -QxW -fpp -Qopenmp  ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/c/MSMPI/Inc
-----------------------------------------

Using C linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl
Using Fortran linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort
Using libraries: -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -lpetsc /cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64/mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib /cygdrive/C/MSMPI/Lib/amd64/msmpi.lib /cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib 
-----------------------------------------

-------------- next part --------------
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

Petsc-windows-ex2f.exe on a arch-mswin-c-opt named STARGAZER2012 with 1 processor, by danyang Mon Nov 04 10:04:07 2013
With 8 threads per MPI_Comm
Using Petsc Release Version 3.4.2, Jul, 02, 2013 

                         Max       Max/Min        Avg      Total 
Time (sec):           1.717e+002      1.00000   1.717e+002
Objects:              4.500e+001      1.00000   4.500e+001
Flops:                2.203e+011      1.00000   2.203e+011  2.203e+011
Flops/sec:            1.283e+009      1.00000   1.283e+009  1.283e+009
MPI Messages:         0.000e+000      0.00000   0.000e+000  0.000e+000
MPI Message Lengths:  0.000e+000      0.00000   0.000e+000  0.000e+000
MPI Reductions:       5.236e+003      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.7167e+002 100.0%  2.2028e+011 100.0%  0.000e+000   0.0%  0.000e+000        0.0%  5.235e+003 100.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %f - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatMult             2657 1.0 3.5848e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 21 11  0  0  0  21 11  0  0  0   666
MatSolve            2657 1.0 4.9704e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11  0  0  0  29 11  0  0  0   481
MatLUFactorNum         1 1.0 1.0581e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0   104
MatILUFactorSym        1 1.0 6.6883e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 1.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       1 1.0 5.6889e-007 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 5.9349e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.7067e-006 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 8.4725e-003 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 2.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecMDot             2571 1.0 3.4147e+001 1.0 7.95e+010 1.0 0.0e+000 0.0e+000 2.6e+003 20 36  0  0 49  20 36  0  0 49  2328
VecNorm             2658 1.0 1.1897e+000 1.0 5.32e+009 1.0 0.0e+000 0.0e+000 2.7e+003  1  2  0  0 51   1  2  0  0 51  4468
VecScale            2657 1.0 3.6916e+000 1.0 2.66e+009 1.0 0.0e+000 0.0e+000 0.0e+000  2  1  0  0  0   2  1  0  0  0   720
VecCopy               86 1.0 1.4144e-001 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecSet                88 1.0 9.3055e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              172 1.0 1.5342e-001 1.0 3.44e+008 1.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0  2242
VecMAXPY            2657 1.0 4.5747e+001 1.0 8.47e+010 1.0 0.0e+000 0.0e+000 0.0e+000 27 38  0  0  0  27 38  0  0  0  1851
VecNormalize        2657 1.0 4.8846e+000 1.0 7.97e+009 1.0 0.0e+000 0.0e+000 2.7e+003  3  4  0  0 51   3  4  0  0 51  1632
KSPGMRESOrthog      2571 1.0 7.7191e+001 1.0 1.59e+011 1.0 0.0e+000 0.0e+000 2.6e+003 45 72  0  0 49  45 72  0  0 49  2060
KSPSetUp               1 1.0 1.7942e-002 1.0 0.00e+000 0.0 0.0e+000 0.0e+000 0.0e+000  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 1.7101e+002 1.0 2.20e+011 1.0 0.0e+000 0.0e+000 5.2e+003100100  0  0100 100100  0  0100  1288
PCSetUp                1 1.0 1.8123e-001 1.0 1.10e+007 1.0 0.0e+000 0.0e+000 3.0e+000  0  0  0  0  0   0  0  0  0  0    61
PCApply             2657 1.0 4.9709e+001 1.0 2.39e+010 1.0 0.0e+000 0.0e+000 0.0e+000 29 11  0  0  0  29 11  0  0  0   481
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     2              2    151957324     0
              Vector    37             37    296057128     0
       Krylov Solver     1              1        18360     0
      Preconditioner     1              1          976     0
           Index Set     3              3      4002280     0
              Viewer     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 5.68889e-008
#PETSc Option Table entries:
-log_summary log_1000x1000_openmp_p8.log
-m 1000
-n 1000
-threadcomm_nthreads 8
-threadcomm_type openmp
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure run at: Wed Oct  2 16:35:54 2013
Configure options: --with-cc="win32fe icl" --with-cxx="win32fe icl" --with-fc="win32fe ifort" --with-blas-lapack-dir=/cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64 --with-mpi-include=/cygdrive/c/MSMPI/Inc -with-mpi-lib="[/cygdrive/C/MSMPI/Lib/amd64/msmpi.lib,/cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib]" --with-openmp --with-shared-libraries --with-debugging=no --useThreads=0
-----------------------------------------
Libraries compiled on Wed Oct  2 16:35:54 2013 on NB-TT-113812 
Machine characteristics: CYGWIN_NT-6.1-WOW64-1.7.25-0.270-5-3-i686-32bit
Using PETSc directory: /cygdrive/d/WorkDir/petsc-3.4.2
Using PETSc arch: arch-mswin-c-opt
-----------------------------------------

Using C compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl  -MT -O3 -QxW -Qopenmp  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort  -MT -O3 -QxW -fpp -Qopenmp  ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/include -I/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/include -I/cygdrive/c/MSMPI/Inc
-----------------------------------------

Using C linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe icl
Using Fortran linker: /cygdrive/d/WorkDir/petsc-3.4.2/bin/win32fe/win32fe ifort
Using libraries: -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -L/cygdrive/d/WorkDir/petsc-3.4.2/arch-mswin-c-opt/lib -lpetsc /cygdrive/d/HardLinks/PETSc/Intel2013/mkl/lib/intel64/mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib /cygdrive/C/MSMPI/Lib/amd64/msmpi.lib /cygdrive/C/MSMPI/Lib/amd64/msmpifec.lib Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib 
-----------------------------------------


