[petsc-dev] OpenMP in PETSc when calling from Fortran?

Barry Smith bsmith at mcs.anl.gov
Wed Mar 6 15:38:22 CST 2013


   I don't see any options for turning on the threads here:

  #PETSc Option Table entries:
-ksp_type bcgs
-log_summary
-pc_type lu
#End of PETSc Option Table entries

  From http://www.mcs.anl.gov/petsc/features/threads.html

	• The three important run-time options for using threads are:
		• -threadcomm_nthreads <nthreads>: Sets the number of threads
		• -threadcomm_affinities <list_of_affinities>: Sets the core affinities of threads
		• -threadcomm_type <nothread,pthread,openmp>: Threading model (OpenMP, pthread, nothread)
	• Run with -help to see the available options with threads.
	• A few tutorial examples are located at $PETSC_DIR/src/sys/threadcomm/examples/tutorials

  Also, LU is a direct solver that is not threaded, so using threads for this exact run will not help (much) at all. The threads will only show useful speedup for iterative methods; see the sketch below.
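
  A minimal sketch of what such a run might look like on the command line (the thread count is only illustrative, ./run is the executable name taken from your log below, and jacobi simply stands in for a preconditioner whose kernels are threaded vector/matrix operations rather than a direct factorization):

  ./run -ksp_type bcgs -pc_type jacobi -log_summary -threadcomm_type openmp -threadcomm_nthreads 4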

   Barry

  As time goes by we hope to have more extensive support for threads in more routines, but things like factorization and solve are difficult, so outside help would be very useful.

On Mar 6, 2013, at 3:39 AM, Åsmund Ervik <Asmund.Ervik at sintef.no> wrote:

> Hi again,
> 
> On 1 March 2013 at 20:06, Jed Brown wrote:
>> 
>> Matrix and vector operations are probably running in parallel, but probably
>> not the operations that are taking time. Always send -log_summary if you
>> have a performance question.
>> 
> 
> I don't think they are running in parallel. When I analyze my code in
> Intel VTune Amplifier, the only routines running in parallel are my own
> OpenMP ones. Indeed, if I comment out my OpenMP pragmas and recompile my
> code, it never uses more than one thread.
> 
> -log_summary is shown below; this is using -pc_type lu -ksp_type bcgs.
> The fastest PC for my cases is usually BoomerAMG from HYPRE, but I used
> LU here in order to limit the test to PETSc only. The summary
> agrees with VTune that MatLUFactorNumeric is the most time-consuming
> routine; in general it seems the PC is always the most time-consuming part.
> 
> Any advice on how to get OpenMP working?
> 
> Regards,
> Åsmund
> 
> 
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./run on a arch-linux2-c-opt named vsl161 with 1 processor, by asmunder Wed Mar  6 10:14:55 2013
> Using Petsc Development HG revision: 58cc6199509f1642f637843f1ca468283bf5ced9  HG Date: Wed Jan 30 00:39:35 2013 -0600
> 
>                         Max       Max/Min        Avg      Total
> Time (sec):           4.446e+02      1.00000   4.446e+02
> Objects:              2.017e+03      1.00000   2.017e+03
> Flops:                3.919e+11      1.00000   3.919e+11  3.919e+11
> Flops/sec:            8.815e+08      1.00000   8.815e+08  8.815e+08
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       2.818e+03      1.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                            e.g., VecAXPY() for real vectors of length N --> 2N flops
>                            and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> 0:      Main Stage: 4.4460e+02 100.0%  3.9191e+11 100.0%  0.000e+00  0.0%  0.000e+00        0.0%  2.817e+03 100.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>   Count: number of times phase was executed
>   Time and Flops: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>   Mess: number of messages sent
>   Avg. len: average message length (bytes)
>   Reduct: number of global reductions
>   Global: entire computation
>   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>      %T - percent time in this phase         %f - percent flops in this phase
>      %M - percent messages in this phase     %L - percent message lengths in this phase
>      %R - percent reductions in this phase
>   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops         --- Global ---  --- Stage ---   Total
>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> VecDot               802 1.0 9.2811e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2117
> VecDotNorm2          401 1.0 7.1333e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 4.0e+02  0  0  0  0 14   0  0  0  0 14  2755
> VecNorm             1203 1.0 7.8265e-02 1.0 2.95e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3766
> VecCopy              802 1.0 1.1754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              1211 1.0 9.9961e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              401 1.0 4.5847e-02 1.0 9.82e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2143
> VecAXPBYCZ           802 1.0 1.3489e-01 1.0 3.93e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2913
> VecWAXPY             802 1.0 1.2292e-01 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1599
> VecAssemblyBegin     802 1.0 2.4509e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd       802 1.0 6.7234e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult             1203 1.0 1.1513e+00 1.0 1.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1149
> MatSolve            1604 1.0 1.4714e+01 1.0 2.07e+10 1.0 0.0e+00 0.0e+00 0.0e+00  3  5  0  0  0   3  5  0  0  0  1405
> MatLUFactorSym       401 1.0 4.0197e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+03  9  0  0  0 43   9  0  0  0 43     0
> MatLUFactorNum       401 1.0 2.3728e+02 1.0 3.69e+11 1.0 0.0e+00 0.0e+00 0.0e+00 53 94  0  0  0  53 94  0  0  0  1553
> MatAssemblyBegin     401 1.0 1.7977e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd       401 1.0 3.1975e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ          401 1.0 9.1545e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering       401 1.0 2.0361e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+02  5  0  0  0 28   5  0  0  0 28     0
> KSPSetUp             401 1.0 4.1821e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve             401 1.0 3.1511e+02 1.0 3.92e+11 1.0 0.0e+00 0.0e+00 2.8e+03 71100  0  0100  71100  0  0100  1244
> PCSetUp              401 1.0 2.9844e+02 1.0 3.69e+11 1.0 0.0e+00 0.0e+00 2.0e+03 67 94  0  0 71  67 94  0  0 71  1235
> PCApply             1604 1.0 1.4717e+01 1.0 2.07e+10 1.0 0.0e+00 0.0e+00 0.0e+00  3  5  0  0  0   3  5  0  0  0  1405
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
>              Vector   409            409    401422048     0
>              Matrix   402            402  31321054412     0
>       Krylov Solver     1              1         1128     0
>      Preconditioner     1              1         1152     0
>           Index Set  1203           1203    393903904     0
>              Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> #PETSc Option Table entries:
> -ksp_type bcgs
> -log_summary
> -pc_type lu
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure run at: Fri Mar  1 12:53:06 2013
> Configure options: --with-pthreadclasses --with-openmp --with-debugging=0 --with-shared-libraries=1 --download-mpich --download-hypre --with-boost-dir=/usr COPTFLAGS=-O3 FOPTFLAGS=-O3
> -----------------------------------------
> Libraries compiled on Fri Mar  1 12:53:06 2013 on vsl161
> Machine characteristics: Linux-3.7.9-1-ARCH-x86_64-with-glibc2.2.5
> Using PETSc directory: /opt/petsc/petsc-dev-install
> Using PETSc arch: arch-linux2-c-opt
> -----------------------------------------
> 
> Using C compiler: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpicc  -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O3 -fopenmp ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpif90  -fPIC  -Wall -Wno-unused-variable -Wno-unused-dummy-argument -O3 -fopenmp ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/opt/petsc/petsc-dev-install/arch-linux2-c-opt/include -I/opt/petsc/petsc-dev-install/include -I/opt/petsc/petsc-dev-install/include -I/opt/petsc/petsc-dev-install/arch-linux2-c-opt/include -I/usr/include
> -----------------------------------------
> 
> Using C linker: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpicc
> Using Fortran linker: /opt/petsc/petsc-dev-install/arch-linux2-c-opt/bin/mpif90
> Using libraries:
> -Wl,-rpath,/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib
> -L/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -lpetsc
> -Wl,-rpath,/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib
> -L/opt/petsc/petsc-dev-install/arch-linux2-c-opt/lib -lHYPRE
> -Wl,-rpath,/usr/lib/gcc/x86_64-unknown-linux-gnu/4.7.2
> -L/usr/lib/gcc/x86_64-unknown-linux-gnu/4.7.2
> -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64
> -L/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64
> -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/ipp/lib/intel64
> -L/opt/intel/composer_xe_2013.1.117/ipp/lib/intel64
> -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64
> -L/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64
> -Wl,-rpath,/opt/intel/composer_xe_2013.1.117/tbb/lib/intel64
> -L/opt/intel/composer_xe_2013.1.117/tbb/lib/intel64 -lmpichcxx -lstdc++
> -llapack -lblas -lX11 -lpthread -lm -lmpichf90 -lgfortran -lm -lgfortran
> -lm -lquadmath -lm -lmpichcxx -lstdc++ -ldl -lmpich -lopa -lmpl -lrt
> -lgcc_s -ldl
> -----------------------------------------
> 