[petsc-users] Using OpenMP threads with PETSc

Lucas Clemente Vella lvella at gmail.com
Thu Apr 9 16:30:51 CDT 2015


As I understand from reading a paper by Weiland, Mitchell, Parsons,
Gorman and Kramer, entitled "Mixed-mode implementation of PETSc for
scalable linear algebra on multi-core processors", they have
implemented thread-level parallelism inside the Mat and Vec
operations of PETSc, and it is available in the master branch of the
PETSc development version.

I have a small benchmark program that loads a matrix into PETSc and
solves it with MPI. I was hoping to use PETSc's OpenMP threaded
parallelism easily, preferably without changing how my program uses
PETSc, but I was unable to do so.
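
For reference, the core of that benchmark is roughly the following
sketch (simplified: error checking is omitted and "system.bin" is
just a placeholder for a matrix and right-hand side written with a
PETSc binary viewer):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         b, x;
  KSP         ksp;
  PetscViewer viewer;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Load the matrix and right-hand side from a PETSc binary file */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system.bin", FILE_MODE_READ, &viewer);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatLoad(A, viewer);
  VecCreate(PETSC_COMM_WORLD, &b);
  VecLoad(b, viewer);
  PetscViewerDestroy(&viewer);

  /* Solve A x = b; KSP and PC types come from the command-line options */
  VecDuplicate(b, &x);
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  MatDestroy(&A);
  VecDestroy(&b);
  VecDestroy(&x);
  PetscFinalize();
  return 0;
}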

I have downloaded and built the master branch from Git with OpenMP
enabled, configured like this (and it was a pain to get configure to
complete with the right libraries, compilers and flags):
./configure --PETSC_DIR=/home/lvella/src/petsc --with-clanguage=C
--FFLAGS="-Ofast -march=ivybridge -mmmx -mno-3dnow -msse -msse2 -msse3
-mssse3 -mno-sse4a -mcx16 -msahf -mno-movbe -maes -mno-sha -mpclmul
-mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi
-mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt
-mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw
-mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32
--param l1-cache-line-size=64 --param l2-cache-size=25600
-mtune=ivybridge -I/opt/sw/openmpi/1.8.4-gcc4.9.1/include/"
--CFLAGS="-Ofast -march=ivybridge -mmmx -mno-3dnow -msse -msse2 -msse3
-mssse3 -mno-sse4a -mcx16 -msahf -mno-movbe -maes -mno-sha -mpclmul
-mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi
-mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt
-mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw
-mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32
--param l1-cache-line-size=64 --param l2-cache-size=25600
-mtune=ivybridge" --CXXFLAGS="-Ofast -march=ivybridge -mmmx -mno-3dnow
-msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mno-movbe -maes
-mno-sha -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4
-mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1
-mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param
l1-cache-size=32 --param l1-cache-line-size=64 --param
l2-cache-size=25600 -mtune=ivybridge" --with-etags=0
--with-visibility=1 --with-hypre=1
--with-hypre-include=/home/lvella/src/hypre-2.10.0b/src/hypre/include/
--with-hypre-lib=/home/lvella/src/hypre-2.10.0b/src/hypre/lib/libHYPRE.a
--with-openmp=1 --with-pthread=1 --download-fblaslapack
-with-cc=/opt/sw/gcc/4.9/bin/gcc-4.9 -with-cxx=0
-with-fc=/opt/sw/gcc/4.9/bin/gfortran-4.9 --with-mpi-compilers=0
--known-mpi-shared-libraries=1 --with-mpiuni-fortran-binding=0
--FC_LINKER_FLAGS=-lmpi_mpifh --with-debugging=0
--with-pthreadclasses=1 --with-threadcomm=1

It is hard to spot in there, but all of the relevant flags are present:
--with-openmp=1 --with-pthread=1 --with-pthreadclasses=1 --with-threadcomm=1

For the latter two, I was not entirely sure why they were needed, but
I could not use pthreads without them.

Now, to run my program, I made several attempts (with just two cores,
to check that everything was working). The first one:

export OMP_NUM_THREADS=2
./solve-only -ksp_type bcgs -pc_type asm -pc_asm_overlap 12
-pc_asm_type restrict -sub_pc_type ilu -sub_pc_factor_levels 1
-sub_ksp_type preonly

With this first attempt, the program runs, but there is no indication
that any threads are being used. The threads are there, but CPU usage
in top never goes above 100%.
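
As a sanity check on the environment only (it says nothing about
whether PETSc itself uses the threads), I can query the OpenMP
runtime from inside the program, assuming it is compiled with
-fopenmp; a minimal sketch:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static void report_openmp_threads(void)
{
#ifdef _OPENMP
  /* Reflects OMP_NUM_THREADS (or the runtime default) */
  printf("OpenMP max threads = %d\n", omp_get_max_threads());
#else
  printf("Compiled without OpenMP support\n");
#endif
}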

The second attempt was:
export OMP_NUM_THREADS=2
./solve-only -ksp_type bcgs -pc_type asm -pc_asm_overlap 12
-pc_asm_type restrict -sub_pc_type ilu -sub_pc_factor_levels 1
-sub_ksp_type preonly -threadcomm_nthreads $OMP_NUM_THREADS

Now, the program is killed with the message:

[0]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[0]PETSC ERROR: Invalid argument
[0]PETSC ERROR: Cannot have more than 1 thread for the nonthread
communicator,threads requested = 2
[0]PETSC ERROR: See
http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2611-g88a78ad
GIT Date: 2015-04-08 11:48:47 -0500
[0]PETSC ERROR: ../../solve-only on a arch-linux2-c-opt named r1i0n0
by lvella Thu Apr  9 18:13:18 2015
[0]PETSC ERROR: Configure options [...]
[0]PETSC ERROR: #1 PetscThreadCommCreate_NoThread() line 10 in
/home/lvella/src/petsc/src/sys/threadcomm/impls/nothread/nothread.c
[0]PETSC ERROR: #2 PetscThreadCommSetType() line 507 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #3 PetscThreadCommWorldInitialize() line 1242 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #4 PetscGetThreadCommWorld() line 82 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #5 PetscCommGetThreadComm() line 117 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #6 PetscCommDuplicate() line 195 in
/home/lvella/src/petsc/src/sys/objects/tagm.c
[0]PETSC ERROR: #7 PetscHeaderCreate_Private() line 60 in
/home/lvella/src/petsc/src/sys/objects/inherit.c
[0]PETSC ERROR: #8 MatCreate() line 84 in
/home/lvella/src/petsc/src/mat/utils/gcreate.c
[0]PETSC ERROR:
------------------------------------------------------------------------
[...]

For the next attempt, I added the option "-threadcomm_type openmp",
as suggested on the page http://www.mcs.anl.gov/petsc/features/threads.html...

export OMP_NUM_THREADS=2
./solve-only -ksp_type bcgs -pc_type asm -pc_asm_overlap 12
-pc_asm_type restrict -sub_pc_type ilu -sub_pc_factor_levels 1
-sub_ksp_type preonly -threadcomm_nthreads $OMP_NUM_THREADS
-threadcomm_type openmp

But the program dies with this error:

[0]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[0]PETSC ERROR: Unknown type. Check for miss-spelling or missing
package: http://www.mcs.anl.gov/petsc/documentation/installation.html#external
[0]PETSC ERROR: Unable to find requested PetscThreadComm type openmp
[0]PETSC ERROR: See
http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2611-g88a78ad
GIT Date: 2015-04-08 11:48:47 -0500
[0]PETSC ERROR: ../../solve-only on a arch-linux2-c-opt named r1i0n0
by lvella Thu Apr  9 18:16:26 2015
[0]PETSC ERROR: Configure options [...]
[0]PETSC ERROR: #1 PetscThreadCommSetType() line 506 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #2 PetscThreadCommWorldInitialize() line 1242 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #3 PetscGetThreadCommWorld() line 82 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #4 PetscCommGetThreadComm() line 117 in
/home/lvella/src/petsc/src/sys/threadcomm/interface/threadcomm.c
[0]PETSC ERROR: #5 PetscCommDuplicate() line 195 in
/home/lvella/src/petsc/src/sys/objects/tagm.c
[0]PETSC ERROR: #6 PetscHeaderCreate_Private() line 60 in
/home/lvella/src/petsc/src/sys/objects/inherit.c
[0]PETSC ERROR: #7 MatCreate() line 84 in
/home/lvella/src/petsc/src/mat/utils/gcreate.c
[0]PETSC ERROR:
------------------------------------------------------------------------
[...]

Finally, I tried with pthread:

export OMP_NUM_THREADS=2
./solve-only -ksp_type bcgs -pc_type asm -pc_asm_overlap 12
-pc_asm_type restrict -sub_pc_type ilu -sub_pc_factor_levels 1
-sub_ksp_type preonly -threadcomm_nthreads $OMP_NUM_THREADS
-threadcomm_type pthread

Now the program runs, but the CPU usage is never higher than 133%,
which is much lower than expected (if I solve the same matrix with
Hypre + OpenMP, CPU usage peaks at 200% most of the time).

So, which threading model gives the best results in PETSc, OpenMP or
pthreads? And what do I need to do to use OpenMP? All my attempts so
far used a single MPI process, but I will need to work with MPI; is
it straightforward to spawn many MPI processes (i.e. just put mpirun
before the command)?

And one last question: how do I prevent PETSc from messing with CPU
affinities? I implement affinity control myself inside my program.
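
For reference, the kind of affinity control I mean is roughly the
following (a minimal sketch using Linux sched_setaffinity(); the core
number is just an example, not my actual placement policy):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single core; returns 0 on success */
static int pin_to_core(int core)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return sched_setaffinity(0, sizeof(set), &set);
}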

-- 
Lucas Clemente Vella
lvella at gmail.com

