[petsc-dev] Status of pthreads and OpenMP support
Nystrom, William D
wdn at lanl.gov
Thu Oct 25 16:21:05 CDT 2012
Hi John,
I have also been trying to test out the pthreads and OpenMP support in petsc-dev. I've attached
a gzipped tarball of some of my test results. I've been running the ex2.c example test problem
located in the petsc-dev/src/ksp/ksp/examples/tutorials directory. I've been testing on a machine
where each node has two 8-core Sandy Bridge Xeons. I've been identifying and reporting some
issues. For instance, if I use one of my builds of petsc-dev that builds several external packages,
then I get a really slow run when using "-threadcomm_nthreads 1 -threadcomm_type pthread".
However, it seems to run fine when "-threadcomm_nthreads" is set to values from 2 to 16. If I
build a version of petsc-dev that does not use any external packages, this slowness problem seems
to go away. However, it comes back if I then add MKL back into my petsc-dev build. So
it looks like there might be some interaction problem when petsc-dev is built to use MKL for
BLAS/LAPACK.
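For reference, a typical single-process run of this sort looked roughly like the following (the
-m/-n problem size and the exact option list here are illustrative, not copied verbatim from my
run scripts):
./ex2 -m 1000 -n 1000 -pc_type jacobi -threadcomm_type pthread -threadcomm_nthreads 1 -log_summary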
For my OpenMP runs, I had to set the GOMP_CPU_AFFINITY environment variable like so:
export GOMP_CPU_AFFINITY=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
With that, I got OpenMP results that were nearly the same as my pthread results. Also, note that
for my pthread results, I was also using the "-threadcomm_affinities" option. I also ran a couple
of cases where I used two MPI processes, one per CPU socket. Those initially ran really slowly until
I used a script to let numactl control the thread affinities. The name of that script is
runPetscProb_pthreads in my attached tarball. To run that script, I used the following mpirun
command:
/usr/bin/time -p mpirun -np 2 -npernode 16 -mca pml cm ./runPetscProb_pthreads 1000 1000 >& ex2_pthread_1000_1000_16_np_2.log
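The essential idea of that wrapper is just to bind each MPI rank, and the threads it spawns, to a
single socket with numactl. A minimal sketch of such a wrapper, assuming Open MPI (which exports
OMPI_COMM_WORLD_LOCAL_RANK), one rank per socket, and 8 threads per rank, would look something
like this (an illustration only, not the actual runPetscProb_pthreads script from the tarball):
#!/bin/bash
# Illustrative numactl wrapper: bind this MPI rank and its threads to one
# NUMA node (socket).  Assumes Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK,
# one rank per socket, and 8 threads per rank.
M=$1
N=$2
NODE=${OMPI_COMM_WORLD_LOCAL_RANK}
exec numactl --cpunodebind=$NODE --membind=$NODE \
  ./ex2 -m $M -n $N -pc_type jacobi \
  -threadcomm_type pthread -threadcomm_nthreads 8 -log_summary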
I am also only seeing a speedup of about 5x when comparing 16 threads to one thread. I'm not
sure why. On this same machine, I was seeing a speedup of 14x or more using OpenMP and
16 threads on a simple 2D explicit hydro code. But maybe it is just the memory bandwidth
limitation you mentioned.
I also get about the same performance results on the ex2 problem when running it with just
MPI alone, i.e. with 16 MPI processes.
So from my perspective, the new pthreads/OpenMP support is looking pretty good, assuming
the MKL/external-packages interaction issue can be fixed.
I was just using Jacobi preconditioning for ex2. I'm wondering if there are any other preconditioners
that might be multi-threaded. Or maybe a polynomial preconditioner could work well with the
new pthreads/OpenMP support.
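If nothing else, it is easy to experiment from the command line by swapping the preconditioner
at run time, e.g. with options like:
-pc_type sor
-pc_type bjacobi -sub_pc_type ilu
though I have no idea which, if any, of these currently take advantage of the new thread support.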
I'd be interested in hearing about your further experiences as you explore this more.
Dave
BTW, the results I shared in my tarball are for my stripped-down build of petsc-dev that does not
use any external packages other than MPI.
--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
Group HPC-5
Los Alamos National Laboratory
Los Alamos, NM 87545
________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
Sent: Thursday, October 25, 2012 11:53 AM
To: For users of the development version of PETSc
Subject: [petsc-dev] Status of pthreads and OpenMP support
I'm curious about the status of an SMP or hybrid SMP/DMP PETSc
library. What is the tentative timeline? Will there be functional
support for threads in the next release?
I built petsc-dev with pthreads and OpenMP enabled via --with-openmp=1
--with-pthreadclasses=1 and added PETSC_THREADCOMM_ACTIVE to
$PETSC_ARCH/include/petscconf.h. My machine is a dual Westmere
system with two 6-core CPUs. Then I ran the example given in the
installation documentation:
mpirun -np $i ./ex19 -threadcomm_type {openmp,pthread}
-threadcomm_nthreads $j -pc_type none -da_grid_x 100 -da_grid_y 100
-log_summary -mat_no_inode -preload off
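For reference, the thread-related pieces of that build boil down to the two configure options
already mentioned,
./configure --with-openmp=1 --with-pthreadclasses=1
(the rest of my configure line, compilers, MPI paths, etc., omitted here), plus a line of roughly
this form added by hand to $PETSC_ARCH/include/petscconf.h:
#define PETSC_THREADCOMM_ACTIVE 1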
I've attached the log_summary output using OpenMP, with np=1,2 and
nthreads=1,2,4,6. With OpenMP, the speedup is 5.358e+00/9.138e-01 =
5.9 going from 1 process with 1 thread to 2 processes with 6 threads each.
With pthreads, something is clearly not working as designed: the time
with two threads is 44x slower than the serial time. I've attached the
log summary for 1 to 6 threads.
With non-threaded PETSc, I typically see ~50% parallel efficiency on
all cores for CFD problems. Is it wrong for me to hope that a
threaded version can improve this? Or should I be satisfied that I
seem to be achieving about (memory bandwidth)/6 total performance out
of each socket?
Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc_dev_thread_support_tests.tar.gz
Type: application/x-gzip
Size: 35102 bytes
Desc: petsc_dev_thread_support_tests.tar.gz
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121025/7867a6c5/attachment.gz>