[petsc-dev] Status of pthreads and OpenMP support

Nystrom, William D wdn at lanl.gov
Thu Oct 25 16:21:05 CDT 2012


Hi John,

I have also been trying to test out the pthreads and OpenMP support in petsc-dev.  I've attached
a gzipped tarball of some of my test results.  I've been running the ex2.c example test problem
located in the petsc-dev/src/ksp/ksp/examples/tutorials directory.  I've been testing on a machine
where each node has two 8-core Sandy Bridge Xeons.  I've been identifying and reporting some
issues.  For instance, if I use one of my builds of petsc-dev that includes several external packages,
then I get a really slow run when using "-threadcomm_nthreads 1 -threadcomm_type pthread".
However, it seems to run fine when setting "-threadcomm_nthreads" to values from 2 to 16.  If I
build a version of petsc-dev that does not use any external packages, this slowness problem seems
to go away.  However, it comes back if I then add the use of MKL to my petsc-dev build.  So
it looks like there might be an interaction problem when building petsc-dev to use MKL for
BLAS/LAPACK.
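
For reference, the single-node run that shows the problem looks roughly like this (the 1000x1000
grid matches what I pass to my script below, and -pc_type jacobi is what I use throughout; the
exact option list is from memory, so treat it as a sketch):

./ex2 -m 1000 -n 1000 -pc_type jacobi -threadcomm_type pthread -threadcomm_nthreads 1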

For my OpenMP runs, I had to set the GOMP_CPU_AFFINITY environment variable like so:

export GOMP_CPU_AFFINITY=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

Then I got OpenMP results that were nearly the same as my pthread results.  Note that for my
pthread runs I was also using the "-threadcomm_affinities" option.  I also ran a couple of cases
where I used 2 MPI processes, one per CPU socket.  That initially ran really slowly until I used
a script to let numactl control the thread affinities.  The name of that script is
runPetscProb_pthreads in my attached tarball.  To run that script, I used the following mpirun
command:

/usr/bin/time -p mpirun -np 2 -npernode 16 -mca pml cm ./runPetscProb_pthreads 1000 1000 >& ex2_pthread_1000_1000_16_np_2.log
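
To give an idea, here is a rough sketch of what a wrapper along the lines of runPetscProb_pthreads
can look like.  This is not the actual attached script; it assumes Open MPI (which exports
OMPI_COMM_WORLD_LOCAL_RANK for the node-local rank) and guesses 8 threads per rank for a
2-socket, 16-core node:

#!/bin/bash
# Bind this MPI rank, its threads, and its memory to one socket via numactl.
M=$1
N=$2
# With one rank per socket, the node-local rank (0 or 1) picks the socket.
SOCKET=$OMPI_COMM_WORLD_LOCAL_RANK
exec numactl --cpunodebind=$SOCKET --membind=$SOCKET \
  ./ex2 -m $M -n $N -pc_type jacobi \
  -threadcomm_type pthread -threadcomm_nthreads 8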

I am also only seeing a speedup of about 5x when comparing 16 threads to one thread.  I'm not
sure why.  On this same machine, I was seeing a speedup of 14x or more using OpenMP and
16 threads on a simple 2d explicit hydro code.  But maybe it is just the memory bandwidth
limitation you mentioned.

I also get about the same performance results on the ex2 problem when running it with
MPI alone, i.e. with 16 MPI processes.

So from my perspective, the new pthreads/OpenMP support is looking pretty good, assuming
the issue with the MKL/external packages interaction can be fixed.

I was just using Jacobi preconditioning for ex2.  I'm wondering if there are any other preconditioners
that might be multi-threaded, or whether a polynomial preconditioner could work well with the
new pthreads/OpenMP support.

I'd be interested in hearing about your further experiences as you explore this more.

Dave

BTW, the results I shared in my tarball are for my stripped-down build of petsc-dev that does not
use any external packages other than MPI.

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545


________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
Sent: Thursday, October 25, 2012 11:53 AM
To: For users of the development version of PETSc
Subject: [petsc-dev] Status of pthreads and OpenMP support

I'm curious about the status of an SMP or hybrid SMP/DMP PETSc
library.  What is the tentative timeline?  Will there be functional
support for threads in the next release?

I built petsc-dev with pthreads and OpenMP enabled via --with-openmp=1
--with-pthreadclasses=1 and added PETSC_THREADCOMM_ACTIVE to
$PETSC_ARCH/include/petscconf.h.  My machine is a dual Westmere
system with two 6-core CPUs.  Then I ran the example given in the
installation documentation:

mpirun -np $i ./ex19 -threadcomm_type {openmp,pthread}
-threadcomm_nthreads $j -pc_type none -da_grid_x 100 -da_grid_y 100
-log_summary -mat_no_inode -preload off
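
For completeness, the build behind this was roughly the following (other configure options are
elided, and the exact spelling of the petscconf.h line is from memory, so treat it as a sketch):

./configure --with-openmp=1 --with-pthreadclasses=1

followed by adding, by hand,

#define PETSC_THREADCOMM_ACTIVE 1

to $PETSC_ARCH/include/petscconf.h.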

I've attached the log_summary output using OpenMP, with np=1,2 and
nthreads=1,2,4,6.  With OpenMP, the speedup is 5.358e+00/9.138e-01 =
5.9 going from 1 process with 1 thread to 2 processes with 6 threads each.

With pthreads, something is clearly not working as designed, as the time
for two threads is 44x slower than the serial time.  I've attached the
log summary for 1 to 6 threads.

With non-threaded PETSc, I typically see ~50% parallel efficiency on
all cores for CFD problems.  Is it wrong for me to hope that a
threaded version can improve this? Or should I be satisfied that I
seem to be achieving about (memory bandwidth)/6 total performance out
of each socket?

Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc_dev_thread_support_tests.tar.gz
Type: application/x-gzip
Size: 35102 bytes
Desc: petsc_dev_thread_support_tests.tar.gz
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121025/7867a6c5/attachment.gz>

