[petsc-dev] OpenMP/Vec

Gerard Gorman g.gorman at imperial.ac.uk
Sun Feb 26 04:07:08 CST 2012


Hi Jed

Many thanks for your feedback.

Jed Brown emailed the following on 24/02/12 04:06:
> On Thu, Feb 16, 2012 at 10:33, Gerard Gorman
> <g.gorman at imperial.ac.uk> wrote:
>
>     2. For GCC I did not use a threaded BLAS. For Intel I used
>     -lmkl_intel_thread. However, it appears dnrm2 is not threaded. It
>     seems to be a common feature among other threaded BLAS libraries
>     that Level 1 is not completely threaded (e.g. Cray). Unfortunately
>     most of this is experience/anecdotal information; I do not know of
>     any proper survey. We have the option here of either rolling our
>     own or ignoring the issue until profiling shows it is a
>     problem...and eventually someone else will release a fully
>     threaded BLAS.
>
>
> Even much of BLAS2 is not threaded in MKL. Note that opening a
> parallel region is actually quite expensive, so (perhaps ironically),
> MPI is expected to perform better than threading when parallel regions
> involve relatively little work. In the case of BLAS-1, only quite
> large sizes could possibly pay for spawning a parallel region.

I agree that different data sizes might require different approaches.
One might consider this as part of an autotuning framework for PETSc.

The cost of spawning threads is generally minimised through the thread
pools that OpenMP implementations typically use - i.e. you only pay a
one-time cost for forking and joining threads. However, even with a
pool there are still some overheads (e.g. scheduling chunks) which will
affect you for small data sizes. I have not measured this myself
(appending it to the todo list) but it is frequently discussed, e.g.
http://software.intel.com/en-us/articles/performance-obstacles-for-threading-how-do-they-affect-openmp-code/
http://www2.fz-juelich.de/jsc/datapool/scalasca/scalasca_patterns-1.3.html
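
For what it's worth, the crossover point is easy to measure. Below is a
minimal sketch (mine, not from the branch; all names illustrative) that
times the same axpy loop serially and inside an OpenMP parallel region
across vector lengths - on most machines the parallel version only wins
once n is well into the tens of thousands. Compile with e.g.
gcc -O2 -fopenmp.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void axpy_serial(int n, double a, const double *x, double *y)
{
  for (int i = 0; i < n; i++) y[i] += a * x[i];
}

static void axpy_omp(int n, double a, const double *x, double *y)
{
  /* Each call re-enters a parallel region; with a pool the threads are
     not re-created, but chunk scheduling and the barrier still cost. */
#pragma omp parallel for schedule(static)
  for (int i = 0; i < n; i++) y[i] += a * x[i];
}

int main(void)
{
  for (int n = 1000; n <= 10000000; n *= 10) {
    int reps = 100000000 / n; /* keep total work roughly constant */
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) axpy_serial(n, 0.5, x, y);
    double ts = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++) axpy_omp(n, 0.5, x, y);
    double tp = omp_get_wtime() - t0;

    printf("n=%8d  serial %.3e s  omp %.3e s  ratio %.2f  (check %g)\n",
           n, ts, tp, ts / tp, y[0]);
    free(x); free(y);
  }
  return 0;
}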

>
> This is why I think the long-term solutions for threading involve
> long-lived threads with mostly-private memory that prefer redundant
> computation so that you only have to pay for synchronization instead
> of also having to pay dearly to use interfaces. Unfortunately, I think
> the current threaded programming models are challenging to use in this
> way and it imposes some extra complexity on users.

I think you mean thread pools, as used by OpenMP. The same thing is
done for pthreads (e.g.
http://www.hlnum.org/english/projects/tools/threadpool/doc.html) and others.
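
To illustrate what the style you describe looks like on top of OpenMP's
pool, here is a minimal sketch (solve_threaded and nphases are my
made-up names, purely illustrative): open a single long-lived parallel
region, have each thread compute its own slice bounds redundantly
rather than receiving them through a shared interface, and pay only a
barrier per phase instead of a fork/join per kernel.

#include <omp.h>

void solve_threaded(int n, double *x, int nphases)
{
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int nt  = omp_get_num_threads();
    /* Redundant computation: every thread derives its own bounds. */
    int lo = (int)((long long)n * tid / nt);
    int hi = (int)((long long)n * (tid + 1) / nt);

    for (int phase = 0; phase < nphases; phase++) {
      for (int i = lo; i < hi; i++) x[i] *= 2.0; /* stand-in for work */
#pragma omp barrier /* synchronize once per phase, no region re-spawn */
    }
  }
}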


>     3. Comparing intel/ and intel-pinning/ is particularly interesting.
>     "First touch" has been applied to all memory in VecCreate so that
>     memory should be paged correctly for NUMA. But first touch does
>     not gain you much if threads migrate, so for the intel-pinning/
>     results I set the env KMP_AFFINITY=scatter to get hard affinity.
>     You can clearly see from the results that this improves parallel
>     efficiency by a few percentage points in many cases. It also
>     really smooths out efficiency dips as you run on different
>     numbers of threads.
>
>
> Are you choosing sizes so that thread partitions always fall on a page
> boundary, or are some pages cut irregularly?

We are using static schedules, so the chunk size =
array_length/nthreads. Therefore we can have bad page/thread locality
at the start of the array (malloc may have returned a pointer into the
middle of a page that has already been faulted, and not necessarily on
the same memory node as thread 0), and wherever chunk boundaries don't
align with page boundaries, so that successive thread ids share a page
across different memory nodes. I've attached a figure to fill in
deficiencies in my explanation - it is based on an Intel Westmere with
two sockets (and two memory nodes), 6 cores per socket, an array of
10000 doubles, and a page size of 4096 bytes.
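
For concreteness, a small sketch (mine, not from the branch) that
reproduces the figure's scenario - 10000 doubles, 12 threads, 4096-byte
pages, the usual static split - and prints which chunks start mid-page
and therefore share a page with the previous thread:

#include <stdio.h>

int main(void)
{
  const long n = 10000, nthreads = 12;      /* 2 sockets x 6 cores */
  const long pagesize = 4096, elem = sizeof(double);

  for (long t = 0; t < nthreads; t++) {
    long lo = n * t / nthreads;             /* static-schedule split */
    long hi = n * (t + 1) / nthreads;
    long byte_lo = lo * elem;
    printf("thread %2ld: elems [%4ld,%4ld) page %2ld%s\n", t, lo, hi,
           byte_lo / pagesize,
           byte_lo % pagesize ? "  starts mid-page (shared with t-1)" : "");
  }
  return 0;
}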

You can control the page fault at the start of the array by replacing
malloc with posix_memalign, with the alignment set to the page size.
For the pages that straddle chunks allocated to threads on different
sockets...you'd have to pad the arrays with gaps or something similar
to resolve this. I would do the first of these because it's easy. I
don't know an easy way to implement the second, so I'd be inclined to
ignore that inefficiency unless profiling indicates it cannot be
ignored.
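
The first fix might look something like this - a sketch only
(vec_alloc_numa is a made-up name, not the branch's API): allocate
page-aligned with posix_memalign, then first-touch with the same static
schedule the compute kernels will use, so each page (bar those
straddling chunk boundaries) faults on the node of the thread that will
mostly use it.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>

double *vec_alloc_numa(size_t n)
{
  double *v = NULL;
  size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

  /* Page-aligned allocation: the array starts on a fresh page, so
     thread 0's chunk cannot begin on a page faulted elsewhere. */
  if (posix_memalign((void **)&v, pagesize, n * sizeof(*v))) return NULL;

  /* First touch with the schedule the kernels will later use. */
#pragma omp parallel for schedule(static)
  for (size_t i = 0; i < n; i++) v[i] = 0.0;
  return v;
}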



>     Full-blown benchmarks would not make a lot of sense until we get
>     the Mat classes threaded in a similar fashion. However, at this
>     point I would like feedback on the direction this is taking and
>     whether we can start getting code committed.
>
>
> Did you post a repository yet? I'd like to have a look at the code.

It's on Barry's favourite collaborative software development site of
course ;-)

https://bitbucket.org/wence/petsc-dev-omp/overview

Cheers
Gerard

-------------- next part --------------
A non-text attachment was scrubbed...
Name: array_page_threads.pdf
Type: application/pdf
Size: 19485 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120226/8ebad503/attachment.pdf>

