[petsc-dev] Status of pthreads and OpenMP support

Gerard Gorman g.gorman at imperial.ac.uk
Fri Oct 26 07:55:34 CDT 2012


Hi

We have continued to work on threading using OpenMP on a branch of
petsc-3.3 so we can look at the performance issues rather than getting
bogged down in implementation issues. As has been highlighted in
previous emails, the golden rule is to maintain data locality. On a
NUMA system this means two things for PETSc: first, use the first-touch
page-faulting strategy (done automatically in PETSc when matrices and
vectors are created); second, it is the user's responsibility to
specify a hard thread affinity so the Linux scheduler does not migrate
threads and thereby destroy the data locality. If you don't do this,
your effective memory bandwidth drops dramatically. With both in place,
the code is primarily limited by cache misses due to the indirect
addressing into the vector in sparse MatMult, for example. A minimal
sketch of first touch plus hard affinity is given below.
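To make that concrete, here is a rough OpenMP-only sketch of the two
steps outside PETSc. The 1:1 thread-to-core pinning and the array are
purely illustrative; in practice the pinning is often done through the
runtime's environment variables (e.g. GOMP_CPU_AFFINITY or
KMP_AFFINITY) and the placement should follow the machine topology:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        size_t n = 1 << 24;
        double *x = malloc(n * sizeof(double));

        #pragma omp parallel
        {
            /* Hard affinity: pin this thread to one core so the
               scheduler cannot move it.  The thread->core mapping
               here is illustrative only. */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            sched_setaffinity(0, sizeof(set), &set);  /* 0 = this thread */

            /* First touch: each thread writes its own slice, so the
               backing pages fault onto that thread's local NUMA node.
               Later compute loops must use the same static partitioning
               or the locality is lost again. */
            #pragma omp for schedule(static)
            for (size_t i = 0; i < n; i++)
                x[i] = 0.0;
        }

        free(x);
        return 0;
    }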

We are finding the same is true for BAIJ, i.e. the hot spot in the loop
is the cache misses incurred by the indirect addressing into the vector
(although blocking helps a lot with this). The threading works fine so
long as first touch is done within PETSc and hard affinity is used.
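For reference, a schematic CSR-style kernel (not PETSc's actual MatMult
code; the array names just follow the usual CSR convention) showing
where that indirect access sits:

    /* Schematic CSR sparse mat-vec, y = A*x.  ai/aj/aa are the usual
       row pointers, column indices and values.  The load x[aj[j]] is
       the indirect access whose cache misses dominate the profile once
       page placement and affinity are right; BAIJ helps because a
       block of rows reuses each fetched piece of x. */
    void spmv(int m, const int *ai, const int *aj, const double *aa,
              const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int j = ai[i]; j < ai[i + 1]; j++)
                sum += aa[j] * x[aj[j]];   /* indirect, cache-unfriendly */
            y[i] = sum;
        }
    }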

I'm not at all keen on using libnuma directly for migrating pages and
the like. For a start, I don't think the profiling results justify the
effort and additional code complexity once you have explicitly set the
data locality using first touch and affinity. And even if we argue that
migrating pages is a good thing, we then have to ask at what level it
should be implemented. For example, there is already a lot of work
under way to support dynamic page migration in the Linux kernel (e.g.
AutoNUMA, http://lwn.net/Articles/488709/).
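For context, this is roughly what explicit page migration through
libnuma looks like. This is only a sketch of the mechanism I am arguing
against, not code we propose to add; the buffer, the target node and
the missing error handling are all illustrative:

    #include <numa.h>       /* link with -lnuma */
    #include <stdlib.h>
    #include <unistd.h>

    /* Migrate the pages backing buf to target_node.  Passing NULL
       instead of the nodes array turns this into a pure query:
       status[] then reports the node each page currently lives on. */
    int migrate_to_node(void *buf, size_t len, int target_node)
    {
        size_t psz = (size_t)sysconf(_SC_PAGESIZE);
        unsigned long npages = (len + psz - 1) / psz;
        void **pages = malloc(npages * sizeof(void *));
        int *nodes   = malloc(npages * sizeof(int));
        int *status  = malloc(npages * sizeof(int));

        for (unsigned long i = 0; i < npages; i++) {
            pages[i] = (char *)buf + i * psz;
            nodes[i] = target_node;
        }
        int rc = numa_move_pages(0, npages, pages, nodes, status, 0);
        free(pages); free(nodes); free(status);
        return rc;
    }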

Cheers
Gerard

Jed Brown emailed the following on 26/10/12 02:47:
> In general, getting good performance with threads requires a
> "friendly" environment. Many other systems use the "interleave all"
> approach where they view DRAM as being uniform access and pay the
> price for lack of locality. This is more forgiving (for both the
> programmer and the execution environment), but we do not believe it
> has long-term relevance. Instead, our model retains the semantic
> information about locality and, we believe, will support better
> implementations in the long term. Writing these implementations has
> not been fast so far, due to limited time and due to fine-tuning of
> the threading model.
>
> One challenge is that systems try to abstract physical memory to the
> point where portable code cannot determine where memory is mapped.
> This causes unpredictable performance. On Linux, we have libnuma,
> which gives us this access, but so far, we have not been using libnuma
> explicitly within PETSc. (This is partially because it's an extra
> dependency that is only sometimes available, so everything has to also
> "work" without it; partly because Shri does most local development on
> a Mac.) In any case, having the program automatically debug quirks in
> the execution environment as well as internal misuse in newly-written
> kernels is quite hard.
>
> We need to make this "defensive code" better, but if you are
> experimenting with the new threading interfaces, it's important to be
> familiar with NUMA tools like numastat and procfs, profiling tools,
> and debugging tools. Email trial-and-error performance analysis on
> unknown NUMA systems is an extremely time-consuming process. If at all
> possible, I encourage you to use these tools to figure out what is
> happening with threads spawned by libraries, determining where the
> physical pages have gone, etc. If you identify strange behavior due to
> an "unfriendly" environment, I encourage you to think about how that
> environment can be detected automatically.
>
> On Thu, Oct 25, 2012 at 8:24 PM, Shri <abhyshr at mcs.anl.gov> wrote:
>
>     John, Dave,
>        As you've found out through your experimentation with PETSc's
>     threading interface, there are still a lot of improvements we
>     need to make before the threading support is stable. Your input
>     certainly helps us improve the functionality. I'll look at the
>     logs you've sent, now that I have some time, and get back to you.
>
>



