[petsc-dev] OpenMP/Vec

Jed Brown jedbrown at mcs.anl.gov
Sun Feb 26 14:32:36 CST 2012


On Sun, Feb 26, 2012 at 04:07, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:
>
>
> I agree that different data sizes might require different approaches.
> One might consider this as part of an autotuning framework for PETSc.
>
> The cost of spawning threads is generally minimised through the use of
> thread pools, as typically used by OpenMP - i.e. you only pay a one-time
> cost for forking and joining threads. However, even with a pool there
> are still some overheads (e.g. scheduling chunks) which will affect you
> for small data sizes. I have not measured this myself (appending to the
> todo list) but it is frequently discussed, e.g.
>
> http://software.intel.com/en-us/articles/performance-obstacles-for-threading-how-do-they-affect-openmp-code/
> http://www2.fz-juelich.de/jsc/datapool/scalasca/scalasca_patterns-1.3.html


Any threading model of this sort should be implemented with thread pools,
but that's not what I'm concerned about. Just opening and closing a
parallel region costs a significant amount: with two threads, I see it
taking 1000 to 2000 cycles, which is a sizable fraction of a microsecond.
Note that 1 microsecond is a typical network latency, so this is a
meaningful quantity. If I use OpenMP for AXPY, the crossover point seems
to be around 3000 double-precision elements, which is a typical subdomain
size after one coarsening of a 100k-dof 3D subdomain using smoothed
aggregation multigrid. That is smack in the middle of typical subdomain
sizes for implicit methods, so in a very common scenario, it doesn't make
sense to use OpenMP threads for vector operations on any level except the
finest. Note that this is still without NUMA effects, which tend to make
threads worse unless allocation can be done so that there is no
contention.
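
As a rough illustration, a microbenchmark along the following lines
exposes that fork/join cost (the repetition count and the trivial atomic
update, included only so the region cannot be optimized away, are
arbitrary choices, not part of my measurement):

#include <omp.h>
#include <stdio.h>

int main(void)
{
  /* Crude microbenchmark: time repeatedly opening and closing an
     (almost) empty parallel region.  Results vary with the OpenMP
     runtime, thread count, and pinning. */
  const int reps = 100000;
  int sink = 0;
  double t0 = omp_get_wtime();
  for (int i = 0; i < reps; i++) {
#pragma omp parallel
    {
#pragma omp atomic
      sink++;  /* trivial work so the region is not elided */
    }
  }
  double t1 = omp_get_wtime();
  printf("%.0f ns per fork/join with %d threads (sink=%d)\n",
         1e9*(t1 - t0)/reps, omp_get_max_threads(), sink);
  return 0;
}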


> I think you mean thread pools, as are used by OpenMP. The same thing is
> done for pthreads (e.g.
> http://www.hlnum.org/english/projects/tools/threadpool/doc.html) and
> others.
>
>

No, I mean

#include <omp.h>

int main(int argc, char **argv)
{
#pragma omp parallel
  {
    // your entire program is threaded, most allocations are independent
    // control-flow is computed redundantly, reductions/synchronization
    // are explicit
  }
  return 0;
}

OpenMP isn't currently very well equipped to be used this way, but I think
this is a better model for fine-grained parallelism than making the
program serial by default and opening a parallel region whenever you
decide there is enough work. The current solution is more pragmatic, but I
don't see it scaling nearly as well.
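
To make the model concrete, here is a minimal sketch (the block
partitioning and the explicit reduction are illustrative, not a proposed
API):

#include <omp.h>
#include <stdio.h>

#define N 10000

int main(void)
{
  static double x[N];
  double sum = 0.0;
#pragma omp parallel
  {
    /* Control flow runs redundantly on every thread; each thread
       owns a contiguous slice of x and first-touches its own pages. */
    int tid = omp_get_thread_num();
    int nt  = omp_get_num_threads();
    int lo  = (int)((long long)N * tid / nt);
    int hi  = (int)((long long)N * (tid + 1) / nt);
    double local = 0.0;

    for (int i = lo; i < hi; i++) x[i] = 1.0;  /* independent initialization */
#pragma omp barrier                            /* explicit synchronization */
    for (int i = lo; i < hi; i++) local += x[i];
#pragma omp atomic
    sum += local;                              /* explicit reduction */
  }
  printf("sum = %g\n", sum);
  return 0;
}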


> We are using static schedules. This means that the chunk size =
> array_length/nthreads. Therefore, we can have bad page/thread locality
> at the start (i.e. malloc may have returned a pointer into the middle of
> a page that has already been faulted, and that page is not necessarily
> on the same memory node where thread 0 is located), and wherever chunk
> boundaries don't align with page boundaries and the successive thread
> ids are on different memory nodes. I've attached a figure to fill in
> deficiencies in my explanation - it is based on an Intel Westmere with
> two sockets (and two memory nodes), 6 cores per socket, an array of
> 10000 doubles, and page sizes of 4096 bytes.
>

Thanks, that's all I was asking. If we could do independent allocation (as
in my extreme example above) or even simple padding, we could ensure that
each thread actually gets a local page.
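
As a sketch of what such padding could look like (the helper name
alloc_padded is hypothetical and a 4096-byte page size is assumed):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: allocate nthreads chunks of n doubles each,
   rounding every chunk up to a whole number of 4096-byte pages so
   that no page straddles two threads and first-touch can place each
   chunk on its owner's memory node. */
double *alloc_padded(int nthreads, size_t n, size_t *stride)
{
  const size_t page   = 4096;
  size_t       bytes  = n * sizeof(double);
  size_t       padded = (bytes + page - 1) / page * page; /* round up to pages */
  void        *p      = NULL;

  if (posix_memalign(&p, page, (size_t)nthreads * padded)) return NULL;
  *stride = padded / sizeof(double); /* thread t's chunk starts at t*(*stride) */
  return (double *)p;
}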


> You can control the page fault at the start of the array by replacing
> malloc with posix_memalign, where the alignment is the page size. For
> the pages that straddle chunks that have been allocated to threads on
> different sockets...you'd have to use gaps in your arrays or something
> similar to resolve this. I would do the first of these because it's
> easy. I don't know an easy way to implement the second, so I'd be
> inclined to ignore that inefficiency unless profiling indicates it
> cannot be ignored.
>

You can change the default alignment (used inside PetscMallocAlign()) by
configuring --with-memalign=4096 (pull petsc-dev to not error for such huge
values). Note that this will also use a huge alignment for strings and
structures, so it'll be quite wasteful. But it's a cheap way to experiment.

> It's on Barry's favourite collaborative software development site of
> course ;-)
>
> https://bitbucket.org/wence/petsc-dev-omp/overview


Aha, thanks.