[petsc-dev] OpenMP/Vec

Jed Brown jedbrown at mcs.anl.gov
Sun Feb 26 18:39:03 CST 2012


On Sun, Feb 26, 2012 at 04:07, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:

> > Did you post a repository yet? I'd like to have a look at the code.
>
> It's on Barry's favourite collaborative software development site of
> course ;-)
>
> https://bitbucket.org/wence/petsc-dev-omp/overview
>

I looked through the code and I'm concerned that all the OpenMP code is
inlined into vec/impls/seq and mat/impls/aij/seq with, as far as I can
tell, no way to use OpenMP for some objects in a simulation but not for
others. I think all the pragmas should have num_threads(x->nthreads)
clauses. We can compute the correct number of threads based on the object's
size when its memory is allocated (or have it specified through
command-line options, inherited from related objects, etc.).
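
For concreteness, a minimal sketch of how such a per-object thread count
might be chosen at allocation time (the function name, the grain-size
heuristic, and the nthreads field are assumptions for illustration, not
existing code; the usual PETSc types are assumed):

/* Illustrative only: pick how many threads an object should use, based on
   its local size, so that short vectors stay serial. */
static int VecChooseNThreads(PetscInt n, PetscInt grain, int maxthreads)
{
  PetscInt want = n/grain;              /* roughly one thread per 'grain' entries */
  if (want <= 1)         return 1;      /* too small to be worth forking */
  if (want > maxthreads) return maxthreads;
  return (int)want;
}

At allocation time one would then set something like
x->nthreads = VecChooseNThreads(n, grain, omp_get_max_threads()), possibly
overridden by a command-line option, and every pragma would carry
num_threads(x->nthreads).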

I don't think we can avoid the OpenMP scheduling overhead (several hundred
cycles) even for those objects we choose not to use threads for, but (at
least in my gcc tests) that overhead is only about a third of the cost of
actually starting a parallel region.
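
A throwaway micro-benchmark along these lines can be used to check those
numbers on a particular machine (results vary widely with compiler, OpenMP
runtime, and core count; this is not part of any proposal):

#include <omp.h>
#include <stdio.h>

int main(void)
{
  const int reps = 100000;
  double t0, t_region, t_for;
  int r;

  /* cost of forking/joining an empty parallel region */
  t0 = omp_get_wtime();
  for (r = 0; r < reps; r++) {
    #pragma omp parallel
    { /* empty */ }
  }
  t_region = omp_get_wtime() - t0;

  /* cost of the worksharing/schedule plus implicit barrier inside an
     already-open parallel region */
  t0 = omp_get_wtime();
  #pragma omp parallel
  {
    int rr, i;
    for (rr = 0; rr < reps; rr++) {
      #pragma omp for schedule(static)
      for (i = 0; i < 1; i++) { /* empty */ }
    }
  }
  t_for = omp_get_wtime() - t0;

  printf("parallel region: %.0f ns/call   omp for: %.0f ns/call\n",
         1e9*t_region/reps, 1e9*t_for/reps);
  return 0;
}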

It's really not acceptable to insert unguarded

#pragma omp ...

into the code because this will generate tons of warnings or errors with
compilers that don't know about OpenMP. It would be better to test for
_Pragma and use

#define PetscPragmatize(x) _Pragma(#x)
#if defined(PETSC_USE_OPENMP)
#  define PetscPragmaOMP(x) PetscPragmatize(omp x)
#else
#  define PetscPragmaOMP(x)
#endif

then use

PetscPragmaOMP(parallel for ...)
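
so that a guarded kernel would look something like this (an illustrative
sketch assuming the usual PETSc types, not the actual Vec code):

/* Compiles to a plain serial loop when PETSC_USE_OPENMP is not defined,
   with no pragma warnings from compilers that lack OpenMP. */
static void ScaleArray(PetscInt n, PetscScalar alpha, PetscScalar *x)
{
  PetscInt i;
  PetscPragmaOMP(parallel for schedule(static))
  for (i = 0; i < n; i++) x[i] *= alpha;
}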

We should probably use a variant for object-based threading

#define PetscPragmaOMPObject(obj,x) \
  PetscPragmaOMP(x num_threads((obj)->nthreads))

In the case of multiple objects, I think you usually want the object being
modified to control the number of threads.
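
For illustration (the struct layout and kernel are hypothetical, not the
existing PETSc data structures), the modified vector y would supply the
thread count:

typedef struct { PetscInt n, nthreads; PetscScalar *array; } VecOMP; /* hypothetical */

static void AXPY_OMP(VecOMP *y, PetscScalar alpha, const VecOMP *x)
{
  PetscInt i;
  PetscPragmaOMPObject(y, parallel for schedule(static))
  for (i = 0; i < y->n; i++) y->array[i] += alpha*x->array[i];
}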

In many cases, I would prefer more control over the partition of the loop.
For example, I'd often be willing to tolerate a slight computational
imbalance between threads in exchange for each thread working exclusively
within its own memory pages. Note that the arithmetic to compute such a
partition is orders of magnitude cheaper than the schedule/distribution to
threads. I don't know how to do that except to

PetscPragmaOMP(parallel) {
  int nthreads = omp_get_num_threads();
  int tnum = omp_get_thread_num();
  int start,end;
  // compute start and end
  for (int i=start; i<end; i++) {
    // the work
  }
}

We could perhaps capture some of this common logic in a macro:

#define VecOMPParallelBegin(X,args) do { \
  PetscPragmaOMPObject(X,parallel args) { \
  PetscInt _start, _end; \
  VecOMPGetThreadLocalPart(X,&_start,&_end); \
  { do {} while(0)

#define VecOMPParallelEnd() }}} while(0)

VecOMPParallelBegin(X, shared/private ...);
{
  PetscInt i;
  for (i=_start; i<_end; i++) {
    // the work
  }
}
VecOMPParallelEnd();
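
The helper assumed by these macros could be as simple as the following
sketch (contiguous partition only; VecOMPGetThreadLocalPart is the name used
in the macro above, everything else, including the VecOMP struct from the
earlier sketch, is illustrative):

#include <omp.h>

/* Sketch only: give thread t a contiguous slice of [0, X->n), using the
   object's chosen nthreads. */
static void VecOMPGetThreadLocalPart(VecOMP *X, PetscInt *start, PetscInt *end)
{
  PetscInt t        = omp_get_thread_num();
  PetscInt nthreads = X->nthreads;
  PetscInt chunk    = X->n/nthreads, rem = X->n % nthreads;

  if (t >= nthreads) { *start = *end = 0; return; } /* defensive: oversized team */
  *start = t*chunk + (t < rem ? t : rem);
  *end   = *start + chunk + (t < rem ? 1 : 0);
  /* A page-aligned variant would round the interior boundaries to multiples
     of PAGE_SIZE/sizeof(PetscScalar), accepting a slight load imbalance. */
}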

That should reasonably give us complete run-time control of the number of
parallel threads per object and their distribution, within the constraints
of a contiguous thread partition. That also leaves open the possibility of
using libnuma to query and migrate pages. For example, a short vector that
needs to be accessed from multiple NUMA nodes might intentionally be
faulted with its pages spread apart, even though other vectors of similar
size might be accessed from within one NUMA node and thus not use threads
at all. (One 4 KiB page is only 512 doubles, but if the memory is local to
a single NUMA node, we wouldn't use threads until the vector length was 4
to 8 times larger.)
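
As a sketch of the query side (assuming libnuma/move_pages(2) is available;
error handling omitted and the function name is made up), something like
this reports which NUMA node currently backs each page of an array, which
is the information needed to decide whether to spread pages or keep the
vector serial:

#include <numaif.h>   /* move_pages(); link with -lnuma */
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* With nodes == NULL, move_pages() moves nothing and simply writes the node
   currently backing each page into status[]. */
static void ReportPageNodes(double *array, size_t n)
{
  size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
  char  *base     = (char*)((uintptr_t)array & ~(uintptr_t)(pagesize - 1));
  size_t npages   = ((char*)(array + n) - base + pagesize - 1)/pagesize;
  void **pages    = malloc(npages*sizeof(void*));
  int   *status   = malloc(npages*sizeof(int));
  size_t p;

  for (p = 0; p < npages; p++) pages[p] = base + p*pagesize;
  move_pages(0, npages, pages, NULL, status, 0);
  for (p = 0; p < npages; p++) printf("page %zu: node %d\n", p, status[p]);
  free(pages); free(status);
}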

