[petsc-dev] OpenMP/Vec

Lawrence Mitchell lawrence.mitchell at ed.ac.uk
Tue Feb 28 10:41:02 CST 2012


Dear all,

I've been doing some of the programming associated with the work 
Gerard's describing, so here's an attempt at a response, with a 
summary of where we'd go next.

On 28/02/12 04:39, Jed Brown wrote:
> On Mon, Feb 27, 2012 at 16:31, Gerard Gorman <g.gorman at imperial.ac.uk
> <mailto:g.gorman at imperial.ac.uk>> wrote:
>
>     I had a quick go at trying to get some sensible benchmarks for this but
>     I was getting too much system noise. I am particularly interested in
>     seeing if the overhead goes to zero if num_threads(1) is used.
>
> What timing method did you use? I did not see overhead going to zero
> when num_threads goes to 1 when using GCC compilers, but Intel seems to
> do fairly well.

It's possible that, if one asks the gcc folk nicely enough, the 
overhead might go to zero.  Alternatively, if it turns out that the 
overheads are unavoidable, and we've gone down the macro-ification 
route anyway (see below), we could explicitly make a runtime decision 
based on numthreads in PETSc.  So that:

PetscOMPParallelFor(..., numthreads)
for ( ... )
     ;

would turn into

if ( numthreads == 1 )
     for ( ... )
         ;
else
#pragma omp parallel for
     for ( ... )
         ;

The branch should hopefully be cheaper than the thread sync overhead.
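For concreteness, here is roughly how such a macro could look.  This is 
only a sketch, not proposed PETSc code: to get an explicit branch out of 
a single macro the loop has to be passed in as an argument so that it 
can be emitted on both sides of the branch (OpenMP's if() clause would 
be the other obvious way to push the decision to runtime), and 
PetscOMPParallelFor here is a made-up name.

/* Sketch only: 'loop' is a complete for statement, duplicated so that
   the serial case never touches the OpenMP runtime.  The loop index of
   the parallelised for is made private by the directive itself. */
#define PetscOMPParallelFor(nthreads, loop)  \
   do {                                      \
     if ((nthreads) == 1) {                  \
       loop;                                 \
     } else {                                \
       _Pragma("omp parallel for")           \
       loop;                                 \
     }                                       \
   } while (0)

used as

PetscOMPParallelFor(numthreads,
     for (i = 0; i < n; i++) x[i] = PetscConj(x[i]));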

>     I'm surprised by this. I'm not aware of any compiler that doesn't
>     have OpenMP support - and if you do not actually enable OpenMP,
>     compilers generally just ignore the pragma. Do you know of any
>     compiler without OpenMP support that will complain?
>
>
> Sean points out that omp.h might not be available, but that misses the
> point. As far as I know, recent mainstream compilers have enough sense
> to at least ignore these directives, but I'm sure there are still cases
> where it would be an issue. More importantly, #pragma was a misfeature
> that should never be used now that _Pragma() exists. The latter is
> better not just because it can be turned off, but because it can be
> manipulated using macros and can be explicitly compiled out.
>
>     This may not be flexible enough. You frequently want to have a parallel
>     region, and then have multiple omp for's within that one region.
>
>
> PetscPragmaOMPObject(obj, parallel)
> {
> PetscPragmaOMP(whatever you normally write for this loop)
> for (....) { }
> ...
> and so on
> }

So this is easy to do as you say; the best way to do it will probably 
become clear when writing the code.
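For what it's worth, the wrapping macros could be built on _Pragma along 
these lines.  This is only a sketch under the naming used above (none of 
it is existing PETSc API), and it ignores for now how the per-object 
information passed to PetscPragmaOMPObject would actually be used:

#if defined(PETSC_HAVE_OPENMP)
  /* Stringize the clauses and hand them to _Pragma, so the directive
     can be generated, manipulated, or compiled out by the preprocessor. */
  #define PetscStringizePragma(text)         #text
  #define PetscPragmaOMP(clauses)            _Pragma(PetscStringizePragma(omp clauses))
  /* Form the parallel region for an object; obj is where thread count,
     chunking, etc. would eventually be picked up from. */
  #define PetscPragmaOMPObject(obj, clauses) PetscPragmaOMP(clauses)
#else
  #define PetscPragmaOMP(clauses)
  #define PetscPragmaOMPObject(obj, clauses)
#endif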

As far as attaching the threading information to the object goes, it 
would be nice to do so while staying within the sequential 
implementation (rather than copying all the code, as is done for 
pthreads right now).  I'd envisage something like this:

PetscPragmaOMPObject(vec, parallel)
{
    PetscPragmaOMP(for...)
    for ( i = PetscVecLowerBound(vec); i < PetscVecUpperBound(vec); i++ )
        ...;
}

Where one has:

#if defined(PETSC_HAVE_OPENMP)
#define PetscVecLowerBound(vec) compute_lower_bound
#define PetscVecUpperBound(vec) compute_upper_bound
#else
#define PetscVecLowerBound(vec) 0
#define PetscVecUpperBound(vec) vec->map->n
#endif

Or is this too ugly for words?

Computing the lower and upper bounds could either be done for every 
loop, or, if it's only a property of the vector length, we could stash 
it in some extra slots at creation time.  For a first cut, computation 
of the upper and lower bounds will do exactly what a static schedule 
does (i.e. chunking by size/nthread).
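To make that concrete, a first-cut bound computation might look like the 
following.  This is a sketch only: the helper names are placeholders, 
PetscVecLowerBound/PetscVecUpperBound would expand to these calls with 
n = vec->map->n (or to values stashed on the Vec at creation), and the 
calls must be made from inside the parallel region for 
omp_get_thread_num() to mean anything:

#include <omp.h>

/* Static-schedule style chunking of [0, n): thread t of nt threads gets
   the half-open range [t*chunk, min((t+1)*chunk, n)), chunk = ceil(n/nt). */
static PetscInt VecChunkLowerBound_Private(PetscInt n)
{
   PetscInt nt    = (PetscInt)omp_get_num_threads();
   PetscInt t     = (PetscInt)omp_get_thread_num();
   PetscInt chunk = (n + nt - 1)/nt;
   PetscInt lo    = t*chunk;
   return lo < n ? lo : n;
}

static PetscInt VecChunkUpperBound_Private(PetscInt n)
{
   PetscInt nt    = (PetscInt)omp_get_num_threads();
   PetscInt t     = (PetscInt)omp_get_thread_num();
   PetscInt chunk = (n + nt - 1)/nt;
   PetscInt hi    = (t + 1)*chunk;
   return hi < n ? hi : n;
}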



>     I think what you describe is close to Fig 3 of this paper written by
>     your neighbours:
>     http://greg.bronevetsky.com/papers/2008IWOMP.pdf
>     However, before making the implementation more complex, it would be good
>     to benchmark the current approach and use a tool like likwid to measure
>     the NUMA traffic so we can get a good handle on the costs.
>
>
> Sure.
>
>     Well this is where the implementation details get richer and there are
>     many options - they also become less portable. For example, what does
>     all this mean for the sparc64 processors, which are UMA?
>
>
> Delay to runtime, use an ignorant partition for UMA. (Blue Gene/Q is
> also essentially uniform.) But note that even with uniform memory, cache
> still makes it somewhat hierarchical.
>
>     Not to mention
>     Intel MIC which also supports OpenMP. I guess I am cautious about
>     getting too bogged down with very invasive optimisations until we have
>     benchmarked the basic approach which in a wide range of use cases will
>     achieve good thread/page locality as illustrated previously.
>
>
> I guess I'm just interested in exposing enough semantic information to
> be able to schedule a few different ways using run-time (or, if
> absolutely necessary, configure-time) options. I don't want to have to
> revisit individual loops.

So, to take a single representative kernel, we would move from something 
like:

PetscErrorCode VecConjugate_Seq(Vec xin)
{
   PetscScalar    *x;
   PetscInt       n = xin->map->n;
   PetscInt       i;
   PetscErrorCode ierr;

   PetscFunctionBegin;
   ierr = VecGetArray(xin,&x);CHKERRQ(ierr);
#pragma omp parallel for schedule(static) default(none) private(i) shared(n, x)
   for (i=0; i<n; i++) x[i] = PetscConj(x[i]);
   ierr = VecRestoreArray(xin,&x);CHKERRQ(ierr);
   PetscFunctionReturn(0);
}

To something like this:

PetscErrorCode VecConjugate_Seq(Vec xin)
{
     PetscScalar *x;
     PetscInt i;
     PetscErrorCode ierr;
     PetscFunctionBegin;
     ierr = VecGetArray(xin,&x);CHKERRQ(ierr);
     PetscPragmaOMPObject(xin, parallel private(i) shared(xin, x))
     {
         for (i=PetscVecLowerBound(xin); i<PetscVecUpperBound(xin); i++)
             x[i] = PetscConj(x[i]);
     }
     ierr = VecRestoreArray(xin,&x);CHKERRQ(ierr);
     PetscFunctionReturn(0);
}
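(For clarity: with definitions along the lines sketched above, a build 
without PETSC_HAVE_OPENMP would see PetscPragmaOMPObject expand to 
nothing and the bounds collapse to 0 and xin->map->n, i.e. the body 
reduces to the plain sequential loop

     {
         for (i=0; i<xin->map->n; i++) x[i] = PetscConj(x[i]);
     }

so the Seq implementation is untouched in the serial case.)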

I prefer inlining the loop bounds into macro/function calls at point of 
use, rather than relying on magic names from a macro definition, but I'm 
not completely wedded to the idea.

Lawrence

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



