[petsc-dev] OpenMP/Vec

Jed Brown jedbrown at mcs.anl.gov
Tue Feb 28 14:18:24 CST 2012


On Tue, Feb 28, 2012 at 10:41, Lawrence Mitchell <lawrence.mitchell at ed.ac.uk> wrote:

> It's possible that, if one asks the gcc folk nicely enough, the overhead
> might go to zero.  Alternatively, if it turns out that overheads are
> unavoidable, and we've gone down the macro-ification route anyway (see
> below), we could explicitly make a runtime decision based on numthreads in
> PETSc.  So that:
>
> PetscOMPParallelFor(..., numthreads)
> for ( ... )
>    ;
>
> would turn into
> if ( numthreads == 1 )
>    for ( ... )
>        ;
> else
> #pragma omp parallel for
>    for ( ... )
>        ;
>
> The branch should hopefully be cheaper than the thread sync overhead.
>

The branch is trivially cheap; the growth in binary size (having both a
threaded and a non-threaded version of each loop) is possibly an issue.
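
For concreteness, here is a minimal sketch of such a macro (PetscOMPParallelFor
is the name from the quote above; the expansion strategy, C99 _Pragma, and the
variadic argument are assumptions of this sketch, not an agreed interface). The
textual duplication of the loop across the two branches is exactly the
binary-size growth mentioned above.

/* Hypothetical sketch: run the loop serially when only one thread is
   requested, otherwise hand it to OpenMP.  The loop is passed as the
   variadic argument so that embedded commas do not break the macro. */
#define PetscOMPParallelFor(numthreads, ...) \
  do { \
    if ((numthreads) == 1) { \
      __VA_ARGS__ \
    } else { \
      _Pragma("omp parallel for") \
      __VA_ARGS__ \
    } \
  } while (0)

/* Usage (with i, n, x, numthreads in scope):
     PetscOMPParallelFor(numthreads,
       for (i = 0; i < n; i++) x[i] = PetscConj(x[i]);
     );
 */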


> As far as attaching the threading information to the object goes, it would
> be nice to do so while staying within the sequential implementation (rather
> than copying all the code, as is done for pthreads right now). I'd envisage
> something like this:
>
> PetscPragmaOMPObject(vec, parallel)
> {
>   PetscPragmaOMP(for...)
>   for ( i = PetscVecLowerBound(vec); i < PetscVecUpperBound(vec); i++ )
>       ...;
> }
>
> Where one has:
>
> #if defined(PETSC_HAVE_OPENMP)
> #define PetscVecLowerBound(vec) compute_lower_bound
> #define PetscVecUpperBound(vec) compute_upper_bound
> #else
> #define PetscVecLowerBound(vec) 0
> #define PetscVecUpperBound(vec) vec->map->n
> #endif
>
> Or is this too ugly for words?
>

The reason for my magic was simply to avoid calling omp_get_num_threads()
and omp_get_thread_num() many times to make this determination. Note that
OpenMP implementations outline the parallel region into functions (run by
the threads) that capture the local state and call these query routines to
find their work unit. Thus the generated code is quite similar to PETSc's
pthreads versions.
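
To make that concrete, here is a hedged sketch of what the
compute_lower_bound/compute_upper_bound placeholders above could look like
(the _Sketch names are hypothetical): each thread of the enclosing parallel
region claims a contiguous chunk of [0, n), and every call repeats the two
OpenMP queries, which is what caching the decomposition on the object avoids.

#include <omp.h>
#include <petscsys.h>   /* PetscInt */

/* Hypothetical helpers: contiguous block decomposition of [0, n) among
   the threads of the enclosing OpenMP parallel region. */
static inline PetscInt VecOMPLowerBound_Sketch(PetscInt n)
{
  PetscInt tid = omp_get_thread_num();
  PetscInt nt  = omp_get_num_threads();
  return (tid*n)/nt;
}

static inline PetscInt VecOMPUpperBound_Sketch(PetscInt n)
{
  PetscInt tid = omp_get_thread_num();
  PetscInt nt  = omp_get_num_threads();
  return ((tid+1)*n)/nt;
}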



> So, to take a single representative kernel, we move from something like:
>
> PetscErrorCode VecConjugate_Seq(Vec xin)
> {
>  PetscScalar    *x;
>  PetscInt       n = xin->map->n;
>  PetscInt       i;
>  PetscErrorCode ierr;
>
>  PetscFunctionBegin;
>  ierr = VecGetArray(xin,&x);CHKERRQ(ierr);
> #pragma omp parallel for schedule(static) default(none) private(i) shared(n, x)
>  for (i=0; i<n; i++) x[i] = PetscConj(x[i]);
>  ierr = VecRestoreArray(xin,&x);CHKERRQ(ierr);
>  PetscFunctionReturn(0);
> }
>
> To something like this:
>
> PetscErrorCode VecConjugate_Seq(Vec xin)
> {
>    PetscScalar *x;
>    PetscInt i;
>    PetscErrorCode ierr;
>    PetscFunctionBegin;
>    ierr = VecGetArray(xin,&x);CHKERRQ(ierr);
>    PetscPragmaOMPObject(xin, parallel private(i) shared(xin, x))
>    {
>        for (i=PetscVecLowerBound(xin); i<PetscVecUpperBound(xin); i++) x[i] = PetscConj(x[i]);
>    }
>    ierr = VecRestoreArray(xin,&x);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
> }
>
> I prefer inlining the loop bounds as macro/function calls at the point of
> use, rather than relying on magic names from a macro definition, but I'm
> not completely wedded to the idea.
>

Not a big deal; we could optimize either variant.
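
For reference, here is a sketch of the macro layer the quoted code assumes
(hypothetical definitions, not a committed PETSc API). C99 _Pragma lets the
directive be emitted from a macro and vanish entirely in a non-OpenMP build,
in which case the braced block simply runs serially over the 0..vec->map->n
bounds from the #else branch quoted earlier.

#if defined(PETSC_HAVE_OPENMP)
#define PetscPragmaOMP_(args)           _Pragma(#args)              /* stringize and emit the pragma */
#define PetscPragmaOMP(args)            PetscPragmaOMP_(omp args)   /* -> #pragma omp <args> */
/* The object argument is where per-Vec threading information (for example
   a thread count or precomputed ownership ranges) would be consulted; it
   is unused in this minimal sketch. */
#define PetscPragmaOMPObject(obj, args) PetscPragmaOMP(args)
#else
#define PetscPragmaOMP(args)
#define PetscPragmaOMPObject(obj, args)
#endif

With OpenMP enabled, PetscPragmaOMPObject(xin, parallel private(i) shared(xin, x))
expands to #pragma omp parallel private(i) shared(xin, x) applied to the
following block.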

